Sentiment analysis in Portuguese tweets: an evaluation of diverse word representation models
Language Resources and Evaluation (IF 2.7), Pub Date: 2023-06-28, DOI: 10.1007/s10579-023-09661-4
Daniela Vianna, Fernando Carneiro, Jonnathan Carvalho, Alexandre Plastino, Aline Paes

In recent years, we have seen a steady increase in the number of social networks worldwide. Among them, Twitter has consolidated its position as one of the most influential social platforms, with Brazilian Portuguese speakers holding the fifth position in number of users. Due to the informal linguistic style of tweets, discovering information in such an environment poses a challenge to Natural Language Processing (NLP) tasks such as sentiment analysis. In this work, we frame sentiment analysis as a binary (positive and negative) and multiclass (positive, negative, and neutral) classification task at the level of Portuguese-written tweets. Following a feature extraction approach, embeddings are first gathered for a tweet and then given as input to a classifier. This study was designed to evaluate the effectiveness of different word representations, from the original pre-trained language model to continued pre-training strategies, in improving the predictive performance of sentiment classification, using three different classifier algorithms and eight Portuguese tweet datasets. Because no language model specific to Brazilian Portuguese tweets exists, we expanded our evaluation to consider six different embeddings: fastText, GloVe, Word2Vec, BERT-multilingual (mBERT), BERTweet, and BERTimbau. The experiments showed that embeddings from BERTimbau, a model pre-trained from scratch solely on the target language, Portuguese, outperform both the static representations (fastText, GloVe, and Word2Vec) and the Transformer-based models mBERT and BERTweet. In addition, we show that extracting the contextualized embedding without any adjustment to the pre-trained language model is the best approach for most datasets.
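The feature extraction pipeline the abstract describes — gather a fixed-size embedding for each tweet, then train a separate classifier on those vectors — can be sketched as follows. This is a minimal illustration, not the paper's implementation: a small random embedding table stands in for a pre-trained static model (Word2Vec, GloVe, or fastText), tweet vectors are obtained by averaging token embeddings, and a logistic regression plays the role of the downstream classifier. The vocabulary, dimensions, and toy tweets are all invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a pre-trained static embedding table (Word2Vec/GloVe/fastText).
# In the paper's setting, these vectors would come from a frozen pre-trained model.
rng = np.random.default_rng(42)
vocab = {"bom": 0, "otimo": 1, "ruim": 2, "pessimo": 3, "dia": 4, "filme": 5}
emb = rng.normal(size=(len(vocab), 8))  # 8-dim vectors, for illustration only

def tweet_vector(tokens):
    """Feature extraction: average the embeddings of in-vocabulary tokens."""
    vecs = [emb[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(emb.shape[1])

# Tiny labeled set of tokenized tweets: 1 = positive, 0 = negative.
train = [(["bom", "dia"], 1), (["otimo", "filme"], 1),
         (["ruim", "filme"], 0), (["pessimo", "dia"], 0)]
X = np.stack([tweet_vector(toks) for toks, _ in train])
y = [label for _, label in train]

# The embedding model stays frozen; only this classifier is learned.
clf = LogisticRegression().fit(X, y)
print(clf.predict([tweet_vector(["otimo", "dia"])]))
```

For the contextualized models (BERTimbau, mBERT, BERTweet), `tweet_vector` would instead run the tweet through the frozen Transformer and pool its hidden states, but the two-stage structure — fixed embedding extraction followed by classifier training — is the same.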




Updated: 2023-06-28