Polling Latent Opinions: A Method for Computational Sociolinguistics Using Transformer Language Models

Tweets Topic Classification and Sentiment Analysis based on Transformer-based Language Models

Vietnam Journal of Computer Science

People provide information on their thoughts, perceptions, and activities through a wide range of channels, including social media. The wide acceptance of social media results in a vast volume of valuable data, in a variety of formats and of varying veracity. Analysis of such 'big data' allows organizations and analysts to make better and faster decisions. However, because this data is vast in volume and arrives at high velocity, it cannot be analyzed manually. Automatic quantification of information can also be very challenging because of possible data ambiguity and complexity. To address automatic information extraction, many analytic techniques such as text mining, machine learning, predictive analytics, and diverse natural language processing methods have been proposed in the literature. Recent advances in Natural Language Understanding, more specifically Transformer-based architectures, have shown the ability to solve sequence-to-sequence modeling tasks effectively while handling long-range dependencies efficiently. Building on these advances, in this work we propose applying Transformer-based sequence modeling to short text to perform topic classification and sentiment polarity analysis on user-posted tweets. We investigate whether these Transformer-based architectures can be applied to a large dataset of English text containing a large number of tokens.

Analyzing COVID-19 Tweets with Transformer-based Language Models

arXiv, 2021

This paper describes a method for using Transformer-based Language Models (TLMs) to understand public opinion from social media posts. In this approach, we train a set of GPT models on several COVID-19 tweet corpora that reflect populations of users with distinctive views. We then use prompt-based queries to probe these models to reveal insights into the biases and opinions of the users. We demonstrate how this approach can be used to produce results which resemble polling the public on diverse social, political, and public health issues. The results on the COVID-19 tweet data show that transformer language models are promising tools that can help us understand public opinions on social media at scale.
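The polling idea described in this abstract can be sketched in miniature. In the sketch below, `sample_completions` is a hypothetical stand-in for sampling continuations from a GPT model fine-tuned on one population's tweets, and the tiny seed lexicons are illustrative, not from the paper:

```python
from collections import Counter

# Illustrative seed lexicons (assumptions, not the paper's resources)
POSITIVE = {"safe", "effective", "good", "helpful"}
NEGATIVE = {"dangerous", "useless", "bad", "harmful"}

def sample_completions(prompt, n=100):
    """Hypothetical stand-in: in the paper's setup this would draw n
    sampled completions of `prompt` from a population-tuned GPT model.
    Here it returns canned text so the sketch is self-contained."""
    return ["vaccines are safe and effective",
            "vaccines are dangerous"] * (n // 2)

def poll(prompt, n=100):
    """Tally sentiment-bearing words across sampled completions to
    approximate the modeled population's opinion distribution."""
    tally = Counter()
    for completion in sample_completions(prompt, n):
        words = set(completion.lower().split())
        if words & POSITIVE:
            tally["positive"] += 1
        if words & NEGATIVE:
            tally["negative"] += 1
    return tally
```

Repeating the same poll against models tuned on different populations, as the paper does, would then surface differences in the aggregated tallies.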

Transformer based Contextual Model for Sentiment Analysis of Customer Reviews: A Fine-tuned BERT

International Journal of Advanced Computer Science and Applications, 2021

The Bidirectional Encoder Representations from Transformers (BERT) model is a state-of-the-art language model used for multiple natural language processing tasks and sequential modeling applications. Accurate context-based sentiment analysis of customer review data from various social media platforms is a challenging and time-consuming task due to the high volume of unstructured data. In recent years, more research has been conducted based on the recurrent neural network algorithm, Long Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), as well as hybrid, neural, and traditional text classification algorithms. This paper presents our experimental research work to overcome known challenges of sentiment analysis models, such as performance, accuracy, and context-based prediction. We propose a fine-tuned BERT model to predict customer sentiment from customer reviews on Twitter, IMDB Movie Reviews, Yelp, and Amazon. In addition, we compare the results of the proposed model with our custom Linear Support Vector Machine (LSVM), fastText, BiLSTM, and hybrid fastText-BiLSTM models, and present a comparative analysis dashboard report. The experimental results show that the proposed model performs better than the other models with respect to various performance measures.

Using language models to improve opinion detection

Information Processing and Management, 2018

Opinion mining is one of the most important research tasks in the information retrieval community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach to opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second stage, which itself focuses on detecting opinionated documents. We compare the document to be analyzed with opinionated sources that contain subjective information. We hypothesize that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches treat and choose their opinion sources according to their test collection, then calculate the opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them as a reference collection. We then use language models to determine opinion scores. The document to be analyzed and the reference collection are represented by different language models (i.e., Dirichlet, Jelinek-Mercer, and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries. However, in our study, we modify these language models to represent opinionated documents. We carry out several experiments using the Text REtrieval Conference (TREC) Blog06 collection as our analysis collection and the Internet Movie Database (IMDB), Multi-Perspective Question Answering (MPQA), and CHESLY corpora as our reference collections. To improve opinion detection, we study the impact of using different language models to represent the document and reference collection, alongside different combinations of opinion and retrieval scores, and use this data to deduce the best opinion detection models. Using the best models, our approach improves on the best TREC Blog baseline (baseline4) by 30%.
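The core scoring idea, measuring how likely a document is under a smoothed language model of an opinionated reference collection, can be sketched minimally. The uniform background prior, the value of mu, and the toy texts below are illustrative simplifications, not the paper's exact models or settings:

```python
import math
from collections import Counter

def dirichlet_opinion_score(document, reference_collection, mu=2000.0):
    """Log-likelihood of a document under a Dirichlet-smoothed unigram
    model of an opinionated reference collection, with a uniform prior
    over the reference vocabulary as background. Higher (less negative)
    scores suggest the document is lexically closer to the subjective
    reference texts."""
    ref_counts = Counter(w for text in reference_collection
                         for w in text.lower().split())
    ref_len = sum(ref_counts.values())
    vocab = max(len(ref_counts), 1)
    score = 0.0
    for w in document.lower().split():
        # Dirichlet smoothing: p(w|R) = (c(w,R) + mu/V) / (|R| + mu)
        p = (ref_counts.get(w, 0) + mu / vocab) / (ref_len + mu)
        score += math.log(p)
    return score
```

Because the score is a summed log-likelihood, it is only directly comparable between documents of equal length; a per-word average would be the natural normalization for ranking documents of different lengths.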

Summarizing Public Opinions in Tweets

Journal Proceedings of CICLing, 2012

The objective of sentiment analysis is to identify any clue of positive or negative emotion in a piece of text reflecting the author's opinions on a subject. When performed on large aggregations of user-generated content, sentiment analysis can be helpful in extracting public opinion. We use Twitter for this purpose and build a classifier which classifies a set of tweets. Machine learning techniques are often applied to sentiment classification, which requires a labeled training set of considerable size. We introduce the approach of using words with sentiment value as noisy labels in a distant-supervised learning environment. We created a training set of such tweets and used it to train a Naive Bayes classifier. We test the accuracy of our classifier on a hand-labeled test set. Finally, we check whether applying a combination of a minimum word-frequency threshold and Categorical Proportional Difference as the feature selection method enhances the accuracy.
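The distant-supervision pipeline the abstract describes, seed sentiment words as noisy labels feeding a Naive Bayes classifier, can be sketched as follows. The seed lexicons, Laplace smoothing, and toy tweets are illustrative choices, not the paper's actual resources:

```python
import math
from collections import Counter, defaultdict

# Illustrative seed lexicons used as noisy labelers (assumptions)
POS_SEEDS = {"love", "great", "happy"}
NEG_SEEDS = {"hate", "awful", "sad"}

def noisy_label(tweet):
    """Distant supervision: seed sentiment words act as noisy labels."""
    words = set(tweet.lower().split())
    if words & POS_SEEDS and not words & NEG_SEEDS:
        return "pos"
    if words & NEG_SEEDS and not words & POS_SEEDS:
        return "neg"
    return None  # ambiguous or neutral: excluded from training

def train_nb(tweets):
    """Multinomial Naive Bayes counts gathered from noisily labeled tweets."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for tweet in tweets:
        label = noisy_label(tweet)
        if label is None:
            continue
        class_counts[label] += 1
        word_counts[label].update(tweet.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(tweet, word_counts, class_counts, vocab):
    """Pick the class maximizing the Laplace-smoothed log posterior."""
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tweet.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The key property the paper exploits is that no manual annotation is needed for training; only the held-out evaluation set is hand-labeled.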

Twitter Sentiment Analysis

arXiv (Cornell University), 2015

With the booming of microblogs on the Web, people have begun to express their opinions on a wide variety of topics on Twitter and other similar services. Sentiment analysis on entities (e.g., products, organizations, people, etc.) in tweets (posts on Twitter) thus becomes a rapid and effective way of gauging public opinion for business marketing or social studies. However, Twitter's unique characteristics give rise to new problems for current sentiment analysis methods, which originally focused on large opinionated corpora such as product reviews. In this paper, we propose a new entity-level sentiment analysis method for Twitter. The method first adopts a lexicon-based approach to perform entity-level sentiment analysis. This method can give high precision, but low recall. To improve recall, additional tweets that are likely to be opinionated are identified automatically by exploiting the information in the result of the lexicon-based method. A classifier is then trained to assign polarities to the entities in the newly identified tweets. Instead of being labeled manually, the training examples are given by the lexicon-based approach. Experimental results show that the proposed method dramatically improves the recall and the F-score, and outperforms the state-of-the-art baselines.
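The first, high-precision stage of this kind of pipeline, entity-level lexicon scoring, can be illustrated with a minimal sketch. The lexicons, the fixed context window, and the exact-match entity detection are illustrative simplifications, not the paper's method:

```python
# Illustrative opinion lexicons (assumptions, not the paper's resources)
POSITIVE = {"great", "love", "amazing"}
NEGATIVE = {"terrible", "hate", "slow"}

def entity_sentiment(tweet, entity, window=3):
    """Entity-level lexicon scoring: only opinion words within a small
    window around the entity mention contribute to its polarity, so two
    entities in one tweet can receive different labels."""
    words = tweet.lower().split()
    score = 0
    for i, w in enumerate(words):
        if w == entity.lower():
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for ctx in words[lo:hi]:
                if ctx in POSITIVE:
                    score += 1
                elif ctx in NEGATIVE:
                    score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, in "i love my iphone but the battery is terrible", the mention of "iphone" scores positive while "battery" scores negative, which is the entity-level behavior a tweet-level classifier cannot provide.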

From newspaper to microblogging: What does it take to find opinions?

2013

We compare the performance of two lexicon-based sentiment systems, SentiStrength (Thelwall et al., 2012) and SO-CAL (Taboada et al., 2011), on two genres: newspaper text and tweets. While SentiStrength has been geared specifically toward short social-media text, SO-CAL was built for general, longer text. After the initial comparison, we successively enrich the SO-CAL-based analysis with tweet-specific mechanisms and observe that in some cases this improves the performance. A qualitative error analysis then identifies classes of typical problems the two systems have with tweets.

Opinion Mining Using Population-tuned Generative Language Models

arXiv (Cornell University), 2023

We present a novel method for mining opinions from text collections using generative language models trained on data collected from different populations. We describe the basic definitions, methodology and a generic algorithm for opinion insight mining. We demonstrate the performance of our method in an experiment where a pre-trained generative model is fine-tuned using specifically tailored content with unnatural and fully annotated opinions. We show that our approach can learn and transfer the opinions to the semantic classes while maintaining the proportion of polarisation. Finally, we demonstrate the usage of an insight mining system to scale up the discovery of opinion insights from a real text corpus.

Learning Sentiment Based Ranked-Lexicons for Opinion Retrieval

ECIR2015, 2015

In contrast to classic search, where users look for factual information, opinion retrieval aims at finding and ranking subjective information. A major challenge of opinion retrieval is the informal nature of user reviews and the domain-specific jargon used to describe the targeted item. In this paper, we present an automatic method to learn a space model for opinion retrieval. Our approach is a generative model that learns sentiment word distributions by embedding multi-level relevance judgments in the estimation of the model parameters. In addition to sentiment word distributions, we also infer domain-specific named entities that, due to their popularity, become a sentiment reference in their domain (e.g., the name of a movie, "Batman", or specific hotel items, "carpet"). This contrasts with previous approaches that learn a word's polarity or aspect-based polarity. Opinion retrieval experiments were done on two large datasets with over 703,000 movie reviews and 189,000 hotel reviews. The proposed method achieved performance better than, or equal to, the benchmark baselines.