Related papers
Topic modeling Twitter data using Latent Dirichlet Allocation and Latent Semantic Analysis
The industrial world has entered the era of Industrial Revolution 4.0. In this era, the community urgently requires data to support service policies. For this reason, the Surabaya Government created Media Center Surabaya, which is used to accommodate the aspirations of Surabaya citizens. Citizens can access this media through Twitter. The topics discussed on Twitter are important information that can be used to improve the performance of Surabaya Government services. Twitter data is text data that consists of thousands of variables. Text mining, including topic modeling and sentiment analysis, is frequently used to analyze this kind of data. This study focuses on topic modeling using Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). The performance of the algorithms is evaluated using topic coherence. As unstructured data, Twitter data needs preprocessing before analysis. The preprocessing stages include cleansing, stemming, and stop-word removal. LSA has the advantage of being fast and easy to implement. On the other hand, LSA does not consider the relationship between documents in the corpus, while LDA does. This study shows that LDA gives a better result than LSA.
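The abstract above does not say which coherence measure was used. As an illustration only, the following is a minimal sketch of a UMass-style coherence score, assuming tokenized documents and a topic given as its top words ordered by descending weight. The function name `umass_coherence`, the smoothing value `eps`, and the toy corpus are all hypothetical and do not come from the paper.

```python
import math

def umass_coherence(topic_words, documents, eps=1.0):
    """UMass-style coherence for one topic (an assumed measure, not the paper's).

    topic_words: top words of the topic, ordered by descending weight.
    documents:   list of tokenized documents (lists of strings).
    Sums log((D(w_i, w_j) + eps) / D(w_j)) over every pair where w_j
    ranks above w_i; D counts the documents containing all given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            score += math.log((doc_freq(w_i, w_j) + eps) / d_j)
    return score

# Toy corpus (hypothetical), already cleansed, stemmed, and stop-word filtered.
docs = [
    ["road", "flood", "repair"],
    ["road", "repair", "traffic"],
    ["flood", "rain", "road"],
    ["school", "teacher", "exam"],
]
coherent_topic = ["road", "repair", "flood"]  # words that co-occur often
mixed_topic = ["road", "exam", "rain"]        # words that rarely co-occur

print(umass_coherence(coherent_topic, docs))
print(umass_coherence(mixed_topic, docs))
```

Higher (less negative) scores indicate that a topic's top words tend to appear in the same documents, which is the sense in which one model's topics can be called more coherent than another's.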
A Survey of Topic Modeling in Text Mining
www.ijacsa.thesai.org
Topic models provide a convenient way to analyze large volumes of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modeling. The first concerns topic modeling methods, of which four are considered: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). The second category is called topic evolution models, which model topics by considering an
Topic Modeling Using Latent Dirichlet allocation
ACM Computing Surveys, 2022
We cannot deal with a mammoth text corpus without summarizing it into a relatively small subset. A computational tool is badly needed to understand such a gigantic pool of text. Probabilistic topic modeling discovers and explains an enormous collection of documents by reducing it to a topical subspace. In this work, we study the background and advancement of topic modeling techniques. We first introduce the preliminaries of topic modeling techniques and review their extensions and variations, such as topic modeling over various domains, hierarchical topic modeling, word-embedded topic models, and topic models from a multilingual perspective. In addition, research on topic modeling in distributed environments and topic visualization approaches is explored. We also briefly cover implementation and evaluation techniques for topic models. Comparison matrices are shown over the experimental results of the various categories of topic modeling....
Topic Modeling: A Comprehensive Review
ICST Transactions on Scalable Information Systems
Topic modelling is the new revolution in text mining. It is a statistical technique for revealing the underlying semantic structure in large collections of documents. After analysing approximately 300 research articles on topic modelling, this paper presents a comprehensive survey of the field. It covers the classification hierarchy, topic modelling methods, posterior inference techniques, different evolution models of latent Dirichlet allocation (LDA), and its applications in different areas of technology, including scientific literature, bioinformatics, software engineering, and social network analysis. Quantitative evaluation of topic modelling techniques is also presented in detail to aid understanding of the concept. The paper concludes with a detailed discussion of the challenges of topic modelling, which will give researchers insight for good research.
A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts
Frontiers in Sociology, 2022
The richness of social media data has opened a new avenue for social science research to gain insights into human behaviors and experiences. In particular, emerging data-driven approaches relying on topic models provide entirely new perspectives on interpreting social phenomena. However, the short, text-heavy, and unstructured nature of social media content often leads to methodological challenges in both data collection and analysis. In order to bridge the developing field of computational science and empirical social research, this study aims to evaluate the performance of four topic modeling techniques, namely latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), Top2Vec, and BERTopic. In view of the interplay between human relations and digital media, this research takes Twitter posts as the reference point and assesses the performance of the different algorithms with respect to their strengths and weaknesses in a social science context. Based on certain details during the analytical procedures and on quality issues, this research sheds light on the efficacy of using BERTopic and NMF to analyze Twitter data.
A scoping review of topic modelling on online data
The Indonesian Journal of Electrical Engineering and Computer Science (IJEECS), 2023
With the increasing prevalence of unstructured online data (e.g., social media, online forums), mining such data is important because it provides a genuine view of public opinion. Owing to this significant advantage, topic modelling has become more important than ever. Topic modelling is a natural language processing (NLP) technique that mainly reveals relevant topics hidden in text corpora. This paper aims to review recent research trends in topic modelling and the state-of-the-art techniques used when dealing with online data. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology was used in this scoping review. The study covered recent research works published from 2020 to 2022. We constructed 5 research questions of interest to many researchers. The 36 relevant papers revealed that more work on non-English languages is needed; that common pre-processing techniques, e.g., stop word removal, were applied to all datasets regardless of language; that latent Dirichlet allocation (LDA) is the most used modelling technique and also one of the best performing; and that the produced results are most often evaluated using topic coherence. In conclusion, topic modelling has largely benefited from LDA; thus, it is interesting to see whether this trend continues in the future across languages.
IEEE Access
Topic modelling is important for tackling several data mining tasks in information retrieval. While seminal topic modelling techniques such as Latent Dirichlet Allocation (LDA) have been proposed, the ubiquity of social media and the brevity of its texts pose unique challenges for such traditional topic modelling techniques. Several extensions, including auxiliary aggregation, self aggregation, and direct learning, have been proposed to mitigate these challenges; however, some still remain. These include a lack of consistency in the topics generated and a decline in model performance in applications involving disparate document lengths. There is a recent paradigm shift towards neural topic models, which are not suited for resource-constrained environments. This paper revisits LDA-style techniques, taking a theoretical approach to analyse the relationship between word co-occurrence and topic models. Our analysis shows that by altering the word co-occurrences within the corpus, topic discovery can be enhanced. Thus we propose a novel data transformation approach, dubbed DATM, to improve topic discovery within a corpus. A rigorous empirical evaluation shows that DATM is not only powerful, but can also be used in conjunction with existing benchmark techniques to significantly improve their effectiveness and their consistency by up to 2-fold. Index terms: document transformation, greedy algorithm, information retrieval, latent Dirichlet allocation, multi-set multi-cover problem, probabilistic generative topic modelling.
Search and classify topics in a corpus of text using the latent dirichlet allocation model
Indonesian Journal of Electrical Engineering and Computer Science
This work aims at discovering topics in a text corpus and classifying the most relevant terms for each of the discovered topics. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of the model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used for the analysis and structure of the subjects. After processing, 12 themes were generated, which allowed ranking the most relevant terms to identify the skills of each of the candidates. This work concludes that candidates interested in data science must have skills in the following topics: first, they must be technical, with mastery of structured query language, mastery of programming languages such as R, Python, Java, and data management, among...
Tiket.com is a company that provides online ticket booking services in Indonesia. Tiketcom wants to improve its services by learning which content is widely discussed by the public and which positive and negative comments are made about Tiketcom. Therefore, an analysis is carried out on Twitter data using the Latent Dirichlet Allocation (LDA) method, which aims to find patterns in documents that raise various topics from text data, together with sentiment analysis to identify positive and negative comments about Tiketcom. The data consist of tweets and retweets from Twitter users directed at the Tiketcom account from 17 November 2018 to 4 March 2019. Twenty topics were obtained from the text data, and the 5 topics with the highest coherence values were taken to form the topic model. The LDA analysis found that the 5 widely discussed topics concerned promo discount tickets provided by Tiketcom. In the sentiment analysis, 21.1% of tweets were negative, mostly discussing disruption to ticket reservations and 15.4% posit...