Exploring Information Retrieval by Latent Semantic Indexing and Latent Dirichlet Allocation Techniques
Related papers
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A large amount of digital text is generated every day. Effectively searching, managing, and exploring these text data has become a central task. In this paper, we first present an introduction to text mining and the probabilistic topic model Latent Dirichlet Allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-based solution for searching, exploring, and recommending articles. The latter builds a user topic model, providing a full analysis of Twitter users' interests. The experimental process, including data collection, data pre-processing, and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
A Semantics-enhanced Topic Modelling Technique: Semantic-LDA
2024
Topic modelling is a beneficial technique used to discover latent topics in text collections. But to correctly understand the text content and generate a meaningful topic list, semantics are important. By ignoring semantics, that is, not attempting to grasp the meaning of the words, most of the existing topic modelling approaches can generate some meaningless topic words. Even existing semantic-based approaches usually interpret the meanings of words without considering the context and related words. In this article, we introduce a semantic-based topic model called semantic-LDA that captures the semantics of words in a text collection using concepts from an external ontology. A new method is introduced to identify and quantify the concept-word relationships based on matching words from the input text collection with concepts from an ontology without using pre-calculated values from the ontology that quantify the relationships between the words and concepts. These pre-calculated values may not reflect the actual relationships between words and concepts for the input collection, because they are derived from datasets used to build the ontology rather than from the input collection itself. Instead, quantifying the relationship based on the word distribution in the input collection is more realistic and beneficial in the semantic capture process. Furthermore, an ambiguity handling mechanism is introduced to interpret the unmatched words, that is, words for which there are no matching concepts in the ontology. Thus, this article makes a significant contribution by introducing a semantic-based topic model that calculates the word-concept relationships directly from the input text collection. The proposed semantic-based topic model and an enhanced version with the disambiguation mechanism were evaluated against a set of state-of-the-art systems, and our approaches outperformed the baseline systems in both topic quality and information filtering evaluations. 
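The core idea of quantifying concept-word relationships directly from the input collection's word distribution can be illustrated with a toy example. Everything below is hypothetical: the dict-based "ontology", the corpus, and the weighting scheme are invented stand-ins, not the actual Semantic-LDA method.

```python
# Illustrative sketch only: weight each (concept, word) pair by the word's
# share of the frequency mass of all words matched to that concept in the
# input collection, rather than using pre-calculated ontology values.
from collections import Counter

ontology = {  # concept -> words that lexicalize it (hypothetical)
    "animal": {"cat", "dog", "bird"},
    "finance": {"stock", "share", "market"},
}

docs = [
    ["cat", "dog", "cat"],
    ["stock", "market", "share", "stock"],
]

# Count every word across the whole collection.
freq = Counter(w for doc in docs for w in doc)

concept_word_weight = {}
for concept, words in ontology.items():
    matched = {w: freq[w] for w in words if freq[w] > 0}
    total = sum(matched.values())
    for w, c in matched.items():
        concept_word_weight[(concept, w)] = c / total

print(concept_word_weight[("animal", "cat")])  # "cat" carries 2/3 of the "animal" mass
```

Words with no matching concept ("unmatched words") would then need the separate disambiguation mechanism the abstract mentions.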
CCS Concepts: • Applied computing → Document management and text processing; • Computing methodologies → Information extraction;
Document Indexing by Latent Dirichlet Allocation
An automatic document indexing method with probabilistic concept search is presented. The proposed method utilizes Latent Dirichlet Allocation (LDA), a generative model for document modeling and classification. Ad hoc applications of LDA to document indexing, or variants with smoothing techniques as suggested by previous studies in LDA-based language modeling, lead to empirically weak approaches: they can yield unsatisfactory performance because the terms in documents may not properly reflect the concept space. In this study, we introduce a new definition of document probability vectors in the context of LDA and present a scheme for automatic document indexing based on it. The results of our computational experiments on a benchmark data set indicate that the proposed approach is a viable option for document indexing. A small illustrative example is also included.
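The general shape of LDA-based indexing can be sketched as follows. This is a hedged illustration, not the paper's specific definition of document probability vectors: documents are indexed by their topic-probability vectors, a query is folded in with the same transform, and cosine similarity in topic space does the retrieval. Corpus and parameters are invented.

```python
# Illustrative sketch: index documents by LDA topic-probability vectors
# and rank them against a query by cosine similarity in topic space.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "patients received the new drug in a clinical trial",
    "the clinical study measured drug dosage and side effects",
    "the football team won the championship game",
    "fans celebrated the team victory after the game",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

index = lda.transform(X)  # each row: one document's topic-probability vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Fold the query in exactly like a document, then rank by cosine.
query_vec = lda.transform(vectorizer.transform(["drug trial results"]))[0]
scores = [cosine(query_vec, d) for d in index]
best = int(np.argmax(scores))
print(docs[best])
```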
Search and classify topics in a corpus of text using the latent Dirichlet allocation model
Indonesian Journal of Electrical Engineering and Computer Science
This work aims to discover topics in a text corpus and classify the most relevant terms for each discovered topic. The process was performed in four steps: first, document extraction and data processing; second, labeling and training of the data; third, labeling of the unseen data; and fourth, evaluation of the model performance. For processing, a total of 10,322 "curriculum" documents related to data science were collected from the web during 2018-2022. The latent Dirichlet allocation (LDA) model was used for the analysis and structuring of the subjects. After processing, 12 themes were generated, which allowed ranking the most relevant terms to identify the skills of each candidate. This work concludes that candidates interested in data science must have skills in the following topics: first, they must be technical, with mastery of structured query language, of programming languages such as R, Python, and Java, and of data management, among...
A Survey of Topic Modeling in Text Mining
www.ijacsa.thesai.org, 2015
Abstract—Topic models provide a convenient way to analyze large volumes of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modeling. The first discusses methods of topic modeling, four of which fall under this category: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). The second category, called topic evolution models, models topics by considering an
Indexing by Latent Semantic Analysis
Journal of The American Society for Information Science and Technology, 1990
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
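The scheme the abstract describes maps directly onto a truncated SVD. The sketch below uses a toy four-document corpus and 2 factors in place of the ca. 100 used in the paper; the threshold value is invented for illustration.

```python
# Illustrative sketch of LSA retrieval: SVD-reduce the term-document matrix,
# fold a query in as a pseudo-document, and return supra-threshold cosines.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # document-term count matrix
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)             # documents as 2-dim factor-weight vectors

# Fold the query in as a pseudo-document: project its term vector the same way.
query = svd.transform(vectorizer.transform(["computer system user survey"]))[0]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

sims = [cosine(query, d) for d in doc_vecs]
# Return documents whose cosine exceeds a threshold (0.5 here is arbitrary).
hits = [docs[i] for i, s in enumerate(sims) if s > 0.5]
print(hits)
```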
Topic modeling Twitter data using Latent Dirichlet Allocation and Latent Semantic Analysis
THE 2ND INTERNATIONAL CONFERENCE ON SCIENCE, MATHEMATICS, ENVIRONMENT, AND EDUCATION
The industrial world has entered the era of Industry 4.0, in which there is an urgent need for data from the community to support service policies. For this reason, the Surabaya Government created Media Center Surabaya, a channel that accommodates the aspirations of Surabaya citizens and can be accessed through Twitter. The topics discussed on Twitter are important information: they can be used to improve the performance of Surabaya Government services. Twitter data is text data consisting of thousands of variables, and text mining, including topic modeling and sentiment analysis, is frequently used to analyze this kind of data. This study performs topic modeling using the Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) algorithms, evaluating their performance with topic coherence. As unstructured data, the Twitter data needs preprocessing before analysis; the stages include cleansing, stemming, and stop-word removal. The advantages of LSA are that it is fast and easy to implement; on the other hand, it does not consider the relationships between documents in the corpus, while LDA does. This study shows that LDA gives a better result than LSA.
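The three preprocessing stages the abstract lists can be sketched in plain Python. The stop-word list, the example tweet, and the crude suffix-stripping "stemmer" below are invented stand-ins; a real pipeline for Indonesian tweets would use a proper stemmer such as Sastrawi.

```python
# Illustrative preprocessing pipeline: cleansing, stop-word removal, stemming.
import re

STOPWORDS = {"the", "is", "a", "to", "and", "of"}  # tiny hypothetical list

def cleanse(tweet: str) -> str:
    """Strip URLs, @mentions, hashtag marks, and non-letters; lowercase."""
    tweet = re.sub(r"https?://\S+|@\w+|#", " ", tweet)
    return re.sub(r"[^a-zA-Z\s]", " ", tweet).lower()

def stem(word: str) -> str:
    """Toy English suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tweet: str) -> list[str]:
    tokens = cleanse(tweet).split()
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Reporting the broken roads to @SapawargaSby! https://t.co/x"))
# -> ['report', 'broken', 'road']
```

The resulting token lists are what would be fed to the LDA and LSA models for coherence comparison.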
Topic Modeling Using Latent Dirichlet allocation
ACM Computing Surveys, 2022
We cannot deal with a mammoth text corpus without summarizing it into a relatively small subset; a computational tool is sorely needed to understand such a gigantic pool of text. Probabilistic topic modeling discovers and explains an enormous collection of documents by reducing it to a topical subspace. In this work, we study the background and advancement of topic modeling techniques. We first introduce the preliminaries of topic modeling and review its extensions and variations, such as topic modeling over various domains, hierarchical topic modeling, word-embedded topic models, and topic models from a multilingual perspective. In addition, research on topic modeling in distributed environments and on topic visualization approaches has also been explored. We also cover implementation and evaluation techniques for topic models in brief. Comparison matrices are shown over the experimental results of the various categories of topic modeling....
Topic Modeling: A Comprehensive Review
ICST Transactions on Scalable Information Systems
Topic modelling is the new revolution in text mining: a statistical technique for revealing the underlying semantic structure in a large collection of documents. After analysing approximately 300 research articles on topic modelling, a comprehensive survey is presented in this paper. It includes a classification hierarchy, topic modelling methods, posterior inference techniques, different evolution models of latent Dirichlet allocation (LDA), and its applications in different areas of technology, including scientific literature, bioinformatics, software engineering, and social network analysis. Quantitative evaluation of topic modelling techniques is also presented in detail for a better understanding of the concept of topic modelling. The paper concludes with a detailed discussion of the challenges of topic modelling, which will give researchers insight for further research.
Building of Informatics, Technology and Science (BITS)
Nowadays, online news portals are quite easy to access; they are electronic media that report events that have occurred or are happening, and news about Telkom University is readily available through them. A system has been designed that is capable of modeling Telkom University news topics. Modeling news topics is interesting research material because each individual understands the topics contained in the news differently; topic modeling is therefore needed to find out what topics the news about Telkom University covers. In this study, a Latent Semantic Analysis (LSA) model has been designed to carry out a topic modeling process that aims to make it easier for readers to understand news topics related to Telkom University. Latent Semantic Analysis (LSA) is a mathematical method for finding hidden topics by analyzing the struc...