Monitoring geometrical properties of word embeddings for detecting the emergence of new topics
Related papers
Modeling Evolution of Topics in Large-Scale Temporal Text Corpora
Proceedings of the International AAAI Conference on Web and Social Media
Large text temporal collections provide insights into social and cultural change over time. To quantify changes in topics in these corpora, embedding methods have been used as a diachronic tool. However, they have limited utility for modeling changes in topics due to the stochastic nature of training. We propose a new computational approach for tracking and detecting temporal evolution of topics in a large collection of texts. This approach for identifying dynamic topics and modeling their evolution combines the advantages of two methods: (1) word embeddings to learn contextual semantic representation of words from temporal snapshots of the data and (2) dynamic network analysis to identify dynamic topics by using dynamic semantic similarity networks developed using embedding models. Experimenting with two large temporal data sets from the legal and real estate domains, we show that this approach performs faster (due to parallelizing different snapshots), uncovers more coherent topic...
A Word Embedding Topic Model for Robust Inference of Topics and Visualization
Probabilistic topic models for semantic visualization are useful for discovering and visualizing latent topics in document collections. In these models, the inference of topics and visualization is largely based on word co-occurrences within documents. Therefore, when documents in a corpus are short in length, these models may not achieve good results due to the sparsity of word co-occurrences. In this paper, we propose a word embedding topic model (WTM) that is robust to data sparsity when detecting topics and generating visualization of short texts. Extensive experiments conducted on four real-world datasets show that WTM is more effective in dealing with short texts than state-of-the-art models.
Topic Modelling with Word Embeddings
This work aims at evaluating and comparing two different frameworks for the unsupervised topic modelling of the CompWHoB Corpus, namely our political-linguistic dataset. The first approach applies Latent Dirichlet Allocation (henceforth LDA), whose evaluation serves as the baseline for comparison. The second framework employs the Word2Vec technique to learn the word vector representations later used to topic-model our data. Compared to the previously defined LDA baseline, results show that the use of Word2Vec word embeddings significantly improves topic modelling performance, but only when an accurate and task-oriented linguistic pre-processing step is carried out.
Temporal Analysis on Topics Using Word2Vec
2022
The present study proposes a novel method of trend detection and visualization, more specifically, modeling the change in a topic over time. Where current models used for the identification and visualization of trends only convey the popularity of a single word based on stochastic counting of usage, the approach in the present study illustrates the popularity and direction that a topic is moving in. The direction in this case is a distinct subtopic within the selected corpus. Such trends are generated by modeling the movement of a topic by using k-means clustering and cosine similarity to group the distances between clusters over time. In a convergent scenario, it can be inferred that the topics as a whole are meshing (tokens between topics are becoming interchangeable). On the contrary, a divergent scenario would imply that each topic's respective tokens would not be found in the same context (the words are increasingly different from each other). The methodology was tested on a group of articles from various media houses present in the 20 Newsgroups dataset.
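The convergent and divergent scenarios described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the k-means step has already assigned word vectors to two clusters in each time snapshot, and it labels the trend by comparing centroid cosine similarity in the first and last snapshots only. The function names (`centroid`, `topic_trend`) and the toy 2-D vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors (one cluster's members)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def topic_trend(snapshots):
    """snapshots: list of (cluster_a_vectors, cluster_b_vectors), one per time step.

    Assumption: cluster memberships come from a clustering step (for
    example k-means) that is not shown here.
    """
    sims = [cosine_similarity(centroid(a), centroid(b)) for a, b in snapshots]
    if sims[-1] > sims[0]:
        label = "convergent"   # centroids moved closer together
    elif sims[-1] < sims[0]:
        label = "divergent"    # centroids moved apart
    else:
        label = "stable"
    return sims, label


# Hypothetical toy data: two clusters observed at two time steps.
snapshots = [
    ([[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]),
    ([[1.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 1.0]]),
]
sims, label = topic_trend(snapshots)
print(sims, label)
```

In this toy example the centroids start orthogonal and end closer together, so the trend is labelled convergent. A real analysis would likely examine the whole similarity series rather than only its endpoints.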
Big Data and Cognitive Computing
The study of the dynamics or the progress of science has been widely explored with descriptive and statistical analyses. This study has also attracted several computational approaches, labelled together as the Computational History of Science, especially with the rise of data science and the development of increasingly powerful computers. Among these approaches, some works have studied dynamism in scientific literature by employing text analysis techniques that rely on topic models to study the dynamics of research topics. Unlike topic models, which do not delve deeper into the content of scientific publications, this paper is the first to use temporal word embeddings to automatically track the dynamics of scientific keywords over time. To this end, we propose Vec2Dynamics, a neural-based computational history approach that reports stability of k-nearest neighbors of scientific keywords over time; the stability indicates whether the keywords are taking new neighborhood du...
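One way to picture "stability of k-nearest neighbors over time" is the overlap between a keyword's neighbor sets in two consecutive periods. The sketch below uses Jaccard overlap, which is an assumption made for illustration; the abstract does not state which stability measure Vec2Dynamics uses. The embeddings are hypothetical 2-D toy vectors keyed by word, and the helper names are invented for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbors(word, embeddings, k):
    """Return the set of the k words most similar to `word`."""
    target = embeddings[word]
    scored = [
        (cosine(target, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return {other for _, other in scored[:k]}


def neighbor_stability(word, emb_before, emb_after, k):
    """Jaccard overlap of the k-NN sets in two periods (1.0 = unchanged)."""
    before = nearest_neighbors(word, emb_before, k)
    after = nearest_neighbors(word, emb_after, k)
    return len(before & after) / len(before | after)


# Hypothetical embeddings for two periods; only "network" moves.
emb_period_1 = {
    "network": [1.0, 0.0],
    "neural": [0.9, 0.1],
    "graph": [0.8, 0.3],
    "protein": [0.0, 1.0],
    "kernel": [0.2, 0.9],
}
emb_period_2 = dict(emb_period_1, network=[0.6, 0.8])

stability = neighbor_stability("network", emb_period_1, emb_period_2, k=2)
print(stability)
```

A low value suggests the keyword has acquired a new neighborhood, while a value near 1.0 suggests its context has not changed between the two periods.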
Modeling Text using the Continuous Space Topic Model with Pre-Trained Word Embeddings
2021
In this study, we propose a model that extends the continuous space topic model (CSTM), which flexibly controls word probability in a document, using pre-trained word embeddings. To develop the proposed model, we pre-train word embeddings, which capture the semantics of words and plug them into the CSTM. Intrinsic experimental results show that the proposed model exhibits a superior performance over the CSTM in terms of perplexity and convergence speed. Furthermore, extrinsic experimental results show that the proposed model is useful for a document classification task when compared with the baseline model. We qualitatively show that the latent coordinates obtained by training the proposed model are better than those of the baseline model.
Leap2Trend: A Temporal Word Embedding Approach for Instant Detection of Emerging Scientific Trends
IEEE Access
Early detection of emerging research trends could potentially revolutionise the way research is done. For this reason, trend analysis has become an area of paramount importance in academia and industry. This is due to the significant implications for research funding and public policy. The literature presents several emerging approaches to detecting new research trends. Most of these approaches rely mainly on citation counting. While citations have been widely used as indicators of emerging research topics, they suffer from some limitations. For instance, citations can take months to years to progress and then to reveal trends. Furthermore, they fail to dig into paper content. To overcome this problem, we introduce Leap2Trend, a novel approach to instant detection of research trends. Leap2Trend relies on temporal word embeddings (word2vec) to track the dynamics of similarities between pairs of keywords, their rankings and respective uprankings (ascents) over time. We applied Leap2Trend to two scientific corpora on different research areas, namely computer science and bioinformatics, and we evaluated it against two gold standards, Google Trends hits and Google Scholar citations. The obtained results reveal the effectiveness of our approach to detect trends with more than 80% accuracy and 90% precision in some cases. Such significant findings evidence the utility of our Leap2Trend approach for tracking and detecting emerging research trends instantly.
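The idea of tracking rankings and uprankings (ascents) of keyword pairs can be sketched as follows. This is an illustrative reading of the abstract, not the published Leap2Trend algorithm: it assumes the pairwise similarities for each time window have already been computed (for instance from word2vec models trained per window), and the similarity values and keyword pairs below are made up.

```python
def rank_pairs(similarities):
    """Map each keyword pair to its rank, with rank 1 for the most similar pair."""
    ordered = sorted(similarities, key=similarities.get, reverse=True)
    return {pair: position for position, pair in enumerate(ordered, start=1)}


def rank_ascents(sims_before, sims_after):
    """Positive values mean the pair climbed the ranking between windows."""
    before = rank_pairs(sims_before)
    after = rank_pairs(sims_after)
    return {pair: before[pair] - after[pair] for pair in after if pair in before}


# Hypothetical pairwise similarities in two consecutive time windows.
sims_window_1 = {
    ("deep", "learning"): 0.90,
    ("support", "vector"): 0.80,
    ("graph", "neural"): 0.30,
}
sims_window_2 = {
    ("deep", "learning"): 0.92,
    ("support", "vector"): 0.50,
    ("graph", "neural"): 0.70,
}

ascents = rank_ascents(sims_window_1, sims_window_2)
print(ascents)
```

Here the pair ("graph", "neural") climbs one position, which under this reading would make it a candidate emerging trend, while ("support", "vector") drops one position.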
Diffusion-based Temporal Word Embeddings
2021
Semantics in natural language processing is largely dependent on contextual relationships between words and entities in documents. The context of a word may evolve. For example, the word “apple” currently has two contexts – a fruit and a technology company. The changes in the context of entities in biomedical publications can help us understand the evolution of a disease and relevant scientific interventions. In this work, we present a new diffusion-based temporal word embedding model that can capture short and long-term changes in the semantics of biomedical entities. Our model captures how the context of each entity shifts over time. Existing dynamic word embeddings capture semantic evolution at a discrete/granular level, aiming to study how a language developed over a long period. Our approach provides smooth embeddings suitable for studying short as well as long-term changes. For the evaluation of the proposed model, we track the semantic evolution of entities in abstracts of bi...
Leveraging Social Context for Topic Evolution
21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Topic discovery and evolution (TDE) is a problem that has attracted long-standing interest in the research community. The goal in topic discovery is to identify groups of keywords from large corpora so that the information in those corpora is summarized succinctly. The nature of text corpora has changed dramatically in the past few years with the advent of social media. Social media services allow users to constantly share, follow and comment on posts from other users. Hence, such services have given a new dimension to the traditional text corpus: today's corpora have a social context embedded in them in terms of the community of users interested in a particular post, their profiles, etc. We wish to harness this social context that comes along with the textual content for TDE. In particular, our goal is to analyze, both qualitatively and quantitatively, when social context actually helps with TDE. Methodologically, we approach the problem of TDE by proposing a non-negative matrix factorization (NMF) based model that incorporates both the textual information and social context information. We perform experiments on a large-scale real-world dataset of news articles, and use Twitter as the platform providing information about the social context of these news articles. We compare with and outperform several state-of-the-art baselines. Our conclusion is that using the social context information is most useful when faced with topics that are particularly difficult to detect.
2020
Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into 'topics' that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformer-based model BERT. This comparative work is showcased through an analysis of a corpus of US news c...