Graph-based Techniques for Topic Classification of Tweets in Spanish
Related papers
A linguistic approach for determining the topics of Spanish Twitter messages
Journal of Information Science, 2014
The vast number of opinions and reviews provided in Twitter is helpful in order to make interesting findings about a given industry, but given the huge number of messages published every day, it is important to detect the relevant ones. In this respect, the Twitter search functionality is not a practical tool when we want to poll messages dealing with a given set of general topics. This article presents an approach to classify Twitter messages into various topics. We tackle the problem from a linguistic angle, taking into account part-of-speech, syntactic and semantic information, showing how language processing techniques should be adapted to deal with the informal language present in Twitter messages. The TASS 2013 General corpus, a collection of tweets that has been specifically annotated to perform text analytics tasks, is used as the dataset in our evaluation framework. We carry out a wide range of experiments to determine which kinds of linguistic information have the greatest...
State-of-the-art approaches to cross-source topic classification (TC) of Tweets rely on building a supervised machine learning classifier on Social Knowledge Sources (KSs) (such as DBpedia and Freebase) for detecting topics of Tweets. These approaches typically employ various lexical, syntactic or semantic features derived from the content of these documents or Tweets, often ignoring other indicators that point to external data sources (e.g. URLs), which can provide additional background information for cross-source TC. To address these limitations, in this paper we analyse various such indicators and evaluate their impact on cross-source TC. Our experiments, evaluating the proposed TC in the context of Violence Detection (VD) and Emergency Response (ER) tasks, indicate that Twitter-specific indicators carry valuable information, and thus that incorporating them into a topic classifier can improve performance over previous approaches that do not consider them.
Graph vs. bag representation models for the topic classification of web documents
World Wide Web, 2015
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs and graph similarities are employed to position and classify documents into the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
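The n-gram graph idea described in this abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's implementation: the function names, the `n` and `window` parameters, merging class graphs by summing edge weights, and the use of a simple edge-containment ratio as the graph similarity.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of undirected edges.

    Nodes are character n-grams; an edge links two n-grams that start
    within `window` positions of each other, weighted by how often
    that co-occurrence is seen.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


def class_graph(texts, n=3, window=3):
    """Merge document graphs into one class graph (assumption: sum the weights)."""
    merged = Counter()
    for text in texts:
        merged.update(ngram_graph(text, n, window))
    return merged


def containment_similarity(doc_graph, cls_graph):
    """Fraction of the document's edges that also occur in the class graph."""
    if not doc_graph:
        return 0.0
    shared = sum(1 for edge in doc_graph if edge in cls_graph)
    return shared / len(doc_graph)


# Toy usage: one similarity value per class becomes the document's feature vector.
sports = class_graph(["el partido de futbol", "gol del equipo"])
politics = class_graph(["el gobierno y el congreso", "ley del parlamento"])
doc = ngram_graph("partido de futbol hoy")
features = [containment_similarity(doc, sports),
            containment_similarity(doc, politics)]
```

Note that the feature vector has one entry per class rather than one per vocabulary term, which is the dimensionality reduction the abstract refers to.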
Detect Text Topics by Semantics Graphs
Document topic analysis benefits from bridging the word embedding process with the capacity of graphs to connect the dots and represent complex correlations between entities. In this study we examine the processes of building a semantic graph model, finding document topics and validating topic discovery. We introduce a novel Word2Vec2Graph model that is built on top of the Word2Vec word embedding model. We demonstrate how this model can be used to analyze long documents and uncover document topics as graph clusters. To validate the topic discovery method, we transform words to vectors and vectors to images and use deep learning image classification.
2017
Topic identification, as a specific case of text classification, is one of the primary steps toward knowledge extraction from raw textual data. In such tasks, words are treated as a set of features. Because of the high dimensionality and sparseness of the feature vectors that result from traditional feature selection methods, most text classification methods proposed for this purpose lack performance and accuracy. These problems are even more pronounced for tweets, which contain only a limited number of words. To alleviate such issues, we propose a new topic identification method for Spanish tweets based on the deep representation of Spanish words. In the proposed method, words are represented as multi-dimensional vectors; that is, each word is replaced with its equivalent vector, which is calculated from some transformation of the raw text data. An average aggregation technique is used to transform the word vectors into a tweet representation. Our...
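The average aggregation step mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions: the `tweet_vector` helper, the toy two-dimensional `embeddings` table and the choice to skip out-of-vocabulary tokens are all hypothetical, and the abstract does not say how the word vectors themselves are trained.

```python
def tweet_vector(tokens, embeddings, dim):
    """Average the vectors of the known tokens into one tweet representation.

    Tokens missing from `embeddings` are skipped (an assumption); a tweet
    with no known tokens maps to the zero vector.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


# Hypothetical toy embeddings; real ones would come from a trained model.
embeddings = {"gol": [1.0, 0.0], "equipo": [0.0, 1.0]}
vector = tweet_vector(["gol", "equipo", "xyz"], embeddings, dim=2)
```

The resulting fixed-length vector can then be passed to any standard classifier, regardless of how many words the tweet contains.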
A Graph Analytical Approach for Topic Detection
ACM Transactions on Internet Technology, 2013
Topic detection with large and noisy data collections such as social media must address both scalability and accuracy challenges. KeyGraph is an efficient method that improves on current solutions by considering keyword co-occurrence. We show that KeyGraph has similar accuracy when compared to state-of-the-art approaches on small, well-annotated collections, and it can successfully filter irrelevant documents and identify events in large and noisy social media collections. An extensive evaluation using Amazon’s Mechanical Turk demonstrated the increased accuracy and high precision of KeyGraph, as well as superior runtime performance compared to other solutions.
Short text classification in twitter to improve information filtering
2010
In micro-blogging services such as Twitter, the users may get overwhelmed by the raw data. One solution to this problem is the classification of Twitter messages (tweets). As short texts like tweets do not provide sufficient word occurrences, classification methods that use traditional approaches such as "Bag-Of-Words" have limitations. To address this problem, we propose to use a small set of domain-specific features extracted from the author's profile and text. The proposed approach effectively classifies the text to a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages.
Topic Classification for Short Texts
Lecture notes in information systems and organisation, 2023
In the context of TV and social media surveillance, constructing models to automate topic identification of short texts is a key task. This paper constructs models worth considering for practical usage, employing a Top-K multinomial classification methodology. We describe the full data processing pipeline, discussing dataset selection, text preprocessing, feature extraction, model selection and learning, including hyperparameter optimization. We test and compare popular methods, including standard machine learning, deep learning, and a fine-tuned BERT for topic classification.
Classification of text data from the social network Twitter
Proceedings of International conference Information Technology and Nanotechnology (ITNT-2016), 2016
Social networks play an important role in the modern world, and it is important to identify the popular topics discussed on them. This article deals with data collection from the social network Twitter, and further clustering and classification of the collected data.
Classification Method for Shared Information on Twitter Without Text Data
Proceedings of the 24th International Conference on World Wide Web - WWW '15 Companion, 2015
During a disaster, appropriate information must be collected. For example, victims and survivors require information about shelter locations and dangerous points or advice about protecting themselves. Rescuers need information about the details of volunteer activities and supplies, especially potential shortages. However, collecting such localized information is difficult from such mass media as TV and newspapers because they generally focus on information aimed at the general public. On the other hand, social media can attract more attention than mass media under these circumstances since they can provide such localized information. In this paper, we focus on Twitter, one of the most influential social media, as a source of local information. By assuming that users who retweet the same tweet are interested in the same topic, we can classify tweets that are required by users with similar interests based on retweets. Thus, we propose a novel tweet classification method that focuses on retweets without text mining. We linked tweets based on retweets to make a retweet network that connects similar tweets and extracted clusters that contain similar tweets from the constructed network by our clustering method. We also subjectively verified the validity of our proposed classification method. Our experiment verified that the ratio of the clusters whose tweets are mutually similar in the cluster to all clusters is very high and the similarities in each cluster are obvious. Finally, we calculated the linguistic similarities of the results to clarify our proposed method's features. Our method classified topic-similar tweets, even if they are not linguistically similar.
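The retweet-network idea in this abstract can be sketched roughly as below. The abstract does not describe the authors' clustering method, so this sketch swaps in plain connected components as a stand-in; the function names, the `(user, tweet_id)` input format and the `min_shared` threshold are likewise assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_retweet_network(retweets, min_shared=1):
    """Link two tweets when at least `min_shared` users retweeted both.

    `retweets` is an iterable of (user, tweet_id) pairs; no tweet text is used.
    Returns an adjacency dict mapping each tweet to its neighbouring tweets.
    """
    tweets_by_user = defaultdict(set)
    for user, tweet in retweets:
        tweets_by_user[user].add(tweet)

    shared_users = defaultdict(int)
    for tweets in tweets_by_user.values():
        for a, b in combinations(sorted(tweets), 2):
            shared_users[(a, b)] += 1

    graph = {t: set() for tweets in tweets_by_user.values() for t in tweets}
    for (a, b), count in shared_users.items():
        if count >= min_shared:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Stand-in clustering: each connected component is one tweet cluster."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        clusters.append(component)
    return clusters


# Toy usage with hypothetical users and tweet ids.
retweets = [("u1", "t1"), ("u1", "t2"), ("u2", "t2"), ("u2", "t3"), ("u3", "t4")]
clusters = connected_components(build_retweet_network(retweets))
```

Raising `min_shared` would require more common retweeters before two tweets are linked, which trades cluster size for cohesion.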