Text classification using graph mining-based feature extraction

Text Classification of English News Articles using Graph Mining Techniques

Proceedings of the 14th International Conference on Agents and Artificial Intelligence, 2022

Natural language processing systems can use several techniques to understand text documents, one of which is text classification. Text classification is a classical problem with applications ranging from automated text classification to sentiment analysis. This research considers a graph mining technique for classifying English news articles. In the proposed model, every text is represented by a graph that encodes the relations among its words. A word's significance to a text is given by the graph-theoretical degree of the corresponding vertex. The proposed weighting scheme captures the links between words that co-occur in a text, producing feature vectors that can improve the classification of English news articles. Experiments were conducted by applying the proposed classification algorithms to well-known text datasets. The findings suggest that, with appropriate parameters, the proposed graph mining technique is as accurate as other techniques.
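The degree-based weighting described in this abstract can be illustrated with a minimal sketch. It assumes a sliding co-occurrence window over whitespace-separated tokens; the window size, the function names, and the sample sentence are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking each word to the words that follow it
    within `window` positions (window size is an assumption)."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:  # ignore self-loops
                graph[word].add(other)
                graph[other].add(word)
    return graph

def degree_weights(tokens, window=2):
    """Weight each word by the degree of its vertex in the graph."""
    graph = cooccurrence_graph(tokens, window)
    return {word: len(neighbours) for word, neighbours in graph.items()}

tokens = "the market fell as the bank raised rates".split()
weights = degree_weights(tokens, window=2)
# Words that co-occur with many distinct words receive larger weights.
```

The resulting dictionary can serve as a sparse feature vector for a document, in place of raw term frequencies.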

Classification of web documents using a graph model

2003

In this paper we describe work relating to classification of web documents using a graph-based model instead of the traditional vector-based model for document representation. We compare the classification accuracy of the vector model approach using the k-Nearest Neighbor (k-NN) algorithm to a novel approach which allows the use of graphs for document representation in the k-NN algorithm. The proposed method is evaluated on three different web document collections using the leave-one-out approach for measuring classification accuracy. The results show that the graph-based k-NN approach can outperform traditional vector-based k-NN methods in terms of both accuracy and execution time.
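A graph-based k-NN classifier of the kind this abstract describes needs a distance between document graphs. The sketch below assumes that each unique word is a node, that consecutive words are joined by a directed edge, and that the distance is one minus the size of the shared subgraph divided by the size of the larger graph. These choices, the function names, and the toy data are assumptions for illustration; the abstract does not specify them.

```python
from collections import Counter

def doc_graph(text):
    """Nodes are unique words; a directed edge joins each word to the next."""
    tokens = text.lower().split()
    return set(tokens), set(zip(tokens, tokens[1:]))

def graph_size(graph):
    nodes, edges = graph
    return len(nodes) + len(edges)

def graph_distance(g1, g2):
    """1 - |shared nodes and edges| / max(|g1|, |g2|) (assumed measure)."""
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    common = len(g1[0] & g2[0]) + len(g1[1] & g2[1])
    return 1.0 - common / largest

def knn_predict(query, training, k=3):
    """Majority vote among the k training documents nearest to the query."""
    q = doc_graph(query)
    ranked = sorted(training, key=lambda item: graph_distance(q, doc_graph(item[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ("stock prices fell sharply today", "finance"),
    ("bank shares and stock prices rose", "finance"),
    ("the team won the final match", "sport"),
    ("the striker scored in the match", "sport"),
]
prediction = knn_predict("stock prices rose today", training, k=1)
```

Because node labels are unique within a document, the shared subgraph reduces to set intersections, which keeps the distance cheap to compute.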

Text Analysis Using Different Graph-Based Representations

Computación y Sistemas, 2018

This paper presents an overview of different graph-based representations proposed to solve text classification tasks. The core of this manuscript is to highlight the importance of enriched/non-enriched co-occurrence graphs as an alternative to traditional feature representation models such as the vector representation, which most of the time cannot capture all the richness of text documents that come from the web (social media, blogs, personal web pages, news, etc.). For each text classification task, the type of graph created and the benefits of using it are presented and discussed. Specifically, the type of features/patterns extracted, the classification/similarity methods implemented, and the results obtained on the datasets are explained. The theoretical and practical implications of using co-occurrence graphs are also discussed, pointing out the contributions and challenges of modeling text documents as graphs.

Model-Based Classification of Web Documents Represented by Graphs

Proc. of WebKDD, 2006

Most web content classification methods are based on the vector-space model of information retrieval. One of the important advantages of this representation model is that it can be used by both instance-based and model-based classifiers for categorization. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrences or the location of a word within the document. It also makes no use of the mark-up information that can easily be extracted from the HTML tags of a web document.

Text Classification Using Graph-Encoded Linguistic Elements

Proceedings of the Eighteenth International Florida Artificial Intelligence Research Society Conference (FLAIRS-2005), 2005

Inspired by the goal of classifying text more accurately, we describe an effort to map tokens and their characteristic linguistic elements into a graph and use that expressive representation to classify text phrases. We outperform the bag-of-words approach by exploiting word order and the semantic and syntactic characteristics within the phrases. In this study, we map tagged corpora into a placeholder graph structure and classify the phrases within, using the cross-dimensional linguistic characteristics of each token. Finally, we present heuristics for use in applying this method to other corpora.

A Text Classification Method Based on Combination of Information Gain and Graph Clustering

International Journal of Information and Communication Technology Research, 2019

Text classification has a wide range of applications, such as spam filtering, automated indexing of scientific articles, identifying the genre of documents, and news monitoring. Text datasets usually contain much irrelevant and noisy information, which ultimately hurts both the efficiency and the cost of their classification. Therefore, for effective text classification, feature selection methods are widely used to handle the high dimensionality of the data. In this paper, a novel feature selection method based on the combination of information gain and the FAST algorithm is proposed. In the proposed method, the information gain is first calculated for the features, and those with higher information gain are selected. The FAST algorithm, which uses graph-theoretic clustering methods, is then applied to the selected features. To evaluate the performance of the proposed method, we carry out experiments on three text datasets and compare our algorithm with several feature selection techniques. The results confirm that the proposed method produces a smaller feature subset in a shorter time. In addition, the evaluation of a k-nearest neighbor classifier on validation data shows that the novel algorithm gives higher classification accuracy.
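The first stage of the method, ranking features by information gain, can be sketched as follows. The sketch treats each document as a set of terms and scores a term by how much knowing its presence reduces class entropy. The FAST clustering stage is not shown, and the function names and toy data are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, labels, term):
    """H(C) minus the entropy of C conditioned on presence/absence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - conditional

def top_terms(docs, labels, k):
    """Keep the k terms with the highest information gain (ties broken alphabetically)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))[:k]

docs = [
    {"cheap", "pills", "offer"},
    {"cheap", "offer", "now"},
    {"meeting", "agenda", "now"},
    {"meeting", "notes"},
]
labels = ["spam", "spam", "ham", "ham"]
selected = top_terms(docs, labels, k=3)
```

In this toy corpus, terms that appear in only one class receive the maximum gain, while "now", which appears equally in both classes, receives none.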

Analysis Using Different Graph-Based Representations

2017

This paper presents an overview of different graph-based representations proposed to solve text classification tasks. The core of this manuscript is to highlight the importance of enriched/non-enriched co-occurrence graphs as an alternative to traditional feature representation models such as the vector representation, which most of the time cannot capture all the richness of text documents that come from the web (social media, blogs, personal web pages, news, etc.). For each text classification task, the type of graph created and the benefits of using it are presented and discussed. Specifically, the type of features/patterns extracted, the classification/similarity methods implemented, and the results obtained on the datasets are explained. The theoretical and practical implications of using co-occurrence graphs are also discussed, pointing out the contributions and challenges of modeling text documents as graphs.

Text classification using Semantic Information and Graph Kernels

2011

The most common approach to the text classification problem is to use a bag-of-words representation of documents to find the classification target function. Linguistic structures such as morphology, syntax, and semantics are completely neglected in the learning process. This paper uses another document representation that, while including the context-independent meaning of each sentence, can be used by a structured kernel function, namely the direct product kernel.
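A direct product kernel compares two labeled graphs by counting the walks they share: it builds a product graph whose nodes are pairs of identically labeled nodes, then sums decayed walk counts over that graph. The sketch below is a simplified illustration that truncates the series at a fixed walk length. The decay factor, the truncation, the function names, and the toy graphs are assumptions, not the paper's setup.

```python
def product_graph(g1, g2):
    """Each graph is (labels, edges): labels maps node -> label,
    edges is a set of directed (source, target) pairs."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    # Pair up nodes that carry the same label.
    nodes = [(u, v) for u in labels1 for v in labels2 if labels1[u] == labels2[v]]
    node_set = set(nodes)
    # An edge exists when both components are edges in their own graphs.
    edges = {
        ((u1, v1), (u2, v2))
        for (u1, u2) in edges1
        for (v1, v2) in edges2
        if (u1, v1) in node_set and (u2, v2) in node_set
    }
    return nodes, edges

def direct_product_kernel(g1, g2, decay=0.5, max_len=4):
    """Sum of decay**n * (number of length-n walks in the product graph),
    truncated at max_len (truncation is an assumption of this sketch)."""
    nodes, edges = product_graph(g1, g2)
    walks = {x: 1 for x in nodes}  # walks of length 0 ending at each node
    total = 0.0
    for n in range(max_len + 1):
        total += (decay ** n) * sum(walks.values())
        extended = {x: 0 for x in nodes}
        for src, dst in edges:
            extended[dst] += walks[src]  # extend every walk by one edge
        walks = extended
    return total

g1 = ({0: "dog", 1: "chases", 2: "cat"}, {(0, 1), (1, 2)})
g2 = ({0: "dog", 1: "chases", 2: "ball"}, {(0, 1), (1, 2)})
similarity = direct_product_kernel(g1, g2)
self_similarity = direct_product_kernel(g1, g1)
```

A graph is more similar to itself than to a graph that shares only part of its structure, which is the property a classifier such as an SVM would exploit.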

Automatic Linguistic Pattern Identification Based on Graph Text Representation

Research in Computing Science

This paper presents a model of text representation based on graphs. The model is applied to the particular case study of authorship attribution. The experiments were performed using a corpus of 500 documents written by 10 different authors (50 documents per author). The results highlight the benefit of using text features at different levels of language description in tasks associated with the automatic processing of information. In particular, we obtained an accuracy of 57% for the authorship attribution task.

An Innovative Graph-Based Approach to Advance Feature Selection from Multiple Textual Documents

IFIP Advances in Information and Communication Technology

This paper introduces a novel graph-based approach to select features from multiple textual documents. The proposed solution enables the investigation of the importance of a term within a whole corpus of documents by utilizing contemporary graph theory methods, such as community detection algorithms and node centrality measures. Compared to well-tried existing solutions, evaluation results show that the proposed approach increases the accuracy of most text classifiers employed and decreases the number of features required to achieve 'state-of-the-art' accuracy. Well-known datasets used for the experiments reported in this paper include 20Newsgroups, LingSpam, Amazon Reviews and Reuters.