An Innovative Graph-Based Approach to Advance Feature Selection from Multiple Textual Documents

Text Classification of English News Articles using Graph Mining Techniques

Proceedings of the 14th International Conference on Agents and Artificial Intelligence, 2022

Several techniques can be used in natural language processing systems to understand text documents, such as text classification. Text classification is considered a classical problem with several applications, ranging from automated document categorization to sentiment analysis. This research considers a graph mining technique for the classification of English news articles. In the proposed model, every text is characterized by a graph that encodes relations among its words, and a word's significance to a text is represented by the graph-theoretic degree of the corresponding vertex. The proposed weighting scheme effectively captures the links between words that co-occur in a text, producing feature vectors that can enhance the classification of English news articles. Experiments were conducted by applying the proposed classification algorithms to well-known text datasets. The findings suggest that, with appropriate parameters, the proposed graph mining technique for text classification is as accurate as other techniques.
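The graph-of-words idea described above can be sketched in a few lines: words become vertices, co-occurrence within a sliding window becomes an edge, and vertex degree serves as the word's weight. This is a minimal illustrative sketch; the window size and tokenization are assumptions, not the paper's actual parameters.

```python
from collections import defaultdict

def graph_of_words(tokens, window=2):
    """Build an undirected co-occurrence graph: vertices are unique
    words, edges link words appearing within `window` positions of
    each other. (Window size 2 is an illustrative choice.)"""
    edges = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                edges[w].add(tokens[j])
                edges[tokens[j]].add(w)
    return edges

def degree_weights(edges):
    """Weight each word by its vertex degree, the graph-theoretic
    notion of word significance the abstract describes."""
    return {w: len(nbrs) for w, nbrs in edges.items()}

tokens = "the market rallied as the market news improved".split()
weights = degree_weights(graph_of_words(tokens))
```

A word that co-occurs with many distinct words ("market" above) receives a higher weight than one seen in a single narrow context ("improved").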

Text classification using graph mining-based feature extraction

2010

A graph-based approach to document classification is described in this paper. The graph representation offers the advantage that it allows for a much more expressive document encoding than the more standard bag of words/phrases approach, and consequently gives an improved classification accuracy.

Cluster analysis in document networks

Data Mining IX, 2008

Text or document clustering is a subset of the larger field of data clustering and has been one of the research hotspots in text mining. On the other hand, recent studies have shown that many real systems can be represented as complex networks with astonishingly similar properties. In this work a document corpus is represented as a complex network of documents, in which the nodes represent the documents and the edges are weighted according to the similarities among documents. The detection of community structures in complex networks can then be seen as cluster analysis in document networks. Recently, community detection algorithms based on spectral properties of the underlying network have shown good results. The main motivation for applying these methods is that they have been shown to be robust to the high dimensionality of the feature space and to the inherent data sparsity resulting from text representation in the vector space model. The aim of this paper is to present the application of community detection algorithms to text mining. Experiments have been carried out on document clustering problems taken from the 20 Newsgroups corpus to evaluate the performance of the proposed approach.
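The document-network construction described here can be sketched as follows: documents become nodes, and an edge joins two documents whose pairwise similarity exceeds a threshold. The cosine measure on raw term counts and the threshold value are illustrative assumptions; the paper's actual weighting scheme may differ.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def document_network(docs, threshold=0.2):
    """Weighted document graph: nodes are document indices, an edge
    joins documents whose similarity exceeds `threshold` (an
    illustrative choice). Community detection would then run on
    this graph in place of vector-space clustering."""
    vecs = [Counter(d.lower().split()) for d in docs]
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = cosine(vecs[i], vecs[j])
            if s > threshold:
                edges[(i, j)] = s
    return edges

docs = ["stock market prices rise",
        "market prices fall on stock news",
        "local team wins football match"]
net = document_network(docs)
```

The two finance documents end up connected while the sports document stays isolated, so any community detection algorithm run on `net` would recover the obvious two-cluster structure.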

A Text Classification Method Based on Combination of Information Gain and Graph Clustering

International Journal of Information and Communication Technology Research, 2019

Text classification has a wide range of applications, such as spam filtering, automated indexing of scientific articles, identifying the genre of documents, and news monitoring. Text datasets usually contain much irrelevant and noisy information, which reduces the efficiency and increases the cost of their classification. Therefore, for effective text classification, feature selection methods are widely used to handle the high dimensionality of the data. In this paper, a novel feature selection method based on the combination of information gain and the FAST algorithm is proposed. In the proposed method, the information gain is first calculated for the features, and those with higher information gain are selected. The FAST algorithm, which uses graph-theoretic clustering methods, is then applied to the selected features. To evaluate the performance of the proposed method, we carry out experiments on three text datasets and compare our algorithm with several feature selection techniques. The results confirm that the proposed method produces a smaller feature subset in a shorter time. In addition, the evaluation of a K-nearest neighbor classifier on validation data shows that the proposed algorithm gives higher classification accuracy.
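The first stage of the method above, scoring features by information gain, can be sketched for binary term-presence features. This covers only the information-gain filter; the subsequent FAST graph-clustering stage is omitted, and the toy corpus is an assumption for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """IG(class; feature) = H(class) - H(class | feature), where the
    feature is a binary term-presence indicator per document."""
    n = len(labels)
    ig = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(feature_present, labels) if x == value]
        if subset:
            ig -= (len(subset) / n) * entropy(subset)
    return ig

# Toy corpus: this term perfectly separates the two classes,
# so its information gain is the full class entropy (1 bit).
present = [True, True, False, False]
labels = ["spam", "spam", "ham", "ham"]
```

Features are then ranked by this score and only the top-scoring ones are passed to the clustering stage.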

Text Analysis Using Different Graph-Based Representations

Computación y Sistemas, 2018

This paper presents an overview of different graph-based representations proposed to solve text classification tasks. The core of this manuscript is to highlight the importance of enriched and non-enriched co-occurrence graphs as an alternative to traditional feature representation models such as the vector representation, which often cannot capture all the richness of text documents that come from the web (social media, blogs, personal web pages, news, etc.). For each text classification task, the type of graph created as well as the benefits of using it are presented and discussed. Specifically, the types of features/patterns extracted, the classification/similarity methods implemented, and the results obtained on the datasets are explained. The theoretical and practical implications of using co-occurrence graphs are also discussed, pointing out the contributions and challenges of modeling text documents as graphs.

A graph model for text analysis and text mining

Erasmus Mundus Master's Program in Language and Communication Technology

Automated text analysis and text mining methods have received a great deal of attention because of the remarkable increase in digital documents. Typical tasks in these two areas include text classification, information extraction, document summarization, and text pattern mining. Most of them are based on text representation models used to represent text content. The traditional text representation method, the Vector Space Model, has several noticeable weaknesses with respect to its ability to capture text structure and the semantic information of text content. Recently, graph-based models have emerged as alternatives to the Vector Space Model for text representation. However, it is still difficult to include semantic information in these graph-based models. In this thesis, we propose the FrameNet-based Graph Model for Text (FGMT), a new graph model that contains structural and shallow semantic information of text by using the FrameNet resource. Moreover, we introduce a hybrid model based on FGMT which is better adapted to text classification. The experimental results show a significant improvement in classification by using our models versus a typical Vector Space Model.

Extracting community structure features for hypertext classification

2008

Standard text classification techniques assume that all documents are independent and identically distributed (i.i.d.). However, hypertext documents such as web pages are interconnected with links. How to take advantage of such links as extra evidence to enhance automatic classification of hypertext documents is a non-trivial problem. We think a collection of interconnected hypertext documents can be considered a complex network, and the underlying community structure of such a document network contains valuable clues about the right classification of documents. This paper introduces a new technique, Modularity Eigenmap, that can effectively extract community structure features from the document network, whether it is induced from document link information only or constructed by combining both document content and document link information. A number of experiments on real-world benchmark datasets show that the proposed approach leads to excellent classification performance in comparison with state-of-the-art methods.
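A simplified reading of the idea above, not the authors' code, is to take the leading eigenvectors of the modularity matrix B = A - d d^T / (2m) as low-dimensional community-structure features for each document node. The tiny two-community graph and the numpy dependency are assumptions for illustration.

```python
import numpy as np

def modularity_features(adj, k=2):
    """Top-k eigenvectors of the modularity matrix as node features.
    A is the adjacency matrix, d its degree vector, 2m the total
    degree; B = A - d d^T / (2m) is the standard modularity matrix."""
    A = np.asarray(adj, dtype=float)
    d = A.sum(axis=1)
    B = A - np.outer(d, d) / d.sum()
    vals, vecs = np.linalg.eigh(B)              # ascending eigenvalues
    return vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors

# Two obvious communities: nodes {0, 1} and {2, 3}.
adj = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
feats = modularity_features(adj, k=1)
```

The sign pattern of the leading eigenvector separates the two components, which is exactly the kind of structural signal a downstream classifier can exploit alongside content features.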

Features Selection for Supervised Learning Using Centrality Measures

Revista Gestão Inovação e Tecnologias, 2021

Data mining methods have been used extensively in the process of decision making. Their popularity is due to the availability of high-speed algorithms and the processing and storage power of computers. The effective use of data mining methods helps in mining datasets and making better decisions. The data need to be preprocessed before data mining methods are applied. Some datasets require little preparation, such as dealing with missing and redundant instances, while some high-dimensional datasets require stronger processing such as dimensionality reduction. One of the techniques used for dimensionality reduction is feature selection. This study uses graph-based centrality measures for feature selection: the centrality measures are used to rank features, and the ranking is used to remove irrelevant attributes. After comparison of the results with other approaches, it has been found that the proposed approach reduces the feature space without compromising accuracy. The res...
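One way to realize centrality-based feature ranking, sketched here under assumptions (a correlation-threshold feature graph, degree centrality, a hypothetical toy dataset; the paper may use a different graph construction or centrality measure):

```python
import math

def pearson(x, y):
    """Pearson correlation between two feature columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def rank_features_by_degree(columns, threshold=0.5):
    """Build a feature graph (edges join features whose absolute
    correlation exceeds `threshold`, an illustrative choice) and
    rank features by degree centrality, highest first."""
    n = len(columns)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(columns[i], columns[j])) > threshold:
                degree[i] += 1
                degree[j] += 1
    return sorted(range(n), key=lambda i: degree[i], reverse=True)

# Hypothetical data: features 0-2 are mutually correlated,
# feature 3 is essentially uncorrelated with the rest.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 1, 2, 1]]
ranking = rank_features_by_degree(columns)
```

How the resulting ranking is cut into a final feature subset (top-k, threshold on centrality, etc.) is a separate design choice.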

A network approach to dimensionality reduction in Text Mining

2018

The ever-increasing popularity of the Internet, together with the amazing progress of computer technology, has led to a tremendous growth in the availability of electronic documents. There is great interest in developing statistical tools for the effective and efficient extraction of information from document repositories on the Web. The most common reference model for representing documents is the so-called vector space model. Documents are encoded as bags-of-words, i.e. as unordered sets of terms, disregarding grammatical and syntactical roles. The focus is on the presence/absence of a term in a document and on its characterisation and discrimination power. The knowledge discovery process implies a dimensionality reduction step, via feature selection and/or feature extraction. Here we propose a novel strategy designed for dimensionality reduction in a Text Mining frame. The idea is that textual data can be processed at different levels, e.g. as single terms or subsets of terms i...

Role of Natural Language Processing in Community Structure Detection

Proceedings of the 4th National …, 2010

In this paper, relationship as a metric in the community is introduced for community detection in a social network. The relationship metric is inspired by Newman's edge betweenness metric, but is an attempt to look at the community detection problem from the angle of Natural Language Processing. In this approach, a social network is developed on the basis of blog data or email archives, and lexical and semantic analysis is then performed to find the relationships between every pair of individuals. Finally, a social network structure is obtained in which vertices represent individuals and edges between vertices represent their relationships. To detect the communities, it must first be decided what type of community is of interest; in other words, we need to decide on the relationship metric of interest, and relationships (edges) that are not of interest are then removed. Thus a social network whose edges all carry the same type of relationship is obtained. To this network, a divisive algorithm proposed by Girvan and Newman is applied that uses edge betweenness as a metric to identify the boundaries of communities.
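The edge betweenness measure at the heart of the Girvan-Newman step can be sketched naively for small graphs: for each node pair, one unit of "flow" is split evenly across all shortest paths between them and accumulated on the edges those paths use. The path enumeration below is brute-force BFS, fine for the toy graph but not the efficient Brandes algorithm used in practice; the graph itself is an illustrative assumption.

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s, t):
    """All shortest paths from s to t via breadth-first enumeration
    (suitable only for small graphs like the example below)."""
    paths, best = [], None
    q = deque([[s]])
    while q:
        path = q.popleft()
        if best is not None and len(path) > best:
            break
        node = path[-1]
        if node == t:
            best = len(path)
            paths.append(path)
            continue
        for nb in adj[node]:
            if nb not in path:
                q.append(path + [nb])
    return paths

def edge_betweenness(adj):
    """Accumulate, per edge, the fraction of shortest paths that
    cross it, summed over all node pairs."""
    bet = {}
    for s, t in combinations(adj, 2):
        paths = shortest_paths(adj, s, t)
        for p in paths:
            for u, v in zip(p, p[1:]):
                e = tuple(sorted((u, v)))
                bet[e] = bet.get(e, 0.0) + 1.0 / len(paths)
    return bet

# Two triangles joined by a single bridge: every cross-community
# shortest path must use the bridge ('c', 'd').
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'},
       'd': {'c', 'e', 'f'}, 'e': {'d', 'f'}, 'f': {'d', 'e'}}
bet = edge_betweenness(adj)
```

The divisive algorithm then repeatedly removes the highest-betweenness edge (here the bridge) and recomputes, so the graph falls apart along community boundaries.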