Discriminative features for text document classification (original) (raw)

Discriminative features for document classification

Object recognition supported by user interaction for service robots, 2002

The bag-of-words approach to text document representation typically results in vectors of the order of 5000 to 20000 components as the representation of documents. In order to make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination of two popular such methods, Latent Semantic Indexing (LSI), and sequential feature selection according to some relevant criterion.

Linear discriminant analysis in document classification

2001

Document representation using the bag-of-words approach may require bringing the dimensionality of the representation down in order to be able to make effective use of various statistical classification methods. Latent Semantic Indexing (LSI) is one such method that is based on eigendecomposition of the covariance of the document-term matrix. Another often used approach is to select a small number of most important features out of the whole set according to some relevant criterion. This paper points out that LSI ignores discrimination while concentrating on representation. Furthermore, selection methods fail to produce a feature set that jointly optimizes class discrimination. As a remedy, we suggest supervised linear discriminative transforms, and report good classification results applying these to the Reuters-21578 database.

Text Categorization: An extensive comparison of classifiers, feature selection metrics and document representation

2011

In this paper, we compare several aspects related to automatic text categorization which include document representation, feature selection, three classifiers, and their application to two language text collections. Regarding the computational representation of documents, we compare the traditional bag of words representation with 4 other alternative representations: bag of multiwords and bag of word prefixes with N characters (for N = 4, 5 and 6). Concerning the feature selection we compare the well known feature selection metrics Information Gain and Chi-Square with a new one based on the third moment statistics which enhances rare terms. As to the classifiers, we compare the well known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on Mahalanobis distance. Finally, the study performed is language independent and was applied over two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).

Classification of Text Documents

The exponential growth of the internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied to seven-class Yahoo news groups (business, entertainment, health, international, politics, sports and technology) individually and in combination. We studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Our experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on our data sets. Combinations of multiple classifiers did not always improve the classification accuracy compared to the best individual classifier. Among the three different combination approaches, our adaptive classifier combination method introduced here performed the best. The best classification accuracy that we are able to achieve on this seven-class problem is approximately 83%, which is comparable to the performance of other similar studies. However, the classification problem considered here is more difficult because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.

Efficient multi-way text categorization via generalized discriminant analysis

2003

Abstract Text categorization is an important research area and has been receiving much attention due to the growth of the on-line information and of Internet. Automated text categorization is generally cast as a multi-class classification problem. Much of previous work focused on binary document classification problems. Support vector machines (SVMs) excel in binary classification, but the elegant theory behind large-margin hyperplane cannot be easily extended to multi-class text classification.

Text categorization via generalized discriminant analysis

2008

Text categorization is an important research area and has been receiving much attention due to the growth of the on-line information and of Internet. Automated text categorization is generally cast as a multi-class classification problem. Much of previous work focused on binary document classification problems. Support vector machines (SVMs) excel in binary classification, but the elegant theory behind large-margin hyperplane cannot be easily extended to multi-class text classification.

Text Classification for Intelligent

2004

In the application domain of stock portfolio management, software agents that evaluate the risks associated with the individual companies of a portfolio should be able to read electronic news articles that are written to give investors an indication of the financial outlook of a company. There is a positive correlation between news reports on a company's financial outlook and the company's attractiveness as an investment. However, because of the volume of such reports, it is impossible for financial analysts or investors to track and read each one. Therefore, it would be very helpful to have a system that automatically classifies news reports that reflect positively or negatively on a company's financial outlook. To accomplish this task, we treat the analysis of news articles as a text classification problem. We developed a text classification algorithm that classifies financial news article by using a combination of a reduced but highly informative word feature sets and a variant of weighted majority algorithm. By clustering words represented in latent semantic vector space by LSA into groups with similar concepts, we are able to find semantically coherent word groups. A learning method with unlabeled data "Self-Confident" sampling was proposed to handle the problem of expensive data labeling. Vote entropy is the criterion that information-theoretically assigns a label to an unlabeled document. In comparison with naive Bayes classification boosted by Expectation Maximization (EM), the proposed method showed a better performance in terms of accuracy. Two criteria are used to evaluate methods: how well they improve their performances with unlabeled data after being initially trained on a small number of human-labeled articles and how well they classify the latest financial news articles which are mostly not seen during the training. The contribution of this work lies in the new classification method that we propose and in the sampling technique we used for improving classification accuracy.

A survey on text classification and its applications

Web Intelligence, 2020

Text classification (a.k.a text categorisation) is an effective and efficient technology for information organisation and management. With the explosion of information resources on the Web and corporate intranets continues to increase, it has being become more and more important and has attracted wide attention from many different research fields. In the literature, many feature selection methods and classification algorithms have been proposed. It also has important applications in the real world. However, the dramatic increase in the availability of massive text data from various sources is creating a number of issues and challenges for text classification such as scalability issues. The purpose of this report is to give an overview of existing text classification technologies for building more reliable text classification applications, to propose a research direction for addressing the challenging problems in text mining.

Survey Paper on Feature Extraction Methods in Text Categorization

International Journal of Computer Applications

As the world is moving towards globalization, digitization of text has been escalating a lot and the need to organize, categorize and classify text has become obligatory. Disorganization or little categorization and sorting of text may result in dawdling response time of information retrieval. There has been the 'curse of dimensionality' (as termed by Bellman)[1] problem, namely the inherent sparsity of high dimensional spaces. Thus, the search for a possible presence of some unspecified structure in such a high dimensional space can be difficult. This is the task of feature reduction methods. They obtain the most relevant information from the original data and represent the information in a lower dimensionality space. In this paper, all the applied methods on feature extraction on text categorization from the traditional bag-of-words model approach to the unconventional neural networks are discussed.