Text Categorization: An extensive comparison of classifiers, feature selection metrics and document representation

Analysis of feature selection measures for text categorisation

International Journal of Enterprise Network Management, 2017

The curse of dimensionality has made dimension reduction an essential step in text categorisation. Feature selection is one approach to dimension reduction. In this paper, an analysis of feature selection measures for text categorisation is performed. Document frequency is considered under the unsupervised approach, and chi-square, odds ratio, mutual information and information gain under the supervised approach. They are considered here because they are widely used and effective measures. The analysis is performed using the 20 newsgroups dataset, which consists of closely related categories as well as highly unrelated categories. Certain categories of the 20 newsgroups dataset are selected and organised into three groups: overlapping (highly related) classes, non-overlapping (highly unrelated) classes, and a combination of overlapping and non-overlapping classes. Feature selection and subsequent classification are applied to the three groups separately, and the classification performance is studied based on the feature selection measures. The most noticeable behaviour was that of the odds ratio measure: it performed well for the non-overlapping and overlapping groups considered separately, but performed worse for the group containing both overlapping and non-overlapping categories. The remaining measures showed consistent behaviour across all three groups. Classification was achieved using a support vector machine classifier. The performance comparisons of the different measures on the different groups are presented in terms of micro-F1 and macro-F1.
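As an illustration of one of the supervised measures named in this abstract, the chi-square score of a term for a category can be computed from a 2x2 document-count table. The following is a minimal sketch using the commonly cited formulation, not the paper's own implementation; the function names, the set-of-tokens document representation and the toy data are assumptions made for this example.

```python
def contingency(docs, labels, term, category):
    """Count documents by term presence and category membership.

    docs is a list of token sets and labels the matching category names
    (both are hypothetical inputs chosen for this sketch).
    Returns (a, b, c, d):
      a = in category, contains term      b = not in category, contains term
      c = in category, lacks term         d = not in category, lacks term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # A row or column of the table is empty, so the term carries
        # no usable evidence for or against the category.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


# Toy corpus, invented purely for illustration.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "law"}, {"vote", "match"}]
labels = ["sport", "sport", "politics", "politics"]

for term in ("goal", "match"):
    score = chi_square(*contingency(docs, labels, term, "sport"))
    print(term, score)
```

A term that occurs only in one category scores high, while a term spread evenly across categories scores zero, which is why ranking by this score and keeping the top terms reduces dimensionality.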

A novel feature selection algorithm for text categorization

Expert Systems With Applications, 2007

With the development of the web, large numbers of documents are available on the Internet. Digital libraries, news sources and internal company data keep growing. Automatic text categorization is becoming more and more important for dealing with massive data. However, the major problem of text categorization is the high dimensionality of the feature space. At present there are many methods for text feature selection. To improve the performance of text categorization, we present another method for text feature selection. Our study is based on Gini index theory, and we design a novel Gini index algorithm to reduce the high dimensionality of the feature space. A new Gini index measure function is constructed and adapted to text categorization. Experimental results show that our improved Gini index performs better than other feature selection methods.

An empirical evaluation of text classification and feature selection methods

Artificial Intelligence Research, 2016

An extensive empirical evaluation of classifiers and feature selection methods for text categorization is presented. More than 500 models were trained and tested using different combinations of corpora, term weighting schemes, number of features, feature selection methods and classifiers. The performance measures used were micro-averaged F measure and classifier training time. The experiments used five benchmark corpora, three term weighting schemes, three feature selection methods and four classifiers. Results indicated only a slight performance improvement with all the features over only 20% of the features selected using Information Gain and Chi Square. More importantly, this performance improvement was not deemed statistically significant. Support Vector Machine with linear kernel reigned supreme for text categorization tasks, producing the highest F measures and low training times even in the presence of high class skew. We found a statistically significant difference between the performance of Support Vector Machine and the other classifiers on text categorization problems.
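The micro-averaged F measure used here pools true positives, false positives and false negatives over all classes before computing F1, whereas the macro-averaged variant reported in some of the other abstracts averages the per-class F1 values. A minimal sketch for single-label classification follows; the function name `micro_macro_f1` and the toy labels are assumptions made for illustration and do not come from any of the papers.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denominator = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denominator if denominator else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Skewed toy example: the rare class "b" is never predicted.
micro, macro = micro_macro_f1(["a", "a", "a", "b"], ["a", "a", "a", "a"])
print(micro, macro)
```

Under class skew the two averages can diverge sharply, because the micro average is dominated by the frequent classes while the macro average weights every class equally.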

Words as Rules: Feature Selection in Text Categorization

Lecture Notes in Computer Science, 2004

In Text Categorization problems usually there is a lot of noisy and irrelevant information present. In this paper we propose to apply some measures taken from the Machine Learning environment for Feature Selection. The classifier used is Support Vector Machines. The experiments over two different corpora show that some of the new measures perform better than the traditional Information Theory measures.

Two New Approaches to Feature Selection for Document Categorization

Due to the huge volume of text documents available on the Internet, it is increasingly necessary to manage them effectively and help users retrieve what they want. Document categorization can organize documents into domain-specific classes and so facilitate information retrieval. In general, most document categorization systems are composed of three kinds of models: one for weighting terms, a second for selecting feature terms and a third for categorizing documents accordingly. In practice, a document categorization system is essentially one of the different combinations of these models. Based on the observation of model coherence in document categorization systems, this paper proposes two approaches to feature selection, CBA and IBA. Empirical results obtained with k-Nearest Neighbors and naïve Bayes classifiers on the Reuters-21578 corpus show that CBA and IBA are comparable to the χ² feature selection model.

Automated Text Categorization with Machine Learning and its Application in Multilingual Text Categorization

The automated categorization (or classification) of texts into predefined categories is one of the booming fields of text mining. Nowadays the availability of digital data is very high, and managing it in predefined categories has become a challenging task. Machine learning is a technique by which we can build an automated classifier that classifies documents with minimal human assistance. The advantages of this approach over the knowledge engineering approach are effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This paper discusses the Naïve Bayes, Rocchio and kNN methods within the machine learning paradigm for automated text categorization of documents into predefined categories. We also discuss multilingual text categorization, which consists of classifying documents in different languages according to the same classification tree.

Automatic Text Classification: A Comparative Study

The massive amount of semi-structured data contained within text documents makes classifying them manually a very difficult task. Automatic text classification is the process of classifying documents, based on their contents, into a predefined set of categories. This paper provides a comparison of the performance of well-known text classification techniques including genetic algorithm, k nearest neighbor, decision tree, support vector machine and Naïve Bayes. Light stemmer and Chi method have been implemented as preprocessing and feature selection techniques. The effectiveness of the classifiers is evaluated in terms of macro-average F1 measure. In order to evaluate the five classification techniques, a text corpus has been collected. Results showed that the support vector machine and the Naïve Bayes classifiers outperform the other classifiers in terms of classification accuracy.

COMPARATIVE STUDY OF CLASSIFICATION ALGORITHM FOR TEXT BASED CATEGORIZATION

Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique. It provides conceptual views of the collected documents and has important applications in the real world. Text-based categorization is used for document classification with pattern recognition and machine learning. The advantages of a number of classification algorithms for classifying documents have been studied in this paper. Examples of these algorithms include Naive Bayes, K-Nearest Neighbor and Decision Tree. This paper presents a comparative study of the advantages and disadvantages of the above-mentioned classification algorithms.

Text Categorization Optimization By A Hybrid Approach Using Multiple Feature Selection And Feature Extraction Methods

Text categorization is the task of automatically sorting a set of documents into categories from a pre-defined set. This means it assigns predefined categories to free-text documents. In this paper we propose a unique two-stage feature selection method for text categorization using information gain, principal component analysis and a genetic algorithm. In the first stage, every term in the document is ranked according to its importance for classification using the information gain (IG) method. In the second stage, the genetic algorithm (GA) and principal component analysis (PCA) feature selection and feature extraction methods are applied individually to the terms, which are ranked in decreasing order of importance, and a dimension reduction is carried out. Therefore, throughout text categorization, terms of less importance are ignored, and feature selection and extraction methods are applied to the terms of highest importance, so the computational time and complexity of categorization are reduced. To analyze the dimension reduction in our proposed model, experiments are conducted using the k-nearest neighbor (KNN) and C4.5 decision tree algorithms on a selected data set.
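The ranking stage described in this abstract can be sketched as follows: each term is scored by the information gain of its presence or absence with respect to the class labels, and terms are then sorted in decreasing order of score. This is only an illustrative sketch of a standard information gain computation under assumed inputs (documents as token sets); the function names and toy data are hypothetical, and the GA and PCA stage is not shown.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG of a binary 'term present' feature with respect to the labels."""
    n = len(docs)
    with_term = Counter(l for d, l in zip(docs, labels) if term in d)
    without_term = Counter(l for d, l in zip(docs, labels) if term not in d)
    n_with = sum(with_term.values())
    n_without = n - n_with
    class_entropy = entropy(list(Counter(labels).values()))
    conditional = (n_with / n) * entropy(list(with_term.values())) \
        + (n_without / n) * entropy(list(without_term.values()))
    return class_entropy - conditional


def rank_terms(docs, labels):
    """Sort the vocabulary by decreasing IG, breaking ties alphabetically."""
    vocabulary = set().union(*docs)
    return sorted(vocabulary,
                  key=lambda t: (-information_gain(docs, labels, t), t))


# Toy corpus, invented purely for illustration.
docs = [{"ball", "game"}, {"ball", "team"}, {"vote", "law"}, {"vote", "game"}]
labels = ["sport", "sport", "politics", "politics"]
print(rank_terms(docs, labels))
```

Keeping only the highest-ranked terms before applying a costlier method is what lets the later stage operate on a much smaller feature space.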

A Survey on Machine Learning Based Text Categorization

2018

As the number of documents available in digital form becomes enormous, the need to access them in a more flexible way becomes extremely important. In this context, the content-based document management task is called Information Retrieval (IR). It has achieved a noticeable position in the area of information systems. For faster IR response times, it is essential to organize, categorize and classify texts and digital documents according to the definitions proposed by text mining experts and computer scientists. Automatic text categorization, or topic spotting, is the process of automatically sorting a document set into categories from a predefined set. According to researchers, the superior approach to this problem depends on machine learning methods, in which a general posteriori process automatically builds a classifier by learning from the given pre-classified documents and the category’s characteristics. Automatic text categorization has gained acceptance because it is fre...