A Bayesian Classification Approach Using Class-Specific Features for Text Categorization
Related papers
A Novel Fuzzy-Bayesian Classification Method for Automatic Text Categorization
Text categorization labels documents automatically with topics from a predefined set, and a large number of advanced machine learning algorithms have been applied to it. In the proposed system, a fuzzy rule is combined with a Bayesian classification method for automatic text categorization using class-specific features. The proposed method selects a particular feature subset for each class and then applies these class-specific features for classification. To achieve this, Baggenstoss's PDF Projection Theorem is used to reconstruct the PDF in the raw data space from the class-specific PDF in a low-dimensional feature space, and a fuzzy Bayes classification rule is built on it. A noticeable advantage of this method is that most feature selection criteria, such as information gain and maximum discrimination, can be easily incorporated into it. The classification performance of the proposed method is evaluated on different datasets and compared with different feature selection methods. The experimental results illustrate the effectiveness of the proposed method and further indicate its wide applicability in text categorization.
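As a rough illustration of the approach the abstract describes, the sketch below builds a Bayes rule from class-specific feature subsets using the PDF Projection Theorem, p(x|c) ∝ p_z(T_c(x)|c) / p_z(T_c(x)|H0). The diagonal-Gaussian density model, the pooled-data reference hypothesis H0, and the `ClassSpecificBayes` class are illustrative assumptions, not the authors' implementation; the shared raw-data factor p(x|H0) cancels in the argmax, so only the density ratio over each class's own features is needed.

```python
# Minimal sketch (assumptions noted above) of a class-specific-feature
# Bayes rule built via PDF projection: score(c) = log P(c)
#   + log p_z(T_c(x) | c) - log p_z(T_c(x) | H0).
import numpy as np

class ClassSpecificBayes:
    def __init__(self, subsets):
        self.subsets = subsets          # dict: class label -> feature index array

    @staticmethod
    def _fit_gauss(Z):
        return Z.mean(axis=0), Z.var(axis=0) + 1e-6   # diagonal Gaussian, smoothed

    @staticmethod
    def _log_pdf(Z, mu, var):
        return (-0.5 * (np.log(2 * np.pi * var) + (Z - mu) ** 2 / var)).sum(axis=1)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.log((y == c).mean()) for c in self.classes_}
        self.class_pdf_ = {}            # p_z(z | c) on class c's own features
        self.ref_pdf_ = {}              # p_z(z | H0): pooled data, same features
        for c in self.classes_:
            idx = self.subsets[c]
            self.class_pdf_[c] = self._fit_gauss(X[y == c][:, idx])
            self.ref_pdf_[c] = self._fit_gauss(X[:, idx])
        return self

    def predict(self, X):
        scores = np.column_stack([
            self.priors_[c]
            + self._log_pdf(X[:, self.subsets[c]], *self.class_pdf_[c])
            - self._log_pdf(X[:, self.subsets[c]], *self.ref_pdf_[c])
            for c in self.classes_
        ])
        return self.classes_[scores.argmax(axis=1)]
```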
A Probabilistic Approach to Feature Selection for Multi-class Text Categorization
Lecture Notes in Computer Science, 2007
In this paper, we propose a probabilistic approach to feature selection for multi-class text categorization. Specifically, we regard the document class and the occurrence of each feature as events, calculate the probability of occurrence of each feature by the theorem of total probability, and use the resulting values as a ranking criterion. Experiments on the Reuters-2000 collection show that the proposed method can yield better performance than information gain and χ², two well-known feature selection methods.
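Read literally, the ranking criterion above is score(t) = Σ_c P(c)·P(t|c) by the theorem of total probability. The sketch below is a minimal rendering of that criterion under the assumption of maximum-likelihood estimates from a binary document-term matrix; the paper's actual estimation details may differ.

```python
# Minimal sketch (an assumption, not the paper's code) of total-probability
# feature ranking: score(t) = sum_c P(c) * P(t | c).
import numpy as np

def total_probability_scores(X, y):
    """X: binary document-term matrix (n_docs, n_terms); y: class labels."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                      # P(c)
    scores = np.zeros(X.shape[1])
    for c, p_c in zip(classes, priors):
        p_t_given_c = X[y == c].mean(axis=0)      # P(t | c), ML estimate
        scores += p_c * np.asarray(p_t_given_c).ravel()
    # note: with pure ML estimates this equals the corpus-wide P(t);
    # the paper presumably refines the per-class estimates
    return scores

# Usage: keep the k top-ranked terms
# top_k = np.argsort(-total_probability_scores(X, y))[:k]
```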
An Evident Theoretic Feature Selection Approach for Text Categorization
enggjournals.com
With the exponential growth of textual documents available in unstructured form on the Internet, feature selection approaches are increasingly significant in the preprocessing of textual documents for automatic text categorization. Feature selection, which focuses on identifying relevant and informative features, can help reduce the computational cost of processing voluminous amounts of data as well as increase the effectiveness of the subsequent text categorization tasks. In this paper, we propose a new evidence-theoretic feature selection approach for text categorization based on the transferable belief model (TBM). An evaluation of the performance of the proposed approach on benchmark datasets is also presented. We empirically show that our approach outperforms traditional feature selection methods on two standard benchmark datasets.
Efficient Text Categorization using Naïve Bayes Classification
2017
Text classification is the task of automatically sorting a set of documents into categories from a predefined set. It is a data mining technique used to predict group membership for data instances within a given dataset, classifying data into different classes subject to certain constraints. Instead of the conventional feature selection techniques used for text document classification, we present a new model based on probability and the overall class frequency of a term. The Naive Bayesian classifier is based on Bayes' theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for large datasets. The paper demonstrates that the new probabilistic interpretation of tf×idf term weighting may lead to a better understanding of statistical ranking mechanisms.
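A minimal, self-contained example of the baseline the abstract builds on, a Naive Bayes classifier over tf×idf term weights, can be written with scikit-learn; the tiny dataset and pipeline below are purely illustrative.

```python
# Minimal sketch (illustrative, not the paper's code): Naive Bayes text
# classification over tf-idf weights with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap flights to rome", "stock prices fall", "book a hotel room"]
labels = ["travel", "finance", "travel"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["hotel booking deals"]))   # -> ['travel']
```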
Information Gain and Divergence-Based Feature Selection for Machine Learning-Based Text Categorization
Information Processing & Management, 2006
Most previous work on feature selection emphasized only the reduction of the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must resort to other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization that does not rely on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami's method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and conventional information gain, which is commonly used in feature selection for text categorization. Moreover, with our feature selection method, conventional machine learning algorithms sometimes improve enough to outperform support vector machines, which are known to give the best classification accuracy.
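The abstract does not spell out how redundancy is scored, so the sketch below is one plausible reading: greedily pick terms with high information gain while rewarding divergence (here, the symmetric KL / Jeffreys divergence between smoothed class-conditional distributions P(c|t)) from terms already chosen. The weighting `lam` and the divergence choice are assumptions.

```python
# Minimal sketch (one plausible reading, not the authors' exact criterion):
# greedy selection by information gain plus a divergence-based redundancy bonus.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def class_dist(X, y, t, classes):
    """Smoothed P(c | term t present)."""
    has_t = X[:, t] > 0
    counts = np.array([((y == c) & has_t).sum() for c in classes], float)
    return (counts + 1) / (has_t.sum() + len(classes))

def information_gain(X, y, t, classes):
    prior = np.array([(y == c).mean() for c in classes])
    has_t = X[:, t] > 0
    p_t = has_t.mean()
    absent = np.array([((y == c) & ~has_t).sum() for c in classes], float)
    p_c_absent = (absent + 1) / ((~has_t).sum() + len(classes))
    return (entropy(prior) - p_t * entropy(class_dist(X, y, t, classes))
            - (1 - p_t) * entropy(p_c_absent))

def select_terms(X, y, k, lam=0.5):
    classes = np.unique(y)
    n_terms = X.shape[1]
    ig = np.array([information_gain(X, y, t, classes) for t in range(n_terms)])
    dist = [class_dist(X, y, t, classes) for t in range(n_terms)]
    chosen = [int(np.argmax(ig))]
    while len(chosen) < k:
        def score(t):
            # Jeffreys (symmetric KL) divergence to the most similar chosen
            # term: large divergence = low redundancy.
            d = min(((dist[t] - dist[s]) * np.log2(dist[t] / dist[s])).sum()
                    for s in chosen)
            return ig[t] + lam * d
        remaining = [t for t in range(n_terms) if t not in chosen]
        chosen.append(max(remaining, key=score))
    return chosen
```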
2011
In this paper, we compare several aspects of automatic text categorization, including document representation, feature selection, and three classifiers, and apply them to text collections in two languages. Regarding the computational representation of documents, we compare the traditional bag-of-words representation with four alternative representations: bag of multiwords and bag of word prefixes with N characters (for N = 4, 5, and 6). Concerning feature selection, we compare the well-known feature selection metrics Information Gain and Chi-Square with a new one based on third-moment statistics that enhances rare terms. As for the classifiers, we compare the well-known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on the Mahalanobis distance. Finally, the study is language-independent and was applied to two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).
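The bag-of-word-prefixes representation is simple to reproduce: truncate each word to its first N characters before counting, so that morphological variants collapse onto one feature. A minimal sketch:

```python
# Illustrative sketch of the "bag of word prefixes" representation: variants
# like "categorization" and "categorizes" share the prefix feature "categ".
from collections import Counter

def bag_of_prefixes(text, n=5):
    return Counter(w[:n] for w in text.lower().split())

print(bag_of_prefixes("categorization categorizes categories", n=5))
# Counter({'categ': 3})
```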
A framework of feature selection methods for text categorization
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP '09, 2009
In text categorization, feature selection (FS) is a strategy that aims at making text classifiers more efficient and accurate. However, when dealing with a new task, it is still difficult to quickly select a suitable method from the many FS methods proposed in previous studies. In this paper, we propose a theoretical framework of FS methods based on two basic measurements: frequency measurement and ratio measurement. Six popular FS methods are then discussed in detail under this framework. Moreover, guided by our theoretical analysis, we propose a novel method called weighed frequency and odds (WFO) that combines the two measurements with trained weights. Experimental results on datasets from both topic-based and sentiment classification tasks show that the new method is robust across different tasks and numbers of selected features.
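The abstract does not give the WFO formula; a commonly cited form, used in the hedged sketch below, is WFO(t,c) = P(t|c)^λ · [log(P(t|c)/P(t|c̄))]^(1−λ) when P(t|c) > P(t|c̄) and 0 otherwise, so λ = 1 recovers a pure frequency measurement and λ = 0 a pure log-odds-ratio measurement. The smoothing constants are assumptions.

```python
# Hedged sketch of WFO scoring (formula as commonly presented, not taken
# from this abstract); returns one score per term for class c.
import numpy as np

def wfo(X, y, c, lam=0.5, eps=1e-6):
    """X: binary doc-term matrix (n_docs, n_terms); y: class labels."""
    in_c = (y == c)
    p_pos = (X[in_c].sum(axis=0) + eps) / (in_c.sum() + 2 * eps)      # P(t|c)
    p_neg = (X[~in_c].sum(axis=0) + eps) / ((~in_c).sum() + 2 * eps)  # P(t|c_bar)
    ratio = np.log(p_pos / p_neg)
    # score only terms that occur more often inside the class than outside it
    return np.where(ratio > 0,
                    p_pos ** lam * np.maximum(ratio, eps) ** (1 - lam),
                    0.0)
```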
Text categorization is the task of automatically sorting a set of documents into categories from a predefined set; that is, it assigns predefined categories to free-text documents. In this paper we propose a two-stage feature selection method for text categorization using information gain, principal component analysis, and a genetic algorithm. In the first stage, every term in the documents is ranked by its importance for classification using the information gain (IG) method. In the second stage, the genetic algorithm (GA) and principal component analysis (PCA) feature selection and feature extraction methods are applied individually to the terms ranked in decreasing order of importance, and dimension reduction is carried out. Thus, terms of low importance are ignored throughout the categorization, and feature selection and extraction are applied only to the most important terms, reducing the computational time and complexity of categorization. To analyze the dimension reduction in our proposed model, experiments are conducted using the k-nearest neighbor (KNN) and C4.5 decision tree algorithms on selected datasets.
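A compressed sketch of the two-stage idea, with scikit-learn's `mutual_info_classif` standing in for information gain and PCA as the second-stage reducer (the GA branch is omitted); the cutoffs `m` and `d` are illustrative:

```python
# Illustrative two-stage reduction: IG-style ranking, then PCA on the
# surviving terms. Assumes a dense feature matrix X and labels y.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.decomposition import PCA

def two_stage_reduce(X, y, m=1000, d=100):
    ig = mutual_info_classif(X, y)          # stage 1: rank terms by information
    top = np.argsort(-ig)[:m]               # keep the m most informative terms
    return PCA(n_components=d).fit_transform(X[:, top])  # stage 2: project to d dims
```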
A novel feature selection algorithm for text categorization
Expert Systems with Applications, 2007
With the development of the web, large numbers of documents are available on the Internet, and digital libraries, news sources, and companies' internal data keep growing. Automatic text categorization is therefore increasingly important for dealing with massive data, but its major problem is the high dimensionality of the feature space. Although many text feature selection methods already exist, we present another approach to improve the performance of text categorization. Our study is based on Gini index theory, and we design a novel Gini index algorithm to reduce the high dimensionality of the feature space. A new Gini index measure function is constructed and adapted to text categorization. Experimental results show that our improved Gini index performs better than other feature selection methods.
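The abstract does not state the measure function; one widely cited improved Gini index for text scores a term as Gini(t) = Σ_c P(t|c)²·P(c|t)², favoring terms whose occurrences concentrate in few classes. The sketch below implements that form as an assumption, not necessarily the paper's exact function.

```python
# Hedged sketch of a Gini-index-style feature score for text:
#   Gini(t) = sum_c P(t|c)^2 * P(c|t)^2
import numpy as np

def gini_scores(X, y, eps=1e-9):
    """X: binary doc-term matrix (n_docs, n_terms); y: class labels."""
    scores = np.zeros(X.shape[1])
    df = X.sum(axis=0) + eps                      # docs containing each term
    for c in np.unique(y):
        df_c = X[y == c].sum(axis=0)              # class-c docs containing term
        p_t_c = df_c / max((y == c).sum(), 1)     # P(t|c)
        p_c_t = df_c / df                         # P(c|t)
        scores += (p_t_c ** 2) * (p_c_t ** 2)
    return scores                                 # rank terms by descending score
```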