Binning: Converting numerical classification into text classification
Consider a supervised learning problem in which examples contain both numerical- and text-valued features. One common approach to this problem is to treat the presence or absence of a word as a Boolean feature, which, when combined with the other numerical features, enables the application of a range of traditional feature-vector-based learning methods. This paper presents an alternative approach, in which numerical features are converted into "bag of words" features, enabling instead the use of a range of existing text-classification methods. Our approach creates a set of bins for each feature into which its observed values can fall. Two tokens are defined for each bin endpoint, indicating on which side of that endpoint a feature's value lies. A numerical feature is then assigned the bag of tokens appropriate for its value. Not only does this approach make it possible to apply text-classification methods to problems involving both numerical and text-valued features; even problems that contain solely numerical features can be converted to this representation so that text-classification methods can be applied. We therefore evaluate our approach both on a range of real-world datasets taken from the UCI Repository that involve solely numerical features and on additional datasets that contain both numerical- and text-valued features. Our results show that the performance of the text-classification methods using the binning representation often meets or exceeds that of traditional supervised learning methods (C4.5, k-NN, NBC, and Ripper), even on existing numerical-feature-only datasets from the UCI Repository, suggesting that text-classification methods, coupled with binning, can serve as a credible learning approach for traditional supervised learning problems.
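The binning scheme the abstract describes can be made concrete with a short sketch. This is a minimal illustration rather than the paper's implementation: it assumes equal-width bins (the abstract does not commit to a particular discretization), and the token format and the function names `bin_endpoints` and `endpoint_tokens` are ours.

```python
# A minimal sketch of the binning-to-tokens idea. Assumption: equal-width
# bins; the paper may use a different discretization scheme.

def bin_endpoints(values, n_bins=5):
    """Equal-width interior bin endpoints over one feature's observed range."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]

def endpoint_tokens(name, value, endpoints):
    """Two possible tokens per endpoint; emit the one matching the side
    of the endpoint on which `value` falls."""
    return [f"{name}_le_{i}" if value <= e else f"{name}_gt_{i}"
            for i, e in enumerate(endpoints)]

# Example: convert one numeric feature value into a "bag of words".
ages = [23, 31, 45, 52, 67]
eps = bin_endpoints(ages)                # endpoints fit on training data
print(endpoint_tokens("age", 45, eps))
# -> ['age_gt_0', 'age_gt_1', 'age_le_2', 'age_le_3']
```

Because nearby values fall on the same side of most endpoints, they share most of their tokens, which is what lets a text classifier generalize across similar numbers.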
Related papers
Converting numerical classification into text classification
Artificial Intelligence, 2003
Using text classifiers for numerical classification
2001
Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. However, applying a text-classification system to such data is more problematic: in the most straightforward approach, each number would be considered a distinct token and treated as a word. This paper presents an alternative approach to using text-classification methods for supervised learning problems with numerical-valued features, in which the numerical features are converted into bag-of-words features, thereby making them directly usable by text-classification methods. We show that even on purely numerical-valued data, the results of text classification on the derived text-like representation outperform the more naive numbers-as-tokens representation and, more importantly, are competitive with mature numerical classification methods such as C4.5 and Ripper.
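The "numbers as tokens" baseline this abstract contrasts against is easy to sketch; the token format below is our illustrative assumption, not the paper's.

```python
# A tiny sketch of the naive numbers-as-tokens representation.
def numbers_as_tokens(name, value):
    # Each distinct number becomes its own "word", so 45.0 and 45.1 share
    # no features and the classifier cannot generalize between them.
    return [f"{name}={value}"]

print(numbers_as_tokens("age", 45.0))  # ['age=45.0']
print(numbers_as_tokens("age", 45.1))  # ['age=45.1']  (no overlap at all)
```

The binned representation sketched earlier avoids exactly this failure mode, since nearby values share most of their endpoint tokens.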
Feature engineering for text classification
1999
Most research in text classification has used the "bag of words" representation of text. This paper examines some alternative ways to represent text based on syntactic and semantic relationships between words (phrases, synonyms, and hypernyms). We describe the new representations and explain why we expected them to improve the performance of a rule-based learner. The representations are evaluated using the RIPPER rule-based learner on the Reuters-21578 and DigiTrad test corpora, but on their own the new representations are not found to produce a significant performance improvement. Finally, we try combining classifiers based on different representations using a majority-voting technique. This step does produce some performance improvement on both test collections. In general, our work supports the emerging consensus in the information retrieval community that more sophisticated natural language processing techniques need to be developed before better text representations can be produced. We conclude that, for now, research into new learning algorithms and methods for combining existing learners holds the most promise.
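The majority-voting step mentioned at the end of this abstract is simple to sketch. The following is an illustrative combiner, not the paper's code; the tie-breaking behavior (first label seen wins, via `Counter` insertion order) is our assumption.

```python
# A minimal majority-vote combiner over per-representation classifiers.
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted label per base classifier."""
    (label, _), = Counter(predictions).most_common(1)
    return label

# Three classifiers, each trained on a different text representation:
print(majority_vote(["sports", "sports", "politics"]))  # -> 'sports'
```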
Improving binary classification on text problems using differential word features
2009
We describe an efficient technique to weight word-based features in binary classification tasks and show that it significantly improves classification accuracy on a range of problems. The most common text-classification approach uses a document's n-grams (words and short phrases) as its features and assigns feature values equal to their frequency or TF-IDF score relative to the training corpus.
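The baseline weighting this abstract refers to can be sketched as follows. This is an illustrative TF-IDF computation, not the paper's; the sklearn-style smoothed IDF formula is our choice, and real systems vary.

```python
# A sketch of the common baseline: feature value = term frequency times
# IDF relative to the training corpus (smoothed, sklearn-style).
import math
from collections import Counter

def tfidf(doc_tokens, corpus):
    n_docs = len(corpus)
    tf = Counter(doc_tokens)                         # term frequency in the doc
    df = Counter(t for d in corpus for t in set(d))  # document frequency
    return {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t in tf}

corpus = [["win", "match"], ["election", "vote"], ["win", "election"]]
print(tfidf(["win", "win", "vote"], corpus))
# 'win' appears twice but in many documents; 'vote' once but in few.
```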
Almost all machine learning problems require data preprocessing. This stage is especially important for problems where the datasets contain features of mixed types (i.e., nominal and numeric). A common practice in such cases is to transform each nominal feature into many dummy (i.e., binary) features. Many classification algorithms also prefer numeric attributes over nominal attributes, and sometimes the distance between different data points cannot be estimated if the attribute values are not numeric and normalized. One way to transform nominal features into numeric ones is the Weight of Evidence (WoE) technique. WoE has some properties that make it a very useful tool for attribute transformation, but unfortunately certain preconditions need to be met in order to calculate it. Additionally, WoE originally works only on supervised learning problems where the data is labeled with two classes. In this paper we propose a modified calculation of the We...
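For reference, the standard Weight of Evidence transform this abstract builds on can be sketched as below. It assumes binary labels, as the abstract states; the additive smoothing `eps` is our addition, included exactly because plain WoE is undefined when a category has zero examples of one class, which is one of the preconditions the abstract alludes to.

```python
# Standard WoE for one nominal feature: WoE(c) = ln(P(c|pos) / P(c|neg)).
# Assumption: binary labels (0/1); `eps` smoothing is ours, not the paper's.
import math
from collections import Counter

def woe(categories, labels, eps=0.5):
    """Map each category to a numeric WoE value usable as a feature."""
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return {c: math.log(((pos[c] + eps) / (n_pos + eps)) /
                        ((neg[c] + eps) / (n_neg + eps)))
            for c in set(categories)}

cats = ["red", "red", "blue", "blue", "blue"]
ys   = [1, 1, 0, 1, 0]
print(woe(cats, ys))  # numeric replacement values for 'red' and 'blue'
```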