Binning: Converting numerical classification into text classification
Consider a supervised learning problem in which examples contain both numerical- and text-valued features. One common approach to this problem is to treat the presence or absence of a word as a Boolean feature, which, when combined with the other numerical features, enables the application of a range of traditional feature-vector-based learning methods. This paper presents an alternative approach, in which numerical features are converted into "bag of words" features, enabling instead the use of a range of existing text-classification methods. Our approach creates a set of bins for each feature into which its observed values can fall. Two tokens are defined for each bin endpoint, indicating on which side of that endpoint a feature's value lies. A numerical feature is then assigned the bag of tokens appropriate for its value. Not only does this approach make it possible to apply text-classification methods to problems involving both numerical and text-valued features; even problems that contain solely numerical features can be converted to this representation so that text-classification methods can be applied. We therefore evaluate our approach both on a range of real-world datasets taken from the UCI Repository that involve solely numerical features and on additional datasets that contain both numerical- and text-valued features. Our results show that the performance of the text-classification methods using the binning representation often meets or exceeds that of traditional supervised learning methods (C4.5, k-NN, NBC, and Ripper), even on existing numerical-feature-only datasets from the UCI Repository, suggesting that text-classification methods, coupled with binning, can serve as a credible learning approach for traditional supervised learning problems.
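The binning scheme the abstract describes can be made concrete with a short sketch. This is a minimal illustration rather than the paper's implementation: it assumes equal-width bins (the abstract does not commit to a particular discretization), and the token format and the function names `bin_endpoints` and `endpoint_tokens` are ours.

```python
# A minimal sketch of the binning-to-tokens idea. Assumption: equal-width
# bins; the paper may use a different discretization scheme.

def bin_endpoints(values, n_bins=5):
    """Equal-width interior bin endpoints over one feature's observed range."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]

def endpoint_tokens(name, value, endpoints):
    """Two possible tokens per endpoint; emit the one matching the side
    of the endpoint on which `value` falls."""
    return [f"{name}_le_{i}" if value <= e else f"{name}_gt_{i}"
            for i, e in enumerate(endpoints)]

# Example: convert one numeric feature value into a "bag of words".
ages = [23, 31, 45, 52, 67]
eps = bin_endpoints(ages)                # endpoints fit on training data
print(endpoint_tokens("age", 45, eps))
# -> ['age_gt_0', 'age_gt_1', 'age_le_2', 'age_le_3']
```

Because nearby values fall on the same side of most endpoints, they share most of their tokens, which is what lets a text classifier generalize across similar numbers.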
Related papers
Converting numerical classification into text classification
Artificial Intelligence, 2003
Using text classifiers for numerical classification
2001
Consider a supervised learning problem in which examples contain both numerical- and text-valued features. To use traditional feature-vector-based learning methods, one could treat the presence or absence of a word as a Boolean feature and use these binary-valued features together with the numerical features. However, applying a text-classification system to such data is more problematic: in the most straightforward approach, each number would be considered a distinct token and treated as a word. This paper presents an alternative approach to using text-classification methods for supervised learning problems with numerical-valued features, in which the numerical features are converted into bag-of-words features, thereby making them directly usable by text-classification methods. We show that even on purely numerical-valued data, the results of text classification on the derived text-like representation outperform the more naive numbers-as-tokens representation and, more importantly, are competitive with mature numerical classification methods such as C4.5 and Ripper.
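The "numbers as tokens" baseline this abstract contrasts against is easy to sketch; the token format below is our illustrative assumption, not the paper's.

```python
# A tiny sketch of the naive numbers-as-tokens representation.
def numbers_as_tokens(name, value):
    # Each distinct number becomes its own "word", so 45.0 and 45.1 share
    # no features and the classifier cannot generalize between them.
    return [f"{name}={value}"]

print(numbers_as_tokens("age", 45.0))  # ['age=45.0']
print(numbers_as_tokens("age", 45.1))  # ['age=45.1']  (no overlap at all)
```

The binned representation sketched earlier avoids exactly this failure mode, since nearby values share most of their endpoint tokens.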
Feature engineering for text classification
1999
Most research in text classification has used the "bag of words" representation of text. This paper examines some alternative ways to represent text based on syntactic and semantic relationships between words (phrases, synonyms, and hypernyms). We describe the new representations and explain why we expected them to improve the performance of a rule-based learner. The representations are evaluated using the RIPPER rule-based learner on the Reuters-21578 and DigiTrad test corpora, but on their own the new representations are not found to produce a significant performance improvement. Finally, we try combining classifiers based on different representations using a majority-voting technique. This step does produce some performance improvement on both test collections. In general, our work supports the emerging consensus in the information retrieval community that more sophisticated natural language processing techniques need to be developed before better text representations can be produced. We conclude that, for now, research into new learning algorithms and methods for combining existing learners holds the most promise.
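The majority-voting step mentioned at the end of this abstract is simple to sketch. The following is an illustrative combiner, not the paper's code; the tie-breaking behavior (first label seen wins, via `Counter` insertion order) is our assumption.

```python
# A minimal majority-vote combiner over per-representation classifiers.
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted label per base classifier."""
    (label, _), = Counter(predictions).most_common(1)
    return label

# Three classifiers, each trained on a different text representation:
print(majority_vote(["sports", "sports", "politics"]))  # -> 'sports'
```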
Improving binary classification on text problems using differential word features
2009
We describe an efficient technique to weight word-based features in binary classification tasks and show that it significantly improves classification accuracy on a range of problems. The most common text-classification approach uses a document's n-grams (words and short phrases) as its features and assigns feature values equal to their frequency or TF-IDF score relative to the training corpus.
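The baseline weighting this abstract refers to can be sketched as follows. This is an illustrative TF-IDF computation, not the paper's; the sklearn-style smoothed IDF formula is our choice, and real systems vary.

```python
# A sketch of the common baseline: feature value = term frequency times
# IDF relative to the training corpus (smoothed, sklearn-style).
import math
from collections import Counter

def tfidf(doc_tokens, corpus):
    n_docs = len(corpus)
    tf = Counter(doc_tokens)                         # term frequency in the doc
    df = Counter(t for d in corpus for t in set(d))  # document frequency
    return {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t in tf}

corpus = [["win", "match"], ["election", "vote"], ["win", "election"]]
print(tfidf(["win", "win", "vote"], corpus))
# 'win' appears twice but in many documents; 'vote' once but in few.
```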
Almost all machine learning problems require data preprocessing. This stage is especially important for problems where the datasets contain features of mixed types (i.e., nominal and numeric). A common practice in such cases is to transform each nominal feature into many dummy (i.e., binary) features. Many classification algorithms also prefer numeric attributes over nominal attributes, and sometimes the distance between different data points cannot be estimated if the attribute values are not numeric and normalized. One way to transform nominal features into numeric ones is the Weight of Evidence (WoE) technique. WoE has some properties that make it a very useful tool for attribute transformation, but unfortunately certain preconditions need to be met in order to calculate it. Additionally, WoE originally works only on supervised learning problems where the data is labeled with two classes. In this paper we propose a modified calculation of the We...
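For reference, the standard Weight of Evidence transform this abstract builds on can be sketched as below. It assumes binary labels, as the abstract states; the additive smoothing `eps` is our addition, included exactly because plain WoE is undefined when a category has zero examples of one class, which is one of the preconditions the abstract alludes to.

```python
# Standard WoE for one nominal feature: WoE(c) = ln(P(c|pos) / P(c|neg)).
# Assumption: binary labels (0/1); `eps` smoothing is ours, not the paper's.
import math
from collections import Counter

def woe(categories, labels, eps=0.5):
    """Map each category to a numeric WoE value usable as a feature."""
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return {c: math.log(((pos[c] + eps) / (n_pos + eps)) /
                        ((neg[c] + eps) / (n_neg + eps)))
            for c in set(categories)}

cats = ["red", "red", "blue", "blue", "blue"]
ys   = [1, 1, 0, 1, 0]
print(woe(cats, ys))  # numeric replacement values for 'red' and 'blue'
```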