Learning Phrase Patterns for Text Classification

This paper introduces methods to discriminatively learn phrase patterns for use as features in text classification. An efficient solution is described using a recursive algorithm with a mutual information selection criterion. The algorithm automatically determines when word classes are useful in specific locations of a phrase pattern, allowing for variable specificity depending on the amount of labeled data available. Experiments are carried out on three text classification tasks in both English and Chinese, resulting in improved performance when adding the phrase patterns to the existing n-gram features.

Index Terms: text classification, natural language processing, phrase pattern, mutual information.

I. INTRODUCTION

Text classification is an important natural language processing application. There has been a great amount of work in this area, especially since the growth in the variety of text types available on the web. Previous text classification research includes topic categorization, genre classification, reading level detection, role classification, and sentiment analysis, to name a few.

A typical text classification system is composed of a feature extractor and a classifier. Much of the previous research has focused on using more advanced classifiers, with feature extraction mainly based on the use of words and n-grams. While topic classification achieves high accuracy using words as features [1], this may be a reflection of the topic classification problem and not of text classification more generally. Classification performance is