A novel text classification problem and its solution

Classification of Text Documents

The exponential growth of the internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied to seven-class Yahoo news groups (business, entertainment, health, international, politics, sports and technology) individually and in combination. We studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Our experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on our data sets. Combinations of multiple classifiers did not always improve the classification accuracy compared to the best individual classifier. Among the three different combination approaches, our adaptive classifier combination method introduced here performed the best. The best classification accuracy that we are able to achieve on this seven-class problem is approximately 83%, which is comparable to the performance of other similar studies. However, the classification problem considered here is more difficult because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.
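Of the three combination approaches named in this abstract, simple voting is the easiest to illustrate. The following is a minimal sketch, not the authors' implementation: the function name `simple_vote` and the tie-breaking rule (the earliest classifier among the tied labels wins) are assumptions made for illustration.

```python
from collections import Counter


def simple_vote(predictions):
    """Combine per-classifier labels for one document by majority vote.

    `predictions` is a list of labels, one per classifier.
    Ties are broken in favour of the label that appears first in the
    list (an assumption; the abstract does not specify a tie rule).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of four classifiers for a single news document.
votes = ["sports", "politics", "sports", "business"]
combined = simple_vote(votes)
print(combined)
```

A scheme like this can only help when the individual classifiers make different mistakes, which is consistent with the abstract's remark that combination did not always beat the best single classifier.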

A New Two-Stage Approach to the Multiaspect Text Categorization

2015 IEEE Symposium Series on Computational Intelligence, 2015

We consider a particular type of text categorization problem which we refer to as multiaspect classification. It is inspired by a practical scenario of business document management in a company but has broader application potential. A distinguishing feature of the new problem is the existence of two classification schemes. The first is based on a traditional, static set of text categories, possibly arranged into a hierarchy. The second is based on a dynamic structure of sequences of documents, referred to as cases, identified within each category. While the former problem may be addressed using one of the well-known techniques of text categorization (classification), the latter seems to require distinct approaches, because the set of cases is unknown in advance and because only a limited number of training documents is assumed to be available if a case is interpreted as a classic category. In the paper, we discuss the problem in more detail and show the applicability of an intuitively appealing two-stage approach to solving such a multiaspect text categorization problem.

Text categorization with hierarchical category structure

In this paper we describe an adaptive text categorization algorithm that is able to learn hierarchical category structures. This work was initiated by a knowledge engineering problem. We developed a tool that provides domain engineers with a facility to create fuzzy relational thesauri (FRT) describing subject domains. Fuzzy thesauri can describe a specific domain by hierarchically organized significant words (concepts or instances) and their relationships. Creating such structures is usually quite costly in terms of time and hence money. To speed up this procedure we aimed at providing keywords for each node of a hierarchy. These keywords, collected from categorized documents, are very useful as candidates for expanding the FRT, because domain engineers can shorten their search time significantly when prospective candidates are offered.

Sketching a “low-cost” text-classification technique for text topics in English

Ibérica, 2014

The aim of this paper is to sketch a potential methodology for automatic text classification which allows text topic discrimination as a prior step to assigning new cases to previously established text topics. Such case assignment will be performed by means of Discriminant Function Analysis based on a series of easily computable linguistic parameters, in order to reduce computational costs. (The source also gives this abstract in Spanish under the title "Esbozo de una técnica de clasificación de "bajo coste" según temáticas para textos en lengua inglesa".) Keywords: automatic text classification, discriminant analysis, classification functions, English language, text categories.

Document Classification: A Review

2018

As most information on the web is stored as text, text document classification is considered to have high commercial value. Text classification is the task of assigning documents to predefined categories. The complexity of natural languages and the very high dimensionality of the feature space of documents have made this classification problem difficult. In this paper we give an introduction to text classification, describe the text classification process, give an overview of the classifiers, and compare some existing classifiers on the basis of a few criteria such as time principle, merits and demerits.

Text Document Classification: An Approach Based on Indexing

International Journal of Data Mining & Knowledge Management Process, 2012

In this paper we propose a new method of classifying text documents. Unlike conventional vector space models, the proposed method preserves the sequence of term occurrence in a document. The term sequence is effectively preserved with the help of a novel data structure called the 'Status Matrix'. A corresponding classification technique is also proposed for efficient classification of text documents. In addition, in order to avoid sequential matching during classification, we propose to index the terms in a B-tree, an efficient index scheme. Each term in the B-tree is associated with a list of class labels of those documents which contain the term. To corroborate the efficacy of the proposed representation and status matrix based classification, we have conducted extensive experiments on various datasets.
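The term index described in this abstract, where each term is associated with the class labels of the documents containing it, can be sketched as follows. This is an illustration under stated assumptions only: a Python `dict` stands in for the B-tree, the Status Matrix is not modelled, and the label-voting rule in `classify` is a hypothetical stand-in rather than the paper's classification technique.

```python
from collections import Counter, defaultdict


def build_term_index(labelled_docs):
    """Map each term to the set of class labels of documents containing it.

    A dict is used in place of the B-tree mentioned in the abstract
    (assumption made to keep the sketch short).
    """
    index = defaultdict(set)
    for text, label in labelled_docs:
        for term in text.lower().split():
            index[term].add(label)
    return index


def classify(index, text):
    """Hypothetical rule: each query term votes for every label it is indexed under."""
    votes = Counter()
    for term in text.lower().split():
        for label in index.get(term, ()):
            votes[label] += 1
    if not votes:
        return None  # no known term in the query
    # Highest vote count wins; ties are broken alphabetically for determinism.
    return min(votes, key=lambda lab: (-votes[lab], lab))


# Toy training data (invented for illustration).
training = [
    ("stock market shares fall", "business"),
    ("team wins the match", "sports"),
    ("market shares of the team sponsor", "business"),
]
index = build_term_index(training)
print(classify(index, "shares fall on the market"))
```

Looking up each query term directly in the index is what avoids the sequential matching against every training document that the abstract mentions.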

Techniques for text classification: Literature review and current trends

Webology, 2015

Automated classification of text into predefined categories has always been considered a vital method for managing and processing the vast and continuously increasing number of documents in digital form. This kind of web information, popularly known as digital or electronic information, takes the form of documents, conference material, publications, journals, editorials, web pages, e-mail, etc. People largely access information from these online sources rather than being limited to archaic paper sources like books, magazines, newspapers, etc. But the main problem is that this enormous body of information lacks organization, which makes it difficult to manage. Text classification is recognized as one of the key techniques used for organizing this kind of digital data. In this paper we have studied the existing work in the area of text classification, which will allow us to have a fair evaluation of the progress made in this field to date. We have investigated the papers to the ...

Machine learning in automated text categorization

ACM Computing Surveys, 2002

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
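The inductive process described in this abstract, building a classifier from a set of preclassified documents, can be illustrated with a small multinomial naive Bayes learner. This is a sketch under assumptions, not a method prescribed by the survey: naive Bayes is just one of many learners, and the whitespace tokenizer, add-one smoothing and the names `train` and `predict` are chosen here for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Learn class and term counts from (text, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for term in text.lower().split():
            term_counts[label][term] += 1
            vocab.add(term)
    return class_counts, term_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under naive Bayes."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):  # sorted for deterministic ties
        score = math.log(class_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for term in text.lower().split():
            if term not in vocab:
                continue  # ignore terms never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (term_counts[label][term] + 1) / (total_terms + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy preclassified documents (invented for illustration).
docs = [
    ("goal match team", "sports"),
    ("match score team", "sports"),
    ("stock market price", "business"),
    ("market shares price", "business"),
]
model = train(docs)
print(predict(model, "team match"))
```

The three problems the survey singles out map onto this sketch directly: the bag-of-words counts are the document representation, `train` is classifier construction, and evaluation would compare `predict` against held-out labels.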

A Survey on Machine Learning Based Text Categorization

2018

As the number of documents available in digital form becomes enormous, the need to access them in a more flexible way becomes extremely important. In this context, content-based document management tasks are called IR, or Information Retrieval. This has achieved a noticeable position in the area of information systems. For a faster IR response time, it is essential to organize, categorize and classify texts and digital documents according to the definitions proposed by text mining experts and computer scientists. Automatic text categorization, or topic spotting, is the process of automatically sorting a document set into categories from a predefined set. According to researchers, the superior approach to this problem depends on machine learning methods in which a general posteriori process builds a classifier automatically by learning from the given pre-classified documents and the categories' characteristics. The acceptance of automatic text categorization is done because it is fre...

Machine Learning for Text Categorization

Machine Learning for Text Categorization, 2006

A long-standing goal for the field of Artificial Intelligence (AI) is to enable computer understanding of human languages. Much progress has been made towards this goal, but much also remains to be done. Concept maps are considered by some educational psychologists to be a very important tool for improving learning. Moreover, with the rapid spread of the internet and the increase of online information, technology for automatically classifying huge amounts of diverse text information has come to play a very important role. This led to the use of the machine learning approach, which is a method of creating classifiers automatically from text data labelled with categories. This dissertation presents research in the field of AI by studying machine learning for natural language understanding. One important part of the process of understanding a text consists in apprehending its underlying interrelations of concepts. The proposed system aims to extract concepts from natural-language text by constructing two models in the following steps: 1. Extract concepts and the relations between them. 2. Classify sets of documents, some of which are written in different domains. 3. In spite of the complexity of natural languages, the proposed system, with the assistance of the user, offers an interactive interface for structured queries and completes the concept relations before extracting the desired information from one or many documents in a specific domain using Inductive Logic Programming (ILP). Our examples focus on text written in English. The extracted data are particularly useful for obtaining a structured database from unstructured documents and for preparing information for database entries. This dissertation discusses an efficient algorithm for constructing a model that builds a classifier from preclassified documents and searches for the characteristics of terms and categories in order to classify a corpus or a set of documents.
