Text Categorization Using Weight Adjusted K-Nearest Neighbor Classification
Related papers
Towards enriching the quality of k-nearest neighbor rule for document classification
International Journal of Machine Learning and Cybernetics, 2013
The k-nearest neighbor rule is a simple and effective classifier for document classification. In this method, a document is assigned to the class with the maximum representation among its k nearest neighbors in the training set. The k nearest neighbors of a test document are ranked by their content similarity with the documents in the training set. Document classification is very challenging due to the large number of attributes in the data set; because of data sparsity, many attributes provide no information about a particular document. Thus, assigning a document to a predefined class for a large value of k may not be accurate when the margin of the majority vote is one or when a tie occurs. This article tweaks the k-NN rule by putting a threshold on the majority vote, and proposes a discrimination criterion to prune the search space of the test document. The proposed classification rule enhances the confidence of the voting process and makes no prior assumption about the number of nearest neighbors. Experimental evaluation on various well-known text data sets shows that the accuracy of the proposed method is significantly better than that of the traditional k-NN method as well as several other document classification methods.
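The thresholded majority vote described above can be sketched as follows. This is an illustrative reading of the idea, not the paper's exact rule: the `margin` parameter and the `None` fallback for an abstained decision are assumptions introduced here.

```python
from collections import Counter

def knn_with_threshold(neighbors, k, margin=2):
    """Classify from the labels of the k nearest neighbors, but accept the
    majority vote only if it wins by at least `margin` votes; otherwise
    return None to signal an unreliable (tied or narrow) decision.
    `neighbors` is a list of (similarity, label) pairs."""
    top_k = sorted(neighbors, key=lambda p: p[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    ranked = votes.most_common()
    if len(ranked) == 1:          # unanimous vote among the k neighbors
        return ranked[0][0]
    (best, best_n), (_, second_n) = ranked[0], ranked[1]
    if best_n - second_n >= margin:
        return best
    return None                   # tie or margin of one: abstain
```

With `margin=1` this reduces to ordinary majority voting; raising the threshold trades coverage for confidence, which is the trade-off the abstract describes.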
Building a K-Nearest Neighbor Classifier for Text Categorization
2016
Text categorization is the process of assigning input texts (or documents) to one or more target categories based on their contents. This paper introduces an email classification application of text categorization using k-Nearest Neighbor (k-NN) classification [1]. In this work, text categorization involves two processes: a training process and a classification process. First, the training process uses a previously categorized set of documents to teach the system what each category looks like [1]. Second, the classifier uses the trained 'model' to classify new incoming documents. The k-Nearest Neighbor classification method makes use of training documents with known categories and finds the closest neighbors of a new sample document among them [2]. These neighbors determine the new document's category. Euclidean distance is used as the similarity function for measuring the difference or similarity between two instances [3]. Key words–Text categor...
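The two-stage scheme above (train on labeled documents, then vote among the Euclidean-nearest neighbors) can be sketched minimally. The sparse term-frequency representation is an assumption; the paper's actual feature extraction is not specified in the abstract.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance over sparse term-frequency dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0) - v.get(t, 0)) ** 2 for t in keys))

def knn_classify(train, doc, k=3):
    """train: list of (term-frequency dict, category) pairs from the
    training process; the category of `doc` is the majority label
    among its k nearest training documents."""
    nearest = sorted(train, key=lambda pair: euclidean(pair[0], doc))
    votes = Counter(cat for _, cat in nearest[:k])
    return votes.most_common(1)[0][0]
```

For example, an unseen email whose terms overlap mostly with previously labeled spam messages lands nearest to those and inherits the 'spam' label.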
A Hybrid Text Classification Approach Using KNN and SVM
Text classification is the process of assigning text documents to certain categories. A classifier defines the appropriate class for each text document based on the algorithm used for classification. Due to emerging trends in the fields of the internet and computing, billions of text items are processed at any given time, so these data must be organized for easy storage and access. Many text classification approaches have been developed to identify and classify such data effectively. In this project, a new text document classifier is proposed by integrating the nearest neighbor classification approach with the support vector machine (SVM) training algorithm. The proposed SVM-NN approach aims to reduce the impact of parameters on classification accuracy. In the training stage, the SVM is used to reduce the training samples of each of the available categories to their support vectors (SVs). The SVs from the different categories are then used as the training data of a nearest neighbor classification algorithm, in which a nearest-centroid distance function is used instead of the Euclidean function, which reduces time consumption.
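The nearest-centroid stage of the SVM-NN pipeline can be sketched as below. The sketch assumes the support vectors per class have already been extracted by an SVM in a prior training stage (that stage is not reproduced here), and it reads "nearest centroid distance" as the distance to the mean of each class's SVs.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def classify_by_sv_centroid(svs_by_class, doc):
    """svs_by_class maps each category to the support vectors an SVM
    retained for it (assumed precomputed). `doc` goes to the class
    whose SV centroid is closest -- one distance per class instead of
    one per training sample, which is where the time saving comes from."""
    best, best_d = None, float('inf')
    for cat, svs in svs_by_class.items():
        d = math.dist(doc, centroid(svs))
        if d < best_d:
            best, best_d = cat, d
    return best
```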
A Step towards the Improvement in the Performance of Text Classification
KSII Transactions on Internet and Information Systems, 2019
The performance of text classification is highly dependent on the feature selection method. Usually, two tasks are performed when a feature selection method is applied to construct a feature set: 1) assign a score to each feature and 2) select the top-N features. In existing filter-based feature selection methods, the selection of the top-N features is biased by their discriminative power and by the empirical process followed to determine the value of N. In order to improve text classification performance by producing a more illustrative feature set, we present an approach based on a potent representation learning technique, namely the Deep Belief Network (DBN). This algorithm learns a semantic representation of documents and uses feature vectors for their formulation. The number of nodes, iterations, and hidden layers are the main parameters of a DBN, and they can be tuned to improve the classifier's performance. The experimental results indicate the effectiveness of the proposed method at increasing classification performance and can aid developers in making effective decisions in certain domains.
Increasing Accuracy of K-Nearest Neighbor Classifier for Text Classification
International Journal of Computer Science and Informatics, 2014
k-Nearest Neighbor is a well-known technique for text classification, owing to its simplicity, effectiveness, and ease of modification. In this paper, we briefly discuss text classification and the k-NN algorithm, and analyse the sensitivity of the method to the value of k. To overcome this problem, we introduce an inverse cosine distance weighted voting function for text classification. As a result, the accuracy of text classification is increased even when a large value of k is chosen, compared to the simple k-Nearest Neighbor classifier. Experimental results show that the proposed weighting function is especially effective when an application has a large text dataset with some dominating categories.
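Inverse-distance weighting of the votes can be sketched as follows: each of the k neighbors votes with weight 1 / cosine distance, so distant neighbors contribute almost nothing even when k is large. The `eps` guard against division by zero is an implementation detail assumed here, not taken from the paper.

```python
import math
from collections import defaultdict

def cosine_sim(u, v):
    """Cosine similarity over sparse term-frequency dicts."""
    dot = sum(u.get(t, 0) * v.get(t, 0) for t in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_knn(train, doc, k=5, eps=1e-9):
    """Vote weight is the inverse of the cosine distance (1 - similarity),
    so the choice of k matters much less than in unweighted k-NN."""
    scored = sorted(((cosine_sim(vec, doc), cat) for vec, cat in train),
                    key=lambda p: p[0], reverse=True)
    votes = defaultdict(float)
    for sim, cat in scored[:k]:
        votes[cat] += 1.0 / (1.0 - sim + eps)
    return max(votes, key=votes.get)
```

In the dominating-category scenario the abstract mentions, a few very similar neighbors from a small class can outweigh many weakly similar neighbors from a large one.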
Evaluation of Neural Network Classification Systems on Document Stream
Document Analysis Systems
One major drawback of state-of-the-art Neural Network (NN)-based approaches for document classification is the large number of training samples required to obtain efficient classification: the minimum is around one thousand annotated documents per class. In many cases it is very difficult, if not impossible, to gather this number of samples in real industrial processes. In this paper, we analyse the efficiency of NN-based document classification systems in a sub-optimal training case, based on the situation of a company's document stream. We evaluated three different approaches, one based on image content and two on textual content. The evaluation was divided into four parts: a reference case, to assess the performance of the system in the lab; two cases that each simulate a specific difficulty linked to document stream processing; and a realistic case that combined all of these difficulties. The realistic case highlighted a significant drop in the efficiency of NN-based document classification systems. Although they remain efficient for well-represented classes (with over-fitting of the system to those classes), they cannot appropriately handle less well-represented classes. NN-based document classification systems need to be adapted to resolve these two problems before they can be considered for use in a company's document stream.
High-accuracy document classification with a new algorithm
Electronics Letters, 2018
A new algorithm is presented, based on a learning vector quantisation classifier with a modified proximity measure, which enforces a predetermined correct-classification level during training while using a sliding-mode approach for stable variation of the weight updates towards convergence. The proposed algorithm and some well-known counterparts are implemented using Python libraries and compared on a text classification task for document categorisation. Results reveal that the new classifier is a successful contender to those algorithms in terms of testing and training performance.
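For context, the basic LVQ1 update that the paper builds on can be sketched as below: the nearest prototype moves toward the sample if their labels match and away otherwise. The paper's sliding-mode modification and enforced-accuracy mechanism are not reproduced here; this is only the textbook rule it modifies.

```python
import math

def lvq1_update(prototypes, labels, x, y, lr=0.1):
    """One LVQ1 step: find the prototype nearest to sample x and move it
    toward x if its label matches y, away from x otherwise.
    `prototypes` is modified in place and also returned."""
    i = min(range(len(prototypes)),
            key=lambda j: math.dist(prototypes[j], x))
    sign = 1.0 if labels[i] == y else -1.0
    prototypes[i] = [w + sign * lr * (xk - w)
                     for w, xk in zip(prototypes[i], x)]
    return prototypes
```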
Classification of Text Documents
The exponential growth of the internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied to seven-class Yahoo news groups (business, entertainment, health, international, politics, sports and technology) individually and in combination. We studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Our experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on our data sets. Combinations of multiple classifiers did not always improve the classification accuracy compared to the best individual classifier. Among the three combination approaches, the adaptive classifier combination method introduced here performed the best. The best classification accuracy that we were able to achieve on this seven-class problem is approximately 83%, which is comparable to the performance of other similar studies. However, the classification problem considered here is more difficult, because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.
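Of the three combination schemes studied, the simplest (simple voting) can be sketched in a few lines; the dynamic-selection and adaptive-combination schemes are more involved and are not reproduced here.

```python
from collections import Counter

def simple_vote(predictions):
    """Combine base classifiers by majority vote.
    `predictions` holds one predicted label per classifier; ties are
    broken in favour of the label seen first (a convention assumed here)."""
    return Counter(predictions).most_common(1)[0][0]
```

For instance, if naive Bayes and the subspace method both predict 'sports' while a decision tree predicts 'politics', the combined decision is 'sports'.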
A Hierarchical K-NN Classifier for Textual Data
2002
This paper presents a classifier based on a modified version of the well-known K-Nearest Neighbors (K-NN) classifier. The original K-NN classifier was adjusted to work with category representatives rather than training documents. Each category is represented by a single document, constructed by consulting all of its training documents and then applying feature selection so that only important terms remain. Thus, when classifying a new document, it only needs to be compared with the category representatives, which are usually substantially fewer than the training documents. This modified K-NN was tested in a hierarchical setting, i.e. when the categories are organised as a hierarchy. A new document similarity measure is also proposed; it focuses on co-occurring (matching) terms between a document and a category when calculating similarity. This measure produces classification accuracy comparable to that obtained with the cosine, Jaccard or Dice similarity measures, yet requires much less time. The TechTC-100 hierarchical dataset was used to evaluate the proposed classifier.
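The representative-based classification step can be sketched as follows. The matching-terms similarity below (summing the shared mass of co-occurring terms) is an assumed reading of the proposed measure, and the representatives are taken as given; the paper's feature-selection step for building them is not reproduced.

```python
def matching_terms_sim(doc, rep):
    """Similarity counted only over terms that occur in both the document
    and the category representative (an assumed reading of the paper's
    co-occurring-terms measure)."""
    return sum(min(doc[t], rep[t]) for t in doc.keys() & rep.keys())

def classify_by_representative(reps, doc):
    """reps: {category: representative term-frequency dict}, one per class.
    Only len(reps) comparisons are needed, instead of one per training
    document as in standard K-NN -- the source of the claimed speed-up."""
    return max(reps, key=lambda cat: matching_terms_sim(doc, reps[cat]))
```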
A Robust Hybrid Approach for Textual Document Classification
2019 International Conference on Document Analysis and Recognition (ICDAR)
Text document classification is an important task for diverse natural language processing applications. Traditional machine learning approaches mainly focused on reducing the dimensionality of textual data to perform classification. Although this improved overall classification accuracy, the classifiers still faced a sparsity problem due to the lack of better data representation techniques. Deep learning based text document classification, on the other hand, benefitted greatly from the invention of word embeddings, which solved the sparsity problem, and researchers' focus mainly remained on the development of deep architectures. Deeper architectures, however, learn some redundant features that limit the performance of deep learning based solutions. In this paper, we propose a two-stage text document classification methodology which combines traditional feature engineering with automatic feature engineering (using deep learning). The proposed methodology comprises a filter-based feature selection (FSE) algorithm followed by a deep convolutional neural network. The methodology is evaluated on the two most commonly used public datasets, i.e. the 20 Newsgroups data and the BBC news data. Evaluation results reveal that the proposed methodology outperforms the state of the art of both (traditional) machine learning and deep learning based text document classification methodologies, with a significant margin of 7.7% on 20 Newsgroups and 6.6% on BBC news.
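The first stage of such a pipeline, filter-based feature selection, can be sketched with a standard chi-square filter; the abstract does not name the paper's exact FSE scoring, so chi-square stands in here as a common filter criterion, and the CNN stage is not reproduced.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 contingency table:
    n11 = docs in class containing the term, n10 = in class without it,
    n01 = outside class with the term, n00 = neither."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

def select_top_terms(docs, labels, target, top_n=100):
    """Rank every vocabulary term by its chi-square dependence with the
    `target` class and keep the top_n. `docs` is a list of term sets;
    this is a generic filter stage, not the paper's exact FSE algorithm."""
    vocab = {t for d in docs for t in d}
    scores = {}
    for t in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if t in d and y == target)
        n10 = sum(1 for d, y in zip(docs, labels) if t not in d and y == target)
        n01 = sum(1 for d, y in zip(docs, labels) if t in d and y != target)
        n00 = sum(1 for d, y in zip(docs, labels) if t not in d and y != target)
        scores[t] = chi_square(n11, n10, n01, n00)
    return sorted(vocab, key=lambda t: scores[t], reverse=True)[:top_n]
```

Note that chi-square is symmetric: a term strongly anti-correlated with the class scores as high as a strongly correlated one, which is usually desirable for a discriminative feature set.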