A Web Application for Text Document Classification Based on K-Nearest Neighbor Algorithm
Related papers
Building a K-Nearest Neighbor Classifier for Text Categorization
2016
Text categorization is the process of assigning input texts (or documents) to one or more target categories based on their contents. This paper introduces an email classification application of text categorization using k-Nearest Neighbor (k-NN) classification [1]. In this work, text categorization involves two processes: a training process and a classification process. First, the training process uses a previously categorized set of documents to train the system to understand what each category looks like [1]. Second, the classifier uses the training 'model' to classify new incoming documents. The k-Nearest Neighbor classification method makes use of training documents, which have known categories, and finds the closest neighbors of the new sample document among them [2]. These neighbors are used to determine the new document's category. The Euclidean distance is used as a similarity function for measuring the difference or similarity between two instances [3]. Key words–Text categor...
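The two processes described in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy emails, the term-count representation, the function names and the choice of k = 3 are all assumptions made for illustration. "Training" here simply stores labeled vectors, and classification takes a majority vote among the k training documents with the smallest Euclidean distance to the query.

```python
import math
from collections import Counter

def vectorize(text, vocabulary):
    """Map a text to a term-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k=3):
    """Majority vote among the k training vectors closest to the query.

    training is a list of (vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set (not taken from the paper).
train_texts = [
    ("win cash prize now", "spam"),
    ("cheap prize offer now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]

# "Training": build the vocabulary and store the labeled vectors.
vocabulary = sorted({word for text, _ in train_texts for word in text.split()})
training = [(vectorize(text, vocabulary), label) for text, label in train_texts]

# Classification of a new document.
prediction = knn_classify(vectorize("cash prize offer", vocabulary), training, k=3)
print(prediction)
```

Because k-NN is instance-based, all of the distance computation happens at classification time, so the cost of classifying one document grows with the size of the training set.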
A Query based Text Categorization using K-Nearest Neighbor Approach
research.ijcaonline.org
World Wide Web is the store house of abundant information available in various electronic forms. In the past two decades, the increase in the performance of computers in handling large quantity of text data led researchers to focus on reliable and optimal ...
Semantic Text Categorization using the K Nearest Neighbours Method
2003
In this paper we investigate the influence of semantics in the text categorization. Moreover, we study how the vocabulary size reduction affects this task. The K Nearest Neighbours method was used to perform the categorization. In order to reduce the vocabulary size, the Information Gain technique was employed. A number of different document codification alternatives were tested. The experimental results for the best codifications were obtained taking into account the relevant terms and their synonyms for each text of the 20 Newsgroups and WebKB corpora. For the 20 Newsgroups corpus, the introduction of semantics allowed for an improvement of the performance.
Classification of Text Documents
The exponential growth of the internet has led to a great deal of interest in developing useful and efficient tools and software to assist users in searching the Web. Document retrieval, categorization, routing and filtering can all be formulated as classification problems. However, the complexity of natural languages and the extremely high dimensionality of the feature space of documents have made this classification problem very difficult. We investigate four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied to seven-class Yahoo news groups (business, entertainment, health, international, politics, sports and technology) individually and in combination. We studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Our experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on our data sets. Combinations of multiple classifiers did not always improve the classification accuracy compared to the best individual classifier. Among the three different combination approaches, our adaptive classifier combination method introduced here performed the best. The best classification accuracy that we are able to achieve on this seven-class problem is approximately 83%, which is comparable to the performance of other similar studies. However, the classification problem considered here is more difficult because the pattern classes used in our experiments have a large overlap of words in their corresponding documents.
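Of the three combination approaches named in this abstract, simple voting is the easiest to sketch. The snippet below is an illustrative assumption rather than the authors' code: each classifier contributes one predicted label, and the most frequent label wins. The tie-breaking rule (the label seen first among those tied) is a side effect of `Counter.most_common` and is not taken from the paper.

```python
from collections import Counter

def simple_vote(predictions):
    """Return the label predicted by the largest number of classifiers.

    predictions holds one label per classifier. Among tied labels,
    the one that appears first in the list is returned.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of four classifiers for a single document.
per_classifier = ["sports", "politics", "sports", "health"]
combined = simple_vote(per_classifier)
print(combined)
```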
Document Classification : A Review
2018
As most information on the web is stored as text, text document classification is considered to have high commercial value. Text classification is the task of assigning documents to predefined categories. The complexity of natural languages and the very high dimensionality of the feature space of documents have made this classification problem difficult. In this paper we give an introduction to text classification, describe the text classification process, give an overview of the classifiers, and compare some existing classifiers on the basis of a few criteria such as time principle, merits and demerits.
Online Text Categorization System Using Support Vector Machine
SURJ, 2018
Text classification is a present-day need, as large volumes of text exist in the form of stories, news, etc. Several techniques have been developed for this purpose, such as Support Vector Machines, Neural Networks and Decision Trees. Stories and newspapers are page collections that fall within the scope of text categorization. Various Sindhi newspapers are published regularly, and Daily Kawish is one of them. Readers face difficulties because there is no option that categorizes news items as sports, technology, crime, fashion or current affairs. For this purpose, a Text Categorization System (TCS) for the Sindhi language is presented in this paper. Five classes are used, and each scanned newspaper page is assigned to a single class. Because it is difficult to predict how many users will read the newspaper simultaneously, web performance is also tested. For the classification of text from pages, precision, recall and F-measure are used as evaluation measures, and the system achieved 67% accuracy in classifying text from newspaper pages. The system would benefit readers who want to save time when reading newspapers.
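The evaluation measures named in this abstract have standard definitions, which can be sketched as follows. The labels and predictions are invented for illustration and are not results from the paper. For a target class, precision is TP / (TP + FP), recall is TP / (TP + FN), and the F-measure is the harmonic mean of the two.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Precision, recall and F-measure for one target class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical page labels (not data from the paper).
true_labels = ["sports", "sports", "crime", "fashion", "crime", "sports"]
predicted = ["sports", "crime", "crime", "fashion", "sports", "sports"]

precision, recall, f1 = precision_recall_f1(true_labels, predicted, "sports")
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In a five-class setting such as the one described, these per-class values would typically be averaged across classes to give a single summary figure.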
COMPARATIVE STUDY OF CLASSIFICATION ALGORITHM FOR TEXT BASED CATEGORIZATION
Text categorization is a process in data mining that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique. It provides conceptual views of the collected documents and has important applications in the real world. Text-based categorization is used for document classification with pattern recognition and machine learning. The advantages of a number of classification algorithms for classifying documents are studied in this paper. Examples of these algorithms include Naive Bayes, K-Nearest Neighbor and Decision Tree. This paper presents a comparative study of the advantages and disadvantages of the above-mentioned classification algorithms.
Text Categorization Using Weight Adjusted K-Nearest Neighbor Classification
… in Knowledge Discovery and Data Mining, 2001
Text categorization is the task of deciding whether a document belongs to a set of prespecified classes of documents. Automatic classification schemes can greatly facilitate the process of categorization. Categorization of documents is challenging, as the number of discriminating words can be very large. Many existing algorithms simply would not work with this many features. k-nearest neighbor (k-NN) classification is an instance-based learning algorithm that has been shown to be very effective for a variety of problem domains, including documents. The key element of this scheme is the availability of a similarity measure that is capable of identifying neighbors of a particular document. A major drawback of the similarity measure used in k-NN is that it uses all features in computing distances. In many document data sets, only a small number of the words in the total vocabulary may be useful in categorizing documents. A possible approach to overcome this problem is to learn weights for different features (or words in document data sets). In this paper, we propose the Weight Adjusted k-Nearest Neighbor (WAKNN) classification algorithm, which is based on the k-NN classification paradigm. In WAKNN, the weights of features are learned using an iterative algorithm. In the weight adjustment step, the weight of each feature is perturbed in small steps to see if the change improves the classification objective function. The feature with the most improvement in the objective function is identified and the corresponding weight is updated. The feature weights are used in the similarity measure computation such that important features contribute more to the similarity measure. Experiments on several real-life document data sets show the promise of WAKNN, as it outperforms state-of-the-art classification algorithms such as C4.5, RIPPER, Rainbow, PEBLS, and VSM.
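The idea that feature weights enter the similarity computation can be illustrated with a weighted cosine similarity. This is only a sketch: the abstract does not give the exact WAKNN formula or its objective function, so the function below and the example vectors are assumptions, and the iterative weight-learning step is not shown.

```python
import math

def weighted_cosine(x, y, weights):
    """Cosine similarity in which each feature is scaled by a weight.

    A weight of 0 removes a feature from the comparison; larger
    weights make a feature contribute more to the similarity.
    """
    dot = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x = math.sqrt(sum(w * a * a for w, a in zip(weights, x)))
    norm_y = math.sqrt(sum(w * b * b for w, b in zip(weights, y)))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical term-count vectors over a three-word vocabulary.
doc_a = [1, 0, 1]
doc_b = [1, 1, 0]

# With equal weights every word counts equally.
uniform = weighted_cosine(doc_a, doc_b, [1, 1, 1])
# Keeping only the first word makes the two documents look identical.
focused = weighted_cosine(doc_a, doc_b, [1, 0, 0])
print(uniform, focused)
```

Changing the weights changes which documents count as nearest neighbors, which is why learned weights can alter the classification without changing the k-NN voting step itself.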
A Novel Web Page Classification Model using an Improved k Nearest Neighbor Algorithm
full potential, automatic classification of web pages into web directories has become more significant. These web directories help search engines provide users with relevant and quick retrieval results. In this paper, a novel approach to web page classification is implemented by combining the k nearest neighbor (kNN) classifier and an association rule mining algorithm. The web pages are preprocessed and discretized before inducing the classifier. The proposed method for web page classification uses a) a feature weighting scheme based on association rules and b) a distance-weighted voting scheme. This distance-weighted voting scheme enables the model to work for any value of k, whether odd or even. Experiments on a benchmark data set, namely WebKB, have shown that the web page classification accuracy of the proposed method is significantly better than that of many existing web page classification methods.
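A distance-weighted voting scheme of the kind mentioned in this abstract can be sketched as follows. The inverse-distance weight 1 / (d + epsilon) is a common choice assumed here for illustration, since the abstract does not state the paper's exact weighting function. The distances and category names are hypothetical.

```python
from collections import defaultdict

def distance_weighted_vote(neighbors, epsilon=1e-9):
    """Pick the label with the largest sum of inverse-distance weights.

    neighbors is a list of (distance, label) pairs for the k nearest
    training documents. epsilon avoids division by zero when a
    neighbor is identical to the query.
    """
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance + epsilon)
    return max(scores, key=scores.get)

# k = 4 neighbors with a 2-2 split, which plain majority voting cannot resolve.
neighbors = [(0.5, "course"), (0.9, "faculty"), (1.0, "faculty"), (0.4, "course")]
winner = distance_weighted_vote(neighbors)
print(winner)
```

Because closer neighbors carry more weight, an even split in label counts rarely produces an exact tie in scores, which is consistent with the claim that the scheme works for both odd and even k.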
An Overview of E-Documents Classification
With the increasing availability of electronic documents and the rapid growth of the World Wide Web, automatic categorization of documents has become a key method for organizing information and for knowledge and trend detection. With the growing availability and popularity of fast, rich online resources, the classification of e-documents, news and personal blogs, and the extraction of knowledge and trends from documents, have become an interesting area for research, as the World Wide Web is the fastest medium for collecting news and events from around the world. The growing volume of textual data therefore calls for text mining, machine learning and natural language processing techniques and methodologies to organize documents and extract patterns and knowledge from them. This overview focuses on the existing literature and explores the main techniques and methods for automatic document classification, i.e. document representation, classifier construction and knowledge extraction, and also discusses the open issues along with approaches and opportunities.