Comparative Study for Text Document Classification Using Different Machine Learning Algorithms

Document Classification using Various Classification Algorithms: A Survey

Text classification is used to group documents of similar types. It is a supervised learning technique in which documents are automatically sorted into different classes drawn from a predefined set. The main issue is that large volumes of information lack organization, which makes them difficult to manage. Text classification is identified as one of the key methods for handling such digital information. It has various applications, such as information retrieval, natural language processing, automatic indexing, text filtering, image processing, etc. It is also used to process big data and to predict class labels for newly added data, and it is used in academia and industry to classify unstructured data. There are various text classification approaches, such as decision tree, SVM, Naïve Bayes etc. In this survey paper, we analyse these techniques. Each has its own set of advantages, which makes them suitable for almost all classification jobs. We also analyse the evaluation parameters used in various research works, such as F-measure, G-measure and accuracy.
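As a rough illustration of the evaluation parameters named above, the following sketch computes accuracy, F-measure and G-measure from binary confusion counts. The function name and the example counts are hypothetical, and the G-measure is assumed here to be the geometric mean of precision and recall; some works instead use the geometric mean of sensitivity and specificity, and the abstract does not say which definition is meant.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Compute common evaluation metrics from binary confusion counts.

    tp, fp, fn, tn are the numbers of true positives, false positives,
    false negatives and true negatives. Undefined ratios fall back to 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F-measure (F1): harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    # G-measure: ASSUMED to be the geometric mean of precision and recall.
    g_measure = math.sqrt(precision * recall)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "g_measure": g_measure,
    }


# Made-up counts, for illustration only.
example = classification_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can be misleading when classes are imbalanced, which is one reason surveys often report F-measure or G-measure alongside it.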

COMPARATIVE STUDY OF CLASSIFICATION ALGORITHM FOR TEXT BASED CATEGORIZATION

Text categorization is a data mining process that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique. It provides conceptual views of the collected documents and has important applications in the real world. Text-based categorization is used for document classification together with pattern recognition and machine learning. This paper studies the advantages of a number of classification algorithms for classifying documents, for example Naive Bayes, K-Nearest Neighbor and Decision Tree. It presents a comparative study of the advantages and disadvantages of these classification algorithms.
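To make the idea of assigning predefined categories to free text concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is not taken from the paper: the class name, the whitespace tokenizer and the tiny training set are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Minimal multinomial Naive Bayes for text (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is add-one smoothing

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = text.lower().split()  # naive whitespace tokenizer
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            # Log prior: share of training documents in this class.
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Smoothed log likelihood; unseen words get a small
                # non-zero probability instead of zeroing the product.
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, for illustration only.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict("buy cheap pills"))
print(model.predict("monday meeting notes"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.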

Text Classification Using Machine Learning Techniques: A Comparative Study

Text mining is drawing enormous attention because a huge amount of text data is being generated, and this data must be managed well to get the maximum benefit from it. Text classification is an essential part of text mining in which related text data is assigned to a particular predefined category. In our study, we discuss different classifier techniques that have been popular in recent years. The paper compares classifiers such as SVM, Naïve Bayes and Neural Networks in tabular form.

Machine Learning Algorithms for Document Classification: Comparative Analysis

International Journal of Advanced Computer Science and Applications (IJACSA), 2022

Automated document classification is a fundamental machine learning task that refers to automatically assigning categories to scanned images of documents. It has reached the state of the art, but the performance and efficiency of the algorithms still need to be verified by comparison. The objective was to identify the most efficient classification algorithms according to the usage of the fundamentals of science. Experimental methods were used, collecting data from a total of 1080 students and researchers from Ethiopian universities, together with a meta-data set of Banknotes, Crowdsourced Mapping, and VxHeaven provided by UC Irvine. 25% of the respondents felt that KNN is better than the other models. The overall analysis observed the performance of KNN, SVM, Perceptron and Gaussian NB through various parameters, namely an accuracy of 99.85%, a precision of 0.996, a recall of 100%, an F-score of 0.997, classification time, and running time. KNN performed better than the other classification algorithms, with a lower error rate of 0.0002 and the lowest classification time and running time, ~413 and 3.6978 microseconds respectively. Considering all the parameters, the study concludes that KNN is the best algorithm.

A Survey on Text Classification using Machine Learning Algorithms

International journal of engineering research and technology, 2019

In today’s world, the usage of digitized text documents has drastically increased. The reasons are the growing need for portability of text files and the greater need to eliminate dependence on paper. Previously, the task of document classification was handled by very experienced experts capable of classifying large text documents into their corresponding categories. Over time, it was realized that this task is extremely time-consuming. Therefore, the need for automatic text document classification emerged. Corresponding research has applied various classification algorithms to create automated text document classification systems. The major tasks involved in creating this type of automated system are handling large amounts of text, selecting features from a wide range of candidates, and eventually selecting the classification algorithm best suited to classifying text files. Initially the predefined class...

A Study on Document Classification using Machine Learning Techniques

2014

With the explosion of information fuelled by the growth of the World Wide Web, it is no longer feasible for a human observer to understand all the incoming data or even classify it into categories. With this growth of information and the simultaneous growth of available computing power, automatic classification of data, particularly textual data, is gaining increasing importance. Text classification is the task of automatically sorting a set of documents into categories from a predefined set and is one of the important research issues in the field of text mining. This paper provides a review of the generic text classification process, the phases of that process and methods being

Comparison of Classification Algorithms in Text Mining

2017

Web Mining is the search for useful data in the World Wide Web repository. It is divided into Content Mining, Usage Mining and Structure Mining; Content Mining uses text, images, audio and video, which are unstructured, to extract useful information. Web Mining is a sub-process of Data Mining, which involves anomaly detection, classification, clustering, association rule mining, regression and summarization. Data Mining discovers patterns in large data sets and involves many disciplines, such as Artificial Intelligence, Machine Learning, Statistics and Database Systems. Machine Learning is an emerging technology that lets machines predict values for new data inputs based on previous data inputs used to train an algorithm. Classification belongs to supervised learning, where a training set of correctly identified observations is available. In this paper three algorithms, Naïve Bayes, Random Forest and Support Vector Machine, are used in Rapid Miner with 500...

Evaluation of the Document Classification Approaches

2013

This paper deals with one-class automatic document classification. Five feature selection methods and three classifiers are evaluated on a Czech corpus in order to build an efficient Czech document classification system. Lemmatization and POS tagging are used for a precise representation of the Czech documents. We demonstrated that POS tag filtering is very important, while lemmatization plays a marginal role in classification. We also showed that Maximum Entropy and Support Vector Machines are very robust to the feature vector size and significantly outperform the Naive Bayes classifier in terms of classification accuracy. The best classification accuracy is about 90%, which is sufficient for an application for the Czech News Agency, our commercial partner.

A Comparative Study of Machine Learning Approaches for Text Classification

2016

Perhaps the single largest data source in the world is the World Wide Web. The heterogeneous and unstructured nature of web data makes mining the web challenging. Practical needs to extract textual information and unseen patterns continue to drive research interest in text mining. Machine learning techniques are well suited to accurate categorization of texts. In this paper we present a review of various text classification approaches under the machine learning paradigm. Existing classification algorithms, including Decision Tree, Naive Bayes, Support Vector Machine and k-Nearest Neighbors, are compared based on speed, accuracy, interpretability and multi-class support.
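Of the algorithms compared above, k-Nearest Neighbors is perhaps the simplest to sketch. The example below is an illustrative assumption rather than the paper's setup: it represents each document as a bag of word counts, ranks training documents by cosine similarity to the query, and takes a majority vote among the top k. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, documents, labels, k=3):
    """Label a query text by majority vote among its k most similar documents."""
    query_vec = Counter(query.lower().split())
    scored = [
        (cosine_similarity(query_vec, Counter(doc.lower().split())), label)
        for doc, label in zip(documents, labels)
    ]
    # Most similar documents first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "stock market shares fall",
    "bank raises interest rates",
    "market rally lifts bank shares",
    "team wins football match",
    "player scores in final match",
    "coach praises football team",
]
corpus_labels = ["finance"] * 3 + ["sports"] * 3

print(knn_classify("football match tonight", corpus, corpus_labels))
print(knn_classify("bank shares", corpus, corpus_labels))
```

This also hints at the trade-offs such comparisons consider: kNN has no training step, but every prediction scans the whole training set, so classification time grows with corpus size.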

Comparison of Text Classification Algorithms

2015

The paper presents an empirical study of three text classification algorithms using two datasets. Naïve Bayes, Support Vector Machine and C4.5 are compared by training on the dataset instances in the Weka tool. The two datasets are Diabetes and Calories. The Diabetes dataset has a larger number of training examples and attributes than the Calories dataset. The results are compared based on the recall and precision values that each algorithm returns. Another basis of comparison is the percentage split of the dataset into training set and test set. Results show that, of the three classifiers, SVM is computationally efficient. SVM has certain disadvantages that degrade its performance on small datasets. Thus, it is proposed that using a Hybrid SVM may address the existing drawbacks of SVM. Even changing the approach with which SVM is applied to the dataset can produce optimized results.
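The percentage split mentioned above can be sketched outside Weka as follows. This is a hypothetical helper, not the tool's implementation: the function name, the 66% default and the fixed seed are illustrative assumptions. It shuffles the instances reproducibly and cuts them into a training part and a test part.

```python
import random


def percentage_split(instances, train_percent=66, seed=42):
    """Shuffle instances and split them into (train, test) lists.

    train_percent is the share of instances used for training; the 66
    default and the seed are illustrative values, not taken from the paper.
    """
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 0 and 100 (exclusive)")
    shuffled = list(instances)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of 100 instance ids.
dataset = list(range(100))
train_set, test_set = percentage_split(dataset, train_percent=66)
print(len(train_set), len(test_set))
```

Repeating the comparison at several split percentages shows how sensitive each classifier is to the amount of training data, which matters for the small-dataset behaviour the abstract attributes to SVM.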