A Novel Text Categorization Approach based on K-means and Support Vector Machine
Related papers
Online Text Categorization System Using Support Vector Machine
SURJ, 2018
Text classification is a pressing need today, since large volumes of text exist in the form of stories, news and similar material. Several techniques have been applied to it, including Support Vector Machines, Neural Networks and Decision Trees. Stories and newspapers are page collections that fall within the scope of text categorization. Various Sindhi newspapers are published regularly, and Daily Kawish is one of them. Readers face difficulties because there is no option that groups news items by topic, such as sports, technology, crime, fashion and current affairs. For this purpose, a Text Categorization System (TCS) for the Sindhi language is presented in this paper. Five classes are used, and each scanned newspaper page is placed in a single class. Because it is difficult to predict how many users will read the newspaper simultaneously, web performance is also tested. For the classification of text from pages, precision, recall and F-measure are used for evaluation, and the system achieves 67% accuracy. It would benefit readers who want to save time when reading the newspaper.
Categorization of the Documents by Using Machine Learning
2015
Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Clustering is one of the applications of machine learning: it groups data items into classes according to their similarity. Data items within a class are highly similar to one another but very dissimilar to data items in other classes. The K-means algorithm is one of the most widely used clustering algorithms. It is easy to implement and can be applied to a wide variety of problems, but its use has been restricted to small datasets. The Hadoop system supports MapReduce, which divides the input into smaller inputs and executes operations in a distributed manner. It is inexpensive, scalable, free and open source, which makes it a promising technology for data-intensive problems. The aim of this work is to use the K-means algorithm for a large dataset on Hadoop and then check its performance by increasing data an...
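The clustering loop described in this abstract can be illustrated with a minimal single-machine sketch of K-means (Lloyd's algorithm). It is not the paper's Hadoop/MapReduce implementation; the function name `kmeans`, the toy points and the choice of initial centroids are assumptions made only for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data with two visibly separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

In a MapReduce setting the assignment step would typically run in the mappers and the centroid averaging in the reducers, but that mapping is a common pattern rather than something the abstract spells out.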
IRJET- Text Document categorization using support vector machine
The Web is a tremendous source of information, so large that it becomes difficult for people to select meaningful information without support. Document categorization refers to the problem of automatically classifying a set of documents into classes (or categories, or topics). Automatic text categorization is an important problem in text mining: the task is to classify text documents into predefined classes based on their content, and it has become an important research issue. Proper categorization of text documents requires information retrieval, machine learning and natural language processing (NLP) techniques. Our aim is to focus on important approaches to automatic text categorization based on machine learning. Several methods have been proposed for text document categorization. We will adapt and create machine learning algorithms for use with the Web's distinctive structures: large-scale, noisy, varied data with potentially rich, human-oriented features, using SVM.
Some Investigations on Machine Learning Techniques for Automated Text Categorization
International Journal of Computer Applications, 2013
The automated categorization (classification) of texts into predefined categories is one of the most widely explored research fields in text mining. Digital data is now available in very large quantities, and organizing it into predefined categories has become a challenging task. Machine learning makes it possible to train an automated classifier to classify documents with minimal human assistance. This paper discusses the Naïve Bayes, Rocchio, k-Nearest Neighbor and Support Vector Machine methods within the machine learning paradigm for automated categorization of given documents into predefined categories.
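Of the methods named in this abstract, Rocchio is the simplest to sketch. The example below is a simplified variant that uses only positive class prototypes (the mean term-frequency vector of each class) and assigns a document to the class with the most similar prototype under cosine similarity. The full Rocchio formulation also subtracts a weighted negative prototype, which is omitted here. All function names and the toy training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words term frequencies over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_rocchio(labeled_docs):
    """Return one prototype per class: the mean term-frequency vector."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {term: value / counts[label] for term, value in vec.items()}
        for label, vec in sums.items()
    }

def classify(text, prototypes):
    """Pick the class whose prototype is most similar to the document."""
    vec = vectorize(text)
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))

# Hypothetical training set with two categories.
train = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("new phone ships with a faster chip", "technology"),
    ("the software update fixes a security bug", "technology"),
]
prototypes = train_rocchio(train)
print(classify("a goal in the final match", prototypes))
```

A real system would normally weight terms (for example with TF-IDF) and remove stop words before building the prototypes; raw counts are used here only to keep the sketch short.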
Categorizing document by fuzzy C-Means and K-nearest neighbors approach
AIP Conference Proceedings, 2017
Advances in technology have made document categorization increasingly important, driven by the growing number of documents. Managing documents by categorizing them is an Information Retrieval application, because the process involves text mining. Categorization can be done with either the Fuzzy C-Means (FCM) or the K-Nearest Neighbors (KNN) method. This experiment combines both methods, with the aim of improving document categorization performance. First, FCM is used to cluster the training documents. Second, KNN is used to categorize the test documents until the categorization output is produced. In the experiment, 14 of 20 test documents were assigned to the relevant category, while 6 were assigned to an irrelevant one. System evaluation shows that both precision and recall are 0.7.
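The reported figures are consistent with micro-averaged precision and recall in a single-label setting, where every misclassified document counts once as a false positive (for the predicted category) and once as a false negative (for the true category), so both measures reduce to 14/20 = 0.7. The abstract does not state how the scores were averaged, so this reading is an assumption. The sketch below reproduces the arithmetic with hypothetical labels; the category names `A` and `B` and the function name are illustrative only.

```python
def micro_precision_recall(gold, predicted):
    """Pool true positives, false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this label, but the true label differs
            elif g == label:
                fn += 1  # true label is this one, but another was predicted
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical outcome matching the abstract's counts: 20 test documents, 14 correct.
gold = ["A"] * 10 + ["B"] * 10
predicted = ["A"] * 7 + ["B"] * 3 + ["B"] * 7 + ["A"] * 3
precision, recall = micro_precision_recall(gold, predicted)
print(precision, recall)
```

With macro-averaging, or with documents that may receive zero or several labels, precision and recall would generally differ from each other and from plain accuracy.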
A Survey on Machine Learning Based Text Categorization
2018
As the number of documents available in digital form becomes enormous, the need to access them in a more flexible way becomes extremely important. In this context, content-based document management tasks are referred to as IR, or Information Retrieval, which has achieved a noticeable position in the area of information systems. For a faster IR response time, it is essential to organize, categorize and classify texts and digital documents according to the definitions proposed by text mining experts and computer scientists. Automatic text categorization, or topic spotting, is the process of automatically sorting a document set into categories from a predefined set. According to researchers, the superior approach to this problem relies on machine learning methods in which a general posteriori process automatically builds a classifier by learning, from the given pre-classified documents, the characteristics of the categories. The acceptance of automatic text categorization is done because it is fre...
Comparative Study of Classification Algorithm for Text Based Categorization
Text categorization is a process in data mining that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique. Categorization provides conceptual views of the collected documents and has important real-world applications. Text-based categorization is used for document classification together with pattern recognition and machine learning. This paper studies a number of classification algorithms for classifying documents, including Naive Bayes, K-Nearest Neighbor and Decision Tree, and presents a comparative study of their advantages and disadvantages.
Text Categorization Model Based on Linear Support Vector Machine
2022
Spam mails are a considerable nuisance in electronic mailboxes, as they occupy large amounts of space that could otherwise be used for storing relevant data. They also slow down network connections and communication over a network. Attackers have often used spam as a means of sending phishing mails to their targets in order to perpetrate data breaches and other forms of cybercrime. Researchers have developed models using machine learning algorithms and other techniques to filter spam from relevant mail; however, some algorithms and classifiers are weak, not robust, and lack visualization models that would make the results interpretable even by non-technical people. In this work, a Linear Support Vector Machine (LSVM) was used to develop a text categorization model for email texts based on two categories: Ham and Spam. The processes involved were dataset import, preprocessing (removal of stop words, vectorization), feature selection (weighing and sele...
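The preprocessing steps named in this abstract (stop-word removal and vectorization) might look roughly like the following sketch. The stop-word list, the tokenization rule, the function names and the sample emails are all assumptions for illustration; the paper's actual preprocessing and feature selection are not described in the text available here.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "of", "and"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def build_vocabulary(docs):
    """Sorted list of every distinct token seen in the training documents."""
    return sorted({token for doc in docs for token in tokenize(doc)})

def vectorize(text, vocabulary):
    """Term-count vector aligned with the vocabulary; unseen words are ignored."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

# Hypothetical training emails.
emails = ["Win a free prize now", "Meeting moved to Monday"]
vocabulary = build_vocabulary(emails)
vector = vectorize("Free prize! Claim your free prize", vocabulary)
print(vocabulary)
print(vector)
```

Vectors of this kind are what a linear classifier such as an SVM would consume, usually after term weighting rather than as raw counts.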
Categorizing Text Documents Using Naïve Bayes, SVM and Logistic Regression
2020
Text document categorization is the task of arranging different types of documents into labelled categories. This paper combines data mining, data extraction and artificial intelligence for text categorization, and showcases the features of the technologies involved. Three machine learning algorithms (SVM, Multinomial Naive Bayes and Logistic Regression) are used to arrange the documents of the 20 Newsgroups dataset into categories. In the evaluation of these classification techniques, the SVM classifier outperforms the other classifiers.