A Survey on Text Categorization Techniques for Indian Regional Languages

Indian Language Text Representation and Categorization Using Supervised Learning Algorithm

India is home to many languages, owing to its cultural and geographical diversity. The official and regional languages of India play an important role in communication among the people living in the country. The Constitution of India allows each Indian state to choose its own official language for official communication at the state level. As of May 2008, the Eighth Schedule lists 22 official languages. The amount of textual data available in electronic form in various Indian regional languages is growing rapidly, so classification of text documents by language is essential. The objective of this work is the representation and categorization of Indian language text documents using text mining techniques. A corpus of South Indian languages (Kannada, Tamil and Telugu) has been created. Several text mining techniques, namely the naive Bayes classifier, the k-nearest-neighbor classifier and decision trees, have been used for text categorization. Little work has been done on text categorization in Indian languages, and the task is challenging because Indian languages are morphologically very rich. In this paper an attempt has been made to categorize Indian language text using text mining algorithms.

Automatic Text Categorization of Marathi Language Documents

2016

Information technology has generated a huge amount of data on the internet. This data is mainly in English, so the majority of data mining and natural language processing research is on English. As internet usage has increased, data in the Marathi language has also increased. This work presents a document categorization system for Marathi text documents. The system categorizes Marathi documents and displays them to the end user by category. Similar documents are grouped to form clusters. The clusters are formed automatically, i.e. the system assigns names to the clusters (folders of categorized documents) based on the content of the documents. Automatic text categorization supports better management of text documents and simplifies their retrieval. As the amount of digital information available on the internet has increased, there is growing interest in helping users better find, filter and manage this information.

Assamese Text Classification using k Nearest Neighbor

International Journal of Recent Technology and Engineering (IJRTE), 2019

Knowledge is the most powerful weapon of a society, and in today’s world it is just a mouse click away. There is an abundance of knowledge and information in the form of newspapers, electronic newspapers, articles, online journals, webpages, search results, etc., and a wide range of news from all over the world. The choice of news, however, varies from person to person. Some people may prefer sports news to entertainment news, some may prefer political news to sports news, and there can be many other preferences. It depends entirely on the individual’s decision. Document classification is the process of assigning a document to one of a number of predefined classes. In this paper we perform document classification of Assamese text using k-Nearest Neighbor. We consider only four classes: sports, politics, law and science. Our dataset consists of 200 documents collected from major Assamese newspapers. We have divided our data in a 3:1 ratio. Majority of our...
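As a rough illustration of the approach this abstract describes, the following is a minimal sketch of k-Nearest Neighbor document classification with a 3:1 split. It is not the authors' implementation: the whitespace tokenizer, the cosine similarity measure, the value k=3, the function names and the English placeholder documents are all assumptions made for the example, and the 3:1 ratio is assumed to mean training versus held-out data.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts from whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_predict(train_docs, text, k=3):
    """Return the majority label among the k training documents most similar to text."""
    query = vectorize(text)
    ranked = sorted(train_docs,
                    key=lambda pair: cosine(vectorize(pair[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def split_3_to_1(docs):
    """Every fourth document is held out; the remaining three quarters are for training."""
    train_docs = [d for i, d in enumerate(docs) if i % 4 != 3]
    held_out = [d for i, d in enumerate(docs) if i % 4 == 3]
    return train_docs, held_out

# Invented placeholder documents (hypothetical, not from the paper's Assamese dataset).
docs = [
    ("cricket team wins match", "sports"),
    ("football team match score", "sports"),
    ("minister election vote party", "politics"),
    ("cricket team match score", "sports"),
    ("parliament election minister debate", "politics"),
    ("court judge verdict appeal", "law"),
    ("judge court law hearing", "law"),
    ("election vote minister parliament", "politics"),
    ("scientists discover planet orbit", "science"),
    ("telescope planet orbit research", "science"),
    ("court verdict law appeal", "law"),
    ("planet orbit telescope scientists", "science"),
]

train_docs, held_out = split_3_to_1(docs)
predictions = [knn_predict(train_docs, text, k=3) for text, _ in held_out]
accuracy = sum(p == label for p, (_, label) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

A real system for Assamese would need language-appropriate tokenization and likely term weighting such as TF-IDF; the sketch only shows the voting mechanism.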

Telugu Text Categorization using Language Models

Global journal of computer science and technology, 2017

Document categorization has become an active research area owing to the abundance of documents available in digital form. In this paper we propose language-dependent and language-independent models applicable to the categorization of Telugu documents. India is a multilingual country; each Indian state may choose its own official language for official communication at the state level. The amount of textual data available in electronic form in various Indian regional languages is growing rapidly. Hence, classification of text documents by language is crucial. Telugu is the third most spoken language in India and one of the fifteen most spoken languages in the world. It is the official language of the states of Telangana and Andhra Pradesh. A variant of the k-nearest neighbors algorithm is used for categorization. Results are obtained by comparing the language-dependent and language-independent models.

COMPARATIVE STUDY OF CLASSIFICATION ALGORITHM FOR TEXT BASED CATEGORIZATION

Text categorization is a data mining process that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using categorization techniques. Categorization provides conceptual views of the collected documents and has important real-world applications. Text-based categorization is used for document classification together with pattern recognition and machine learning. A number of classification algorithms for classifying documents are studied in this paper, for example Naive Bayes, K-Nearest Neighbor and Decision Tree. This paper presents a comparative study of the advantages and disadvantages of these classification algorithms.
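Of the algorithms named in this abstract, Naive Bayes is the most compact to illustrate. The sketch below shows a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing; the abstract does not specify a variant, so the smoothing choice, the whitespace tokenizer, the function names and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for tok in text.lower().split():
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen (class, word) pairs from zeroing the product.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy documents (hypothetical).
nb_docs = [
    ("cricket team wins match", "sports"),
    ("football team match score", "sports"),
    ("minister election vote party", "politics"),
    ("parliament election minister debate", "politics"),
]
nb_model = train_nb(nb_docs)
print(predict_nb(nb_model, "team match today"))
print(predict_nb(nb_model, "election debate"))
```

Working in log space avoids numerical underflow when documents contain many words, which is one practical reason Naive Bayes scales well to text.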

Some Investigations on Machine Learning Techniques for Automated Text Categorization

International Journal of Computer Applications, 2013

The automated categorization (classification) of texts into predefined categories is one of the most widely explored research fields in text mining. Nowadays a very large amount of digital data is available, and organizing it into predefined categories has become a challenging task. Machine learning is an approach by which an automated classifier can be trained to classify documents with minimal human assistance. This paper discusses the Naïve Bayes, Rocchio, k-Nearest Neighbor and Support Vector Machine methods within the machine learning paradigm for automated categorization of documents into predefined categories.
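The Rocchio method mentioned here is less widely known than the others, so a small sketch may help. The version below is a simplified variant that builds one prototype (centroid) per class from positive examples only, which corresponds to setting the negative-example weight of the full Rocchio formula to zero. That simplification, the use of raw term frequencies instead of TF-IDF, the function names and the toy documents are all assumptions, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict

def unit(vec):
    """Scale a sparse vector to unit length (empty dict if the norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def train_rocchio(docs):
    """Average the unit-length document vectors of each class into a prototype."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in docs:
        for term, weight in unit(Counter(text.lower().split())).items():
            sums[label][term] += weight
        counts[label] += 1
    return {label: unit({t: v / counts[label] for t, v in sums[label].items()})
            for label in sums}

def predict_rocchio(centroids, text):
    """Assign the class whose prototype has the highest cosine similarity to the text."""
    query = unit(Counter(text.lower().split()))

    def similarity(label):
        centroid = centroids[label]
        return sum(w * centroid.get(t, 0.0) for t, w in query.items())

    return max(sorted(centroids), key=similarity)

# Invented toy documents (hypothetical).
rocchio_docs = [
    ("court judge verdict appeal", "law"),
    ("judge court law hearing", "law"),
    ("scientists discover planet orbit", "science"),
    ("telescope planet orbit research", "science"),
]
centroids = train_rocchio(rocchio_docs)
print(predict_rocchio(centroids, "court hearing"))
print(predict_rocchio(centroids, "planet research"))
```

Unlike k-Nearest Neighbor, which compares a new document against every training document, this approach compares it against one vector per class, so prediction cost does not grow with the size of the training set.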

A Novel Text Categorization Approach based on K-means and Support Vector Machine

International Journal of Computer Applications, 2015

With the continuous expansion of digital libraries and online news, a huge number of text documents now exists on the web, and there is a need to organize them. Text categorization is an active research field that can be used for organizing text documents. It is the process of assigning documents to predefined categories that are associated with their content. The CAWP algorithm was designed for text categorization, but it does not give the best results on large datasets. An approach combining K-means clustering with a Support Vector Machine is used to improve the results. K-means groups the data into a number of clusters, which are then used as training samples for a Support Vector Machine in each cluster so that new sample data can be classified efficiently. In an experiment on the 20Newsgroups dataset, K-means with SVM gives better results than the CAWP algorithm in terms of F-measure.
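Since the comparison above is reported in terms of F-measure, a short sketch of how that metric is computed for one class may be useful. The abstract does not say which variant or averaging scheme was used, so the per-class F-beta form with beta=1 as the default, the function name and the example labels are assumptions.

```python
def f_measure(true_labels, predicted_labels, positive, beta=1.0):
    """F-beta score for one class, treating that class as the positive one."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented example labels (hypothetical): 2 true positives, 1 false positive, 1 false negative.
true_labels = ["a", "a", "a", "b", "b"]
predicted_labels = ["a", "a", "b", "a", "b"]
score_a = f_measure(true_labels, predicted_labels, positive="a")
print(round(score_a, 4))
```

For a multi-class dataset such as 20Newsgroups, per-class scores like this one are typically combined by macro- or micro-averaging; which of these the paper used is not stated in the abstract.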

Text Classification for Marathi Documents using Supervised Learning Methods

International Journal of Computer Applications, 2016

The evolution of information technology has led to the collection of a large number of text documents. Researchers have mostly worked on English text documents. Today, millions of documents exist in Indian regional languages, and classifying them manually is an expensive and time-consuming task. Automatic classification can help in better management and retrieval of these documents. The literature survey shows that little work has been done on classification of Marathi text documents. This paper presents an efficient Marathi text classification system using supervised learning methods and ontology-based classification.

Supervised Learning Methods for Bangla Web Document Categorization

This paper explores the use of machine learning approaches, specifically four supervised learning methods, namely Decision Tree (C4.5), K-Nearest Neighbour (KNN), Naïve Bayes (NB) and Support Vector Machine (SVM), for categorization of Bangla web documents. The task is to automatically sort a set of documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla text categorization. Hence, we analyze the efficiency of these four methods for categorizing Bangla documents. For validation, a Bangla corpus was developed from various websites and used as examples for the experiment. For Bangla, the empirical results indicate that all four methods produce satisfactory performance, with SVM attaining good results on high-dimensional and relatively noisy document feature vectors.