Supervised Learning Methods for Bangla Web Document Categorization

A Comparative Study on Different Types of Approaches to Bengali Document Categorization

Document categorization is the task of determining the category of a document. In this paper, three well-known supervised learning techniques, Support Vector Machine (SVM), Naïve Bayes (NB) and Stochastic Gradient Descent (SGD), are compared for Bengali document categorization. Besides the classifier, classification also depends on how features are selected from the dataset. To analyze how well these classifiers predict a document's category among twelve categories, feature selection techniques are also applied in this article, namely the Chi-square distribution and normalized TF-IDF (term frequency-inverse document frequency) with a word analyzer. We thus explore the efficiency of these three classification algorithms using two different feature selection techniques.
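As a rough illustration of Chi-square feature selection as it is commonly defined for text, the sketch below scores each term by the chi-square statistic of its 2x2 contingency table (term present or absent, document in or out of a target class). It is not the authors' implementation; the function name `chi_square_scores` and the toy English corpus are assumptions made for this example.

```python
from collections import Counter

def chi_square_scores(docs, labels, target):
    """Chi-square score of every term against the class `target` (hypothetical helper)."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos
    # Document frequencies of each term inside and outside the target class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        bucket = df_pos if y == target else df_neg
        bucket.update(set(tokens))
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # in class, contains term
        b = df_neg[term]   # not in class, contains term
        c = n_pos - a      # in class, lacks term
        d = n_neg - b      # not in class, lacks term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

# Toy corpus (assumed data, already tokenized).
docs = [
    ["cricket", "match", "win"],
    ["cricket", "team", "score"],
    ["election", "vote", "win"],
    ["election", "party", "vote"],
]
labels = ["sports", "sports", "politics", "politics"]
scores = chi_square_scores(docs, labels, "sports")
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

Terms that occur equally often inside and outside the class (here `win`) score zero, so keeping only the highest-scoring terms shrinks the vocabulary handed to the classifier.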

Automatic Categorization of Tagalog Documents Using Support Vector Machines

2017

Automatic document classification is a growing research topic in Natural Language Processing. Several techniques have been used to build classifiers that categorize documents written in specific languages into their designated categories. This study builds an automatic document classifier using machine learning that is suited for Tagalog documents. The documents used were news articles scraped from Tagalog news portals. These documents were manually annotated into different categories and later preprocessed with techniques such as stemming and stopword removal. Different document representations were also used to explore which representation performed best with the classifiers. The SVM classifier using the stemmed dataset represented with TF-IDF values yielded an F-score of 91.99% and an overall accuracy of 92%. It outperformed all other combinations of document representations and classifiers.
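To make the TF-IDF representation concrete, here is a minimal sketch that turns tokenized documents into L2-normalized TF-IDF vectors. The weighting variant (raw term count times `log(N / df) + 1`) is an assumption chosen for illustration; the abstract does not say which variant the study used, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Assumed IDF variant; the +1 keeps terms that occur in every document.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors

# Toy tokenized documents (assumed data).
docs = [["news", "sports", "sports"], ["news", "politics"]]
vectors = tfidf_vectors(docs)
```

In each vector the term shared by both documents (`news`) receives a lower weight than the terms that distinguish them, which is the property a downstream classifier such as an SVM relies on.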

Experiments on the Use of Feature Selection and Machine Learning Methods in Automatic Malay Text Categorization

Procedia Technology, 2013

Due to the rapid growth of documents in digital form, research on automatic text categorization into predefined categories has attracted booming interest. Although a wide range of supervised machine learning methods have been applied to categorize English text, relatively few studies have been done on Malay text categorization. This paper reports our comparative evaluation of three machine learning methods on Malay text categorization. Two feature selection methods (Information Gain (IG) and Chi-square) and three machine learning methods (K-Nearest Neighbor (k-NN), Naive Bayes (NB) and N-gram) were investigated. The three supervised machine learning models were evaluated on a categorized Malay corpus, and experimental results showed that k-NN with Chi-square feature selection gave the best performance (Macro-F1 = 96.14).
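The Macro-F1 figure reported above averages the per-class F1 scores with equal weight for every class. A small sketch of that computation, assuming single-label predictions and using a hypothetical helper `macro_f1`, could look like this:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_values = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1_values.append(2 * precision * recall / total if total else 0.0)
    return sum(f1_values) / len(f1_values)

# Toy labels (assumed data): one document of class "a" is misclassified as "b".
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)
```

Because every class counts equally, a rare category that is classified poorly lowers Macro-F1 more than it lowers overall accuracy.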

COMPARATIVE STUDY OF CLASSIFICATION ALGORITHM FOR TEXT BASED CATEGORIZATION

Text categorization is a process in data mining that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique. Categorization provides conceptual views of the collected documents and has important real-world applications. Text-based categorization is used for document classification together with pattern recognition and machine learning. The advantages of a number of classification algorithms for classifying documents are studied in this paper. Examples of these algorithms include Naive Bayes, K-Nearest Neighbor and Decision Tree. This paper presents a comparative study of the advantages and disadvantages of the above-mentioned classification algorithms.

Some Investigations on Machine Learning Techniques for Automated Text Categorization

International Journal of Computer Applications, 2013

The automated categorization (classification) of texts into predefined categories is one of the most widely explored fields of research in text mining. Nowadays, the availability of digital data is very high, and managing it in predefined categories has become a challenging task. Machine learning is an approach by which we can train an automated classifier to classify documents with minimal human assistance. This paper discusses the Naïve Bayes, Rocchio, k-Nearest Neighbor and Support Vector Machine methods within the machine learning paradigm for automated categorization of given documents into predefined categories.
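Of the methods listed, Rocchio is perhaps the simplest to sketch: each category is represented by a prototype vector and a new document is assigned to the most similar prototype. The version below is a simplified, centroid-only variant (no negative-example term) over raw term counts, and the names `train_centroids`, `cosine` and `predict` are assumptions for this example rather than anything taken from the paper.

```python
import math
from collections import Counter, defaultdict

def train_centroids(docs, labels):
    """Average the term-count vectors of each class into one prototype."""
    sums = defaultdict(Counter)
    counts = Counter(labels)
    for tokens, y in zip(docs, labels):
        sums[y].update(tokens)
    return {y: {t: w / counts[y] for t, w in vec.items()} for y, vec in sums.items()}

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(centroids, tokens):
    """Assign the label whose prototype is most similar to the document."""
    vec = Counter(tokens)
    return max(centroids, key=lambda y: cosine(vec, centroids[y]))

# Toy corpus (assumed data, already tokenized).
docs = [
    ["cricket", "match", "win"],
    ["cricket", "team", "score"],
    ["election", "vote", "win"],
    ["election", "party", "vote"],
]
labels = ["sports", "sports", "politics", "politics"]
centroids = train_centroids(docs, labels)
```

Calling `predict(centroids, ["cricket", "score"])` picks the sports prototype here. In practice the count vectors would usually be replaced by TF-IDF weights.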

Online Text Categorization System Using Support Vector Machine

SURJ, 2018

Text classification is a present-day need, as large volumes of text exist in the form of stories, news, etc. Systems for this task have been built with several techniques, such as Support Vector Machine, Neural Networks and Decision Tree. Stories and newspapers are page collections that fall within the scope of text categorization. Various Sindhi newspapers are published regularly, and Daily Kawish is one of them. Readers face difficulties because there is no option that categorizes news related to sports, technology, crime, fashion and current affairs. For this purpose, a Text Categorization System (TCS) for the Sindhi language is presented in this paper. Five classes are used, and each newspaper page is scanned and placed in a single class. Because it is difficult to predict how many users will read the newspaper simultaneously, web performance is also tested. Moreover, precision, recall and F-measure are used to evaluate the classification of text from pages, and the system achieved 67% accuracy in classifying text from newspaper pages. It would be beneficial for those who want to save time while reading newspapers.

Accurate Prediction of Bangla Text Article Categorization by Utilizing Novel Bangla Stemmer

International Journal of Automation and Smart Technology, 2024

Text categorization involves assigning predefined category labels to an unlabeled document. With the exponential growth in the accessibility and availability of digital documents over the past decade, this field has attracted significant attention from the scientific community, which demands rapid and accurate categorization of these documents. Relying on experts for manual classification is time-consuming and resource-intensive. Consequently, labeling unlabeled digital documents faster, more accurately and more efficiently has become unavoidable. One promising approach to addressing this demand is the use of machine learning algorithms. Training these algorithms on a large dataset of labeled texts lets them learn patterns and predict labels for unlabeled documents. This strategy might greatly expedite the categorization process while retaining a substantial level of accuracy by leveraging artificial intelligence. These algorithms have also enhanced natural language processing techniques, making them more accurate at classifying unlabeled digital documents. In this study, we propose a novel machine-learning computational framework to address this challenge. Our framework incorporates a novel Bangla stemmer, which reduces words to their stems. We then employ TF-IDF, a statistical measure of word relevance, for document vectorization. Experimental results reveal that our framework significantly enhances prediction performance, achieving 95.3% prediction accuracy.

Automated Text Categorization with Machine Learning and its Application in Multilingual Text Categorization

The automated categorization (or classification) of texts into predefined categories is one of the booming fields of text mining. Nowadays, the availability of digital data is very high, and managing it in predefined categories has become a challenging task. Machine learning is a technique by which we can build an automated classifier to classify documents with minimal human assistance. The advantages of this approach over the knowledge engineering approach are effectiveness, considerable savings in expert labor, and straightforward portability to different domains. This paper discusses the Naïve Bayes, Rocchio and kNN methods within the machine learning paradigm for automated categorization of documents into predefined categories. We also discuss multilingual text categorization, which consists of classifying documents in different languages according to the same classification tree.
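As an illustration of the first method named in this abstract, the sketch below implements a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. The smoothing choice, the function names `train_nb` and `predict_nb`, and the toy corpus are assumptions for this example; the paper itself may use a different formulation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class priors, per-class term counts and the vocabulary."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, y in zip(docs, labels):
        term_counts[y].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for the document."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for y in class_docs:
        total_terms = sum(term_counts[y].values())
        score = math.log(class_docs[y] / total_docs)
        for t in tokens:
            if t in vocab:  # terms never seen in training are skipped
                score += math.log((term_counts[y][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Toy corpus (assumed data, already tokenized).
docs = [
    ["cricket", "match", "win"],
    ["cricket", "team", "score"],
    ["election", "vote", "win"],
    ["election", "party", "vote"],
]
labels = ["sports", "sports", "politics", "politics"]
model = train_nb(docs, labels)
```

Working in log space avoids numerical underflow when documents contain many terms, and the add-one smoothing keeps a single unseen class-term pair from zeroing out the whole product.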

Reducing feature space and analyzing effects of using non linear kernels in SVM for Bangla news categorization

2018 International Conference on Bangla Speech and Language Processing (ICBSLP), 2018

Text categorization is a trending topic nowadays. In this paper, we analyze some existing approaches to Bangla document categorization and propose modifications to them. Using our modified approach, we achieved an accuracy of 92.79%, which is the best accuracy reported so far on the dataset we used, which consists of more than 30,000 documents. In short, we used TF-IDF combined with a term frequency threshold as our feature selection technique and SVM as our classifier to classify the documents. We also greatly reduced the feature space and computation time using our approach.
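The abstract does not spell out how the term frequency threshold is applied, so the following is only one plausible reading: drop every term whose total count in the corpus falls below a cutoff before computing TF-IDF. The helper `prune_vocabulary` and its `min_count` parameter are hypothetical names introduced for this sketch.

```python
from collections import Counter

def prune_vocabulary(docs, min_count=2):
    """Keep only terms whose total corpus count reaches `min_count` (assumed rule)."""
    totals = Counter(t for tokens in docs for t in tokens)
    vocab = {t for t, c in totals.items() if c >= min_count}
    pruned = [[t for t in tokens if t in vocab] for tokens in docs]
    return vocab, pruned

# Toy corpus (assumed data, already tokenized).
docs = [
    ["cricket", "match", "win"],
    ["cricket", "team", "score"],
    ["election", "vote", "win"],
    ["election", "party", "vote"],
]
vocab, pruned = prune_vocabulary(docs, min_count=2)
```

On this toy corpus the vocabulary shrinks from eight terms to four, which shows how such a threshold reduces the feature space and, with it, training time.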