An Ensemble of Feature Selection with Deep Learning based Automated Tamil Document Classification Models

AN EFFICIENT FEATURE EXTRACTION WITH SUBSET SELECTION MODEL USING MACHINE LEARNING TECHNIQUES FOR TAMIL DOCUMENTS CLASSIFICATION

IAEME PUBLICATION, 2020

The development of the internet has resulted in a significant rise in the number of electronic documents in several regional languages. As Tamil text data in digital format, both online and offline, is growing significantly, management and retrieval of the documents is a tedious process. Automatic text classification aims to allocate fixed class labels to unclassified text documents. Many natural language processing (NLP) techniques are extremely dependent on the automatic classification of Tamil text documents. The current development of machine learning (ML) algorithms helps to attain effective Tamil document classification. In this view, this paper introduces an automated Tamil document classification technique using ML models. The presented model involves preprocessing, feature extraction, feature selection, and classification. It uses the term frequency-inverse document frequency (TF-IDF) approach for feature extraction, and the Chi-square test is employed to select an optimal set of features. Finally, three ML models, namely random forest (RF), decision tree (DT), and gradient boosting tree (GBT), are applied to determine the class labels of the Tamil documents. To assess the performance of the presented model, a set of simulations was run on a Tamil document dataset that we collected ourselves. The experimental results confirm that the presented model classifies more effectively than the compared methods. The GBT model reached the best classification outcome, with a maximum accuracy of 85.10%, precision of 87.01%, recall of 85.10%, and F1-score of 85.52%.
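The Chi-square selection step named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: it scores term presence against one target class with a 2x2 contingency table, whereas the paper applies the test to TF-IDF features; the function names `chi_square_scores` and `select_top_k` are hypothetical, and the toy English tokens stand in for preprocessed Tamil tokens.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Chi-square statistic per term from a 2x2 table:
    term present/absent versus document in `target` class or not."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos
    present_pos = Counter()  # documents of the target class containing the term
    present_neg = Counter()  # documents of other classes containing the term
    for tokens, y in zip(docs, labels):
        bucket = present_pos if y == target else present_neg
        for term in set(tokens):
            bucket[term] += 1
    scores = {}
    for term in set(present_pos) | set(present_neg):
        a = present_pos[term]   # term present, target class
        b = present_neg[term]   # term present, other classes
        c = n_pos - a           # term absent, target class
        d = n_neg - b           # term absent, other classes
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (assumption: already tokenized and preprocessed).
docs = [
    ["cricket", "match", "score"],
    ["cricket", "team", "win"],
    ["election", "vote", "party"],
    ["election", "party", "win"],
]
labels = ["sports", "sports", "politics", "politics"]
scores = chi_square_scores(docs, labels, "sports")
top_terms = select_top_k(scores, 2)
```

A term such as "win", which appears equally often in both classes, scores zero and is dropped first; the surviving terms would then be passed to a classifier such as RF, DT, or GBT.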

Classification of heterogeneous Malayalam documents based on structural features using deep learning models

International Journal of Electrical and Computer Engineering (IJECE), 2023

The proposed work gives a comparative study of the performance of various pretrained deep learning models for classifying Malayalam documents such as agreement documents, notebook images, and palm leaves. The documents are classified based on their visual and structural features. The dataset was manually collected from different sources. The method proceeds through preprocessing, feature extraction, and classification. The work deals with three fine-tuned deep learning models: visual geometry group-16 (VGG-16), convolutional neural network (CNN), and AlexNet. The models attained high accuracies of 99.7%, 96%, and 95%, respectively. Among the three models, the fine-tuned VGG-16 model performed best, attaining a very high accuracy on the dataset. As future work, methods to classify the documents based on content as well as spectral features can be developed.

Feature Selection using Normalized Weight Method for Tamil Text Classification

Feature selection simplifies Tamil text classification. In the current information age, applications on the World Wide Web have grown rapidly, and Tamil-language material such as web pages, e-mails, e-books, and other digital data has grown enormously, so retrieval of Tamil digital documents is in high demand among people searching for them. For quick retrieval of the needed documents among millions of Tamil web documents, the documents should be classified by content into classes. Tamil text classification is groundwork for many Tamil NLP applications such as query response, information extraction, and information summarization, and text categorization is very important in the information retrieval field. Text categorization assigns a document an appropriate category from a predefined group of categories; Tamil text classification does so based on the Tamil text in a document. Tamil words are morphologically very rich, so the language has a very large set of word forms, and it is therefore important to reduce the features of Tamil text. This paper discusses feature selection using normalized weights over the huge set of keywords from the preprocessed corpus. Feature selection by the normalized term weighting (TF*IDF) method reduces the size of the keyword list, which is very useful for training and testing Tamil text classification algorithms.
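A minimal sketch of keyword reduction by normalized TF*IDF weights is given below. The abstract does not state the exact normalization or selection rule, so this assumes L2 (cosine) normalization per document and a hypothetical threshold on each term's best weight across the corpus; `normalized_tfidf` and `select_keywords` are illustrative names, and the toy tokens stand in for preprocessed Tamil keywords.

```python
import math
from collections import Counter


def normalized_tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)
    and scaling each document vector to unit L2 length."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        raw = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


def select_keywords(vectors, threshold):
    """Keep terms whose highest normalized weight in any document
    reaches the threshold (assumed selection rule)."""
    best = {}
    for vec in vectors:
        for term, weight in vec.items():
            best[term] = max(best.get(term, 0.0), weight)
    return sorted(t for t, w in best.items() if w >= threshold)


# Toy corpus (assumption: already tokenized and preprocessed).
docs = [
    ["tamil", "news", "cricket", "cricket"],
    ["tamil", "news", "election"],
    ["tamil", "film", "music"],
]
vectors = normalized_tfidf(docs)
keywords = select_keywords(vectors, threshold=0.5)
```

In this toy corpus "tamil" occurs in every document, so its IDF is zero and it is removed, while "news" falls below the threshold; the reduced keyword list is what would be used to train and test a classifier.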

Nepali Text Document Classification Using Deep Neural Network

Tribhuvan University Journal

Automated text classification is a well-studied problem in text mining that generally demands the automatic assignment of a label or class to a particular text document on the basis of its content. To design a computer program that learns a model from training data and assigns a specific label to an unseen text document, many researchers have applied deep learning technologies. For the Nepali language, this is the first attempt to use deep learning, especially a Recurrent Neural Network (RNN), and compare its performance to a traditional Multilayer Neural Network (MNN). In this study, Nepali texts were collected from online news portals and then pre-processed and vectorized. Finally, a deep learning classification framework was designed and evaluated in ten experiments: five for the Recurrent Neural Network and five for the Multilayer Neural Network. On comparing the results of the MNN and RNN, it can be concluded that RNN outperformed the MNN as the highest accuracy achieved by MNN...

Bengali Text Classification: A New multi-class Dataset and Performance Evaluation of Machine Learning and Deep Learning Models

This study focuses on Bengali text classification using machine learning and deep learning techniques. Text classification is a fundamental task in natural language processing (NLP) that involves categorizing text documents into predefined categories or classes. While text classification has received considerable attention in the English language, there is a limited amount of research specifically addressing Bengali text classification. This gap in the literature highlights the need for exploring and developing effective techniques tailored to the Bengali language. Furthermore, although some datasets for Bengali text classification tasks have been produced, most of the datasets have a limited number of labels. In our work, we introduce a new dataset for the Bengali text classification task, which has 38 class labels. The dataset includes data from the leading Bengali newspapers. We have evaluated many state-of-the-art machine learning and deep learning classification methods on our ...

SUPERVISED METHODS FOR DOMAIN CLASSIFICATION OF TAMIL DOCUMENTS

The era of digitization creates the need for domain classification in both online and offline applications. The need for automatic text classification arises in diverse fields, and various methodologies such as machine learning algorithms have been proposed for it. Here, automatic document classification of Tamil documents is proposed in view of the exponential growth of Tamil text documents available as unstructured data in news, encyclopedias, e-books, e-governance, social media and much more. Max-Ent, CRF and SVM algorithms are used to achieve more than 90 percent average accuracy in both sentence-level and document-level classification of Tamil text documents. In this work, the Dinakaran newspaper dataset from the EMILLE/CIIL Corpus is used to test the ability of machine learning algorithms in Tamil domain classification.

Benchmarking performance of machine and deep learning-based methodologies for Urdu text document classification

Neural Computing and Applications, 2020

In order to provide benchmark performance for Urdu text document classification, the contribution of this paper is manifold. First, it provides a publicly available benchmark dataset manually tagged against 6 classes. Second, it investigates the performance impact of traditional machine learning based Urdu text document classification methodologies by embedding 10 filter-based feature selection algorithms which have been widely used for other languages. Third, for the very first time, it assesses the performance of various deep learning based methodologies for Urdu text document classification. In this regard, for experimentation, we adapt 10 deep learning classification methodologies which have produced the best performance figures for English text classification. Fourth, it also investigates the performance impact of transfer learning by utilizing the Bidirectional Encoder Representations from Transformers approach for the Urdu language. Fifth, it evaluates the integrity of a hybrid approach which combines traditional machine learning based feature engineering and deep learning based automated feature engineering. Experimental results show that the feature selection approach named Normalised Difference Measure, along with Support Vector Machine, outshines state-of-the-art performance on two closed source benchmark datasets, CLE Urdu Digest 1000k and CLE Urdu Digest 1Million, by significant margins of 32% and 13%, respectively. Across all three datasets, Normalised Difference Measure outperforms other filter-based feature selection algorithms as it significantly uplifts the performance of all adopted machine learning, deep learning, and hybrid approaches. The source code and presented dataset are available at a GitHub repository.
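The abstract does not give the formula for the Normalised Difference Measure. The sketch below assumes the definition commonly cited in the feature selection literature, the absolute difference between a term's true positive rate and false positive rate divided by the smaller of the two, with a small constant substituted when that minimum is zero. The function name `ndm_score` and the value of `eps` are assumptions for illustration, not taken from the paper's code.

```python
def ndm_score(tp, fp, n_pos, n_neg, eps=0.01):
    """Assumed NDM definition: |tpr - fpr| / min(tpr, fpr).

    tp: positive-class documents containing the term
    fp: negative-class documents containing the term
    n_pos, n_neg: number of positive and negative documents
    eps: stand-in denominator when min(tpr, fpr) is zero (assumed value)
    """
    tpr = tp / n_pos
    fpr = fp / n_neg
    denom = min(tpr, fpr)
    if denom == 0:
        denom = eps
    return abs(tpr - fpr) / denom


# A term seen in 8 of 10 positive and 2 of 10 negative documents.
frequent_skewed = ndm_score(8, 2, 10, 10)
# A term seen equally often in both classes carries no signal.
balanced = ndm_score(5, 5, 10, 10)
# A term that never occurs in the negative class uses the eps fallback.
exclusive = ndm_score(4, 0, 10, 10)
```

Under this definition, two terms with the same rate difference are ranked differently when one of them is rare in the minority-rate class, which is the property that distinguishes it from a plain difference score.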

A Computational Framework for Tamil Document Classification using Random Kitchen Sink

With the rapid growth of the World Wide Web, the availability and accessibility of regional language content such as e-books, web pages, e-mails, and digital repositories have grown exponentially. As a result, automatic document classification has become a hotspot for fetching information from among millions of web documents. Text classification forms the baseline for many NLP applications such as information extraction, query response, and information summarization. The main objective of this paper is to develop a computational framework for the supervised Tamil document classification task. This paper highlights the performance of Random Kitchen Sink, a randomization algorithm, in Grand Unified Regularized Least Squares (GURLS), a machine learning library, which is shown to be comparatively better than the conventional kernel-based classifier in terms of accuracy. Hence, we claim that Random Kitchen Sink can be an effective alternative to kernels for a classifier.
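The idea behind Random Kitchen Sink as an alternative to kernels can be sketched with random Fourier features, which map inputs into a randomized space where a plain dot product approximates an RBF kernel. This is a sketch under stated assumptions, not the GURLS implementation: it does not use the GURLS API, the RBF kernel choice and the name `make_random_features` are assumptions, and the regularized least squares classifier that would be trained on these features is omitted.

```python
import math
import random


def make_random_features(dim, n_features, gamma, seed=0):
    """Build a map x -> z(x) such that z(x) . z(y) approximates
    the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)  # std of the Gaussian projection weights
    weights = [[rng.gauss(0.0, sigma) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def transform(x):
        # One cosine feature per random projection and phase offset.
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return transform


# Compare the randomized approximation with the exact kernel value.
x, y = [1.0, 0.0], [0.0, 1.0]
gamma = 0.5
transform = make_random_features(dim=2, n_features=2000, gamma=gamma, seed=0)
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
```

Because the features are explicit vectors, a linear method such as regularized least squares can be trained on them directly, avoiding the pairwise kernel matrix that a conventional kernel classifier needs.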

Layered Representation of Bengali Texts in Reduced Dimension Using Deep Feedforward Neural Network for Categorization

21st International Conference of Computer and Information Technology (ICCIT), 2018

Automatic text categorization is a primary step in information retrieval, where it is necessary to find the most relevant documents in an enormous volume. It is also useful in a wide range of web domains, from portal sites to news indexing, and from spam filtering to genre tagging. A significant amount of research has been carried out in this field, mostly dominated by Support Vector Machine (SVM) models. Although these models have been very successful, they require careful feature engineering to achieve optimum results. In this paper, we propose a model for Bengali text categorization that does not require feature engineering and is able to capture nonlinearity in data. We first find a lower-dimensional representation for the tf-idf vectors of each document using denoising autoencoders, and then feed this transformed vector into a deep feedforward network to find its most plausible category. We also show empirically that our model achieves 94.05% accuracy for 12 categories, which surpasses the best existing models on Bengali text categorization.

Text Classification on Tamil

International Journal of Applied Sciences and Smart Technologies, 2021

By and large, we do not know how to speak and read the regional languages that are spoken in our country. We chose Tamil because it is our regional language and many people do not understand it. In our project, Tamil-language text is loaded from Wikipedia. It is then filtered, special characters are removed, and it is organized under fields such as id, title, URL, etc. It is then used to train the model with a CNN algorithm, and the dataset is created. In this way, one can test with a random Wikipedia page, and the text is classified by title and predicted.