Language Identification

Language Identification Based on N-Gram Feature Extraction Method by Using Classifiers

IU Journal of Electrical & Electronics Engineering, 2013

The rising opportunities for communication have provided us with documents in many different languages. Language identification plays a key role in making these documents understandable and in studying natural language identification procedures. The increasing number of documents and the requirements of international communication make new work on language identification necessary. Until today, a great number of studies have addressed the language identification problem for documents. In these studies, characters, words and n-gram sequences have been used with machine learning techniques. In this study, sequences of n-gram frequencies will be used, and the accuracy of five different classification algorithms will be analyzed on documents of different sizes belonging to 15 different languages. An n-gram based feature method will be used to extract the feature vectors of the languages. The most appropriate method for the problem of language identification will be identified by comparing the performances of the Support Vector Machine, Multilayer Perceptron, Centroid Classifier, k-Means and Fuzzy C-Means methods. During the experiments, training and testing data will be selected from the ECI multilingual corpus.
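As an illustration of the character n-gram frequency approach described above, the sketch below builds n-gram count vectors with scikit-learn and fits three of the supervised classifiers named in the abstract. The toy sentences, the 1- to 3-gram range, and the classifier settings are assumptions for demonstration only; they do not reproduce the ECI corpus experiments, and the clustering methods (k-Means, Fuzzy C-Means) are omitted.

```python
# A minimal sketch of n-gram frequency features for language identification,
# assuming a toy corpus; it does not reproduce the paper's ECI corpus setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import NearestCentroid

# Toy training data: one short sample per language (the study used 15 languages).
texts = ["the quick brown fox jumps over the lazy dog",
         "der schnelle braune fuchs springt ueber den faulen hund",
         "le renard brun rapide saute par dessus le chien paresseux"]
labels = ["en", "de", "fr"]

# Character n-gram counts (1- to 3-grams) serve as the feature vector per document.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)

# Three of the five compared classifiers; k-Means and Fuzzy C-Means are omitted here.
for clf in (LinearSVC(), MLPClassifier(max_iter=500), NearestCentroid()):
    clf.fit(X, labels)
    print(type(clf).__name__,
          clf.predict(vectorizer.transform(["ein brauner hund springt"])))
```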

Modeling Text Language Identification for Ethiopian Cushitic Languages

In the past decade, successful natural language processing applications such as spelling and grammar correction in word processors, machine translation on the web, email spam detection, automatic question answering, and identification of language in textual or speech form have become part of our everyday experience. There are various approaches that can be used in text-based language identification. A pure linguistic approach would be the best candidate where high classification accuracies are needed, though it requires a large amount of linguistic expertise. In this research, n-gram frequency rank order and Naïve Bayes were compared as language identifiers for Ethiopian Cushitic languages. The frequency of an n-gram together with its location in a word, which is one of the contributions of this research, and plain n-gram feature sets were compared for both models. A higher identification accuracy rate was achieved on both models when the n-gram and its location in a word were used as the feature set. The corpus for the study was collected from sources such as TV news websites, the Bible, news bulletins, government documents, and documents from the ministry of education to ensure the corpus spans various domains. The WebCorp tool was used to collect corpus from news websites. Per language, the size of the collected text corpus after data cleaning varied from 71,712 words for Afar to 150,000 words for Somali. A learning curve analysis was made using various training set sizes against a fixed test set to determine the size of corpus required for the experiment. Test windows of 15, 100, and 300 characters were used to evaluate the models. For a test string of 15 characters, accuracy of 99.55% on the character n-gram feature set and 99.78% on the character n-gram and its location in a word feature set was achieved for the Naïve Bayes classifier. The identification accuracy rate of NB for both feature sets is 100% when the test string size is more than 100 characters. For a test string of 300 characters using frequency rank order as a classifier, accuracy of 63.55% on the character n-gram feature set and 86.78% on the character n-gram and its location in a word feature set was achieved. Keywords: language identification, feature set, Naïve Bayes, n-gram, corpus, n-gram-location
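The thesis's "n-gram and its location in a word" feature is not fully specified in the abstract; the sketch below assumes one plausible encoding, emitting a gram@position token for each character n-gram and feeding the counts to multinomial Naive Bayes. The training strings are placeholders, and the encoding, smoothing, and corpora are illustrative assumptions rather than the work's exact setup.

```python
# A minimal sketch, assuming "n-gram and its location in a word" is encoded as
# gram@position tokens for a multinomial naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def ngram_location_features(text, n=3):
    """Emit one token per character n-gram, tagged with its start position in the word."""
    feats = []
    for word in text.lower().split():
        for i in range(max(len(word) - n + 1, 1)):
            feats.append(f"{word[i:i + n]}@{i}")
    return " ".join(feats)

# Placeholders: replace with real Afaan Oromo / Somali training text.
train_texts = ["placeholder oromo training text", "placeholder somali training text"]
train_langs = ["orm", "som"]

vec = CountVectorizer(tokenizer=str.split, lowercase=False)
X = vec.fit_transform(ngram_location_features(t) for t in train_texts)
clf = MultinomialNB().fit(X, train_langs)

print(clf.predict(vec.transform([ngram_location_features("placeholder test string")])))
```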

Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts

Language identification is a classification task that matches a text in an unknown language against pre-defined language models. This paper presents the implementation of a tool for language identification for mono- and multi-lingual documents. The tool includes four algorithms for language identification. An evaluation for eight languages, including Ukrainian and Russian, and various text lengths is presented. It is shown that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The tool can also identify language changes within one multi-lingual document.
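The abstract does not explain how language changes inside a single document are located, so the following is only a minimal sketch of one plausible approach: classify a sliding character window with any monolingual identifier and report the offsets where the predicted language changes. The window and step sizes and the classify callable are assumptions, not the tool's actual segmentation algorithm.

```python
# A minimal sketch of flagging language changes inside one multi-lingual document
# by classifying a sliding character window (assumed approach, not the paper's).
def detect_language_changes(text, classify, window=120, step=60):
    """classify: callable mapping a text chunk to a language label."""
    change_points, previous = [], None
    for start in range(0, max(len(text) - window + 1, 1), step):
        lang = classify(text[start:start + window])
        if lang != previous:
            change_points.append((start, lang))
            previous = lang
    return change_points  # list of (character offset, detected language)
```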

Performance comparison between naive bayes and k-nearest neighbor algorithm for the classification of Indonesian language articles

IAES International Journal of Artificial Intelligence, 2021

The match between the contents of an article and the article theme is the main factor in whether or not an article is accepted. Many people still find it difficult to determine the theme appropriate to the article they have. For that reason, we need a document classification algorithm that can group articles automatically and accurately. Many classification algorithms can be used. The algorithm used in this study is naïve Bayes, and the k-nearest neighbor algorithm is used as the baseline. The naïve Bayes algorithm was chosen because it can produce maximum accuracy with little training data, while the k-nearest neighbor algorithm was chosen because it is robust against data noise. The performance of the two algorithms is compared, so it can be seen which algorithm is better at classifying documents. The results obtained show that the naïve Bayes algorithm has better performance with an accuracy rate of 88%, while the k-nearest neighbor algorithm has ...
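The sketch below illustrates the kind of comparison described above, pairing TF-IDF features with naïve Bayes and k-nearest neighbor. The toy Indonesian-style snippets, theme labels, and parameter choices are assumptions for demonstration; the study's article corpus, preprocessing, and reported 88% accuracy are not reproduced here.

```python
# A minimal sketch comparing naive Bayes and k-NN on toy article snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["jaringan saraf tiruan untuk klasifikasi citra",
        "algoritma genetika untuk optimasi jadwal",
        "deteksi objek dengan pembelajaran mendalam",
        "pencarian rute terpendek dengan metode heuristik"]
themes = ["kecerdasan buatan", "optimasi", "kecerdasan buatan", "optimasi"]

for name, clf in [("naive Bayes", MultinomialNB()),
                  ("k-nearest neighbor", KNeighborsClassifier(n_neighbors=1))]:
    model = make_pipeline(TfidfVectorizer(), clf).fit(docs, themes)
    print(name, model.predict(["klasifikasi citra dengan jaringan saraf"]))
```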

Letter Based Text Scoring Method for Language Identification

Lecture Notes in Computer Science, 2004

In recent years, an unexpected amount of growth has been observed in the volume of text documents on the internet, intranets, digital libraries and newsgroups. Obtaining useful information and meaningful patterns from these documents is an important issue. Identification of the languages of these text documents is an important problem that has been studied by many researchers. In these studies, words (terms) have generally been used for language identification. Researchers have investigated different approaches, such as linguistic and statistical ones. In this work, a Letter Based Text Scoring Method is proposed for language identification. This method is based on the letter distributions of texts. Text scoring is performed to identify the language of each text document, and text scores are calculated using the letter distribution of the new text document. Besides its acceptable accuracy, the proposed method is simpler and faster than short-term and n-gram methods.
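The abstract does not give the exact scoring formula, so the sketch below shows one plausible reading of letter-based text scoring: estimate a letter-frequency distribution per language and score a document by the summed log-probabilities of its letters. The training strings, the smoothing floor, and the argmax decision rule are illustrative assumptions, not the proposed method's exact definition.

```python
# A minimal sketch of letter-distribution scoring (assumed formulation).
import math
from collections import Counter

def letter_model(training_text):
    counts = Counter(ch for ch in training_text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def letter_score(text, model, floor=1e-6):
    # Sum of log-probabilities of the document's letters under one language model.
    return sum(math.log(model.get(ch, floor)) for ch in text.lower() if ch.isalpha())

models = {"en": letter_model("the quick brown fox jumps over the lazy dog"),
          "tr": letter_model("pijamalı hasta yağız şoföre çabucak güvendi")}

doc = "which language is this short text written in"
print(max(models, key=lambda lang: letter_score(doc, models[lang])))
```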

Automated Text Categorization with Machine Learning and its Application in Multilingual Text Categorization

The automated categorization (or classification) of texts into predefined categories is one of the booming fields of text mining. Nowadays the availability of digital data is very high, and managing it in predefined categories becomes a challenging task. Machine learning is a technique by which we can build an automated classifier to classify documents with minimal human assistance. The advantages of this approach over the knowledge engineering approach are effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This paper discusses the Naïve Bayes, Rocchio and kNN methods within the machine learning paradigm for automated text categorization of documents into predefined categories. We also discuss multilingual text categorization, which consists in classifying documents in different languages according to the same classification tree.
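As a small illustration of the Rocchio method named above, the sketch below represents each category by the mean TF-IDF vector of its training documents and assigns a new document to the nearest centroid. The toy English snippets and category labels are assumptions for demonstration, not the paper's experimental setup.

```python
# A minimal sketch of a Rocchio-style (centroid-based) text categorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

docs = ["stock markets fell sharply today",
        "the central bank raises interest rates",
        "the team won the championship final",
        "the striker scored twice in the match"]
cats = ["finance", "finance", "sports", "sports"]

# Each category centroid is the mean TF-IDF vector of its documents.
rocchio = make_pipeline(TfidfVectorizer(), NearestCentroid()).fit(docs, cats)
print(rocchio.predict(["the bank cuts rates again", "a late goal won the match"]))
```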

Application of Naïve Bayes, Decision Tree, and K-Nearest Neighbors for Automated Text Classification

Modern Applied Science, 2019

Nowadays, many applications that use large data have been developed due to the existence of the Internet of Things. These applications are translated into different languages and require automated text classification (ATC). The ATC process depends on the content of one or more predefined classes. However, this process is problematic for the Arabic translation of the data. This study aims to solve this issue by investigating the performances of three classification algorithms, namely, k-nearest neighbor (KNN), decision tree (DT), and naïve Bayes (NB) classifiers, on Saudi Press Agency datasets. Results showed that the NB algorithm outperformed DT and KNN algorithms in terms of precision, recall, and F1. In future works, a new algorithm that can improve the handling of the ATC problem will be developed.
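The evaluation described above reports precision, recall, and F1 for the three classifiers. The sketch below shows how such macro-averaged scores can be computed from test labels and each classifier's predictions; the labels and predictions here are invented placeholders, not the Saudi Press Agency results.

```python
# A minimal sketch of comparing classifiers by macro precision, recall, and F1.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["politics", "economy", "sports", "economy", "politics", "sports"]
predictions = {
    "NB":  ["politics", "economy", "sports", "economy", "politics", "sports"],
    "DT":  ["politics", "sports",  "sports", "economy", "economy",  "sports"],
    "KNN": ["economy",  "economy", "sports", "economy", "politics", "politics"],
}
for name, y_pred in predictions.items():
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```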

Language Identification Using Combination of Machine Learning Algorithms and Vectorization Techniques

IEEE Xplore, 2022

Language Identification refers to the process of ascertaining and discerning the language found in a particular text or document. In this work, approaches for language identification using machine learning algorithms and vectorization methods have been compared and contrasted. Three machine learning algorithms, along with two vectorization techniques, have been used. The ML algorithms used are Naïve Bayes, Logistic Regression, and SVM (Support Vector Machine), and the vectorization techniques used are Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectorizer (Bag of Words, BoW). This research puts forward a contrast and comparison of the above-mentioned classification algorithms and vectorization methods. It is also a web development-based work.
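The sketch below illustrates the grid implied by the abstract: each vectorizer (TF-IDF and bag-of-words counts) paired with each classifier (naïve Bayes, logistic regression, SVM). The toy multilingual sentences, the character n-gram analyzer, and all parameter settings are assumptions for demonstration and do not reflect the paper's dataset or web application.

```python
# A minimal sketch of the vectorizer-by-classifier comparison (assumed toy data).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["this is an english sentence", "ceci est une phrase en français",
         "dies ist ein deutscher satz", "esta es una frase en español"]
langs = ["en", "fr", "de", "es"]

for vec in (TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 2)),
            CountVectorizer(analyzer="char_wb", ngram_range=(1, 2))):
    for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
        model = make_pipeline(vec, clf).fit(texts, langs)
        print(type(vec).__name__, type(clf).__name__,
              model.predict(["une autre phrase en français"]))
```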

Optimizing n-gram Order of an n-gram Based Language Identification Algorithm for 68 Written Languages

International Journal on Advances in ICT for Emerging Regions (ICTer), 2009

Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram orders and a mixed n-gram model affect the relative performance and accuracy of language identification. The n-gram based algorithm used in this paper does not depend on n-gram frequency. Instead, the algorithm is based on a Boolean method to determine the output of matching target n-grams to training n-grams. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. It is important to identify these three properties because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved up to a 99.59% correct identification rate on selected languages. The results also show that the performance of language identification can be improved by using a mixed n-gram model of bigrams and trigrams. The mixed n-gram model consumed less disk space and computing time compared to a trigram model.
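To make the Boolean, frequency-independent matching idea concrete, the sketch below assumes the score is simply the fraction of a target text's n-grams (bigrams and trigrams, mirroring the mixed model) that are present in a language's training n-gram set. The exact matching rule, the training texts, and the script/encoding detection described in the paper are not reproduced; this is an illustrative assumption only.

```python
# A minimal sketch of Boolean (presence-based) n-gram matching with a mixed
# bigram+trigram model (assumed scoring rule).
def ngrams(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def boolean_score(target, train_ngrams, orders=(2, 3)):
    target_ngrams = set().union(*(ngrams(target, n) for n in orders))
    if not target_ngrams:
        return 0.0
    # Fraction of the target's n-grams found in the training set; no frequencies used.
    return len(target_ngrams & train_ngrams) / len(target_ngrams)

train = {
    "en": set().union(ngrams("the quick brown fox jumps over the lazy dog", 2),
                      ngrams("the quick brown fox jumps over the lazy dog", 3)),
    "ms": set().union(ngrams("semua manusia dilahirkan bebas dan samarata", 2),
                      ngrams("semua manusia dilahirkan bebas dan samarata", 3)),
}
text = "a lazy brown dog"
print(max(train, key=lambda lang: boolean_score(text, train[lang])))
```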

Intelligent Arabic Text Categorization: Initial Study and Proposed Methodology on Classifying Arabic Text Using Enhanced Naïve Bayes Classification Approach

Journal of Advanced Research in Dynamical and Control Systems, 2018

The number of Arabic-speaking users in the world is increasing rapidly, in depth and breadth, largely due to the growing impact of internet-based resources. Because of this increase in Arabic users on the internet, computational techniques are needed to categorize Arabic text in the same way as English. The Arabic language is complex in nature compared to other spoken and written languages, so it requires detailed research investigation into root extraction and text classification approaches for text datasets with single or multiple labels. The main objective of this paper is to propose a new study to improve the automated processing of Arabic texts. Several machine learning approaches exist; for this research, specifically, an improved naïve Bayes classifier is applied. In the first step, the unclassified documents are pre-processed by removing punctuation and stop words. Second, after pre-processing, each document is represented by a vector of words and frequencies, as required by the naïve Bayes classifier approach. Third, stemming was applied to reduce the feature vector dimensionality. Fourth, classification is applied to categorize the Arabic text. The proposed work is an initial study, and a basic experiment was conducted with an in-house Arabic text collection (i.e., a self-developed Arabic corpus). Based on this initial study using the naïve Bayes approach for Arabic text categorization, the results of the classifier were promising compared to existing classifiers in terms of accuracy, precision, recall and error rates.
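The sketch below follows the four steps listed above: punctuation and stop-word removal, word-frequency vectors, light stemming, and naïve Bayes classification. The tiny stop-word list, the definite-article stripping used as "stemming", and the toy documents are placeholders; the paper's in-house corpus and enhanced naïve Bayes variant are not reproduced here.

```python
# A minimal sketch of the described Arabic text categorization pipeline (assumed details).
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

ARABIC_STOPWORDS = {"في", "من", "على", "إلى"}            # tiny placeholder stop-word list

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text)                  # 1. remove punctuation
    tokens = [t for t in text.split() if t not in ARABIC_STOPWORDS]   # 2. remove stop words
    tokens = [re.sub(r"^ال", "", t) for t in tokens]       # 3. very light "stemming": strip definite article
    return " ".join(tokens)

docs = ["الفريق فاز في المباراة النهائية", "البنك المركزي يرفع أسعار الفائدة"]
labels = ["رياضة", "اقتصاد"]

# 4. word-frequency vectors + naive Bayes classification
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit([preprocess(d) for d in docs], labels)
print(model.predict([preprocess("فوز كبير في المباراة")]))
```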