A comparative study on language identification methods

Evaluation of language identification methods using 285 languages

2017

Language identification is the task of assigning a language label to a text. It is an important preprocessing step in many automatic systems that operate on written text. In this paper, we present an evaluation of seven language identification methods, tested on 285 languages with an out-of-domain test set. The evaluated methods are also described using a unified notation. We show that a method performing well with a small number of languages does not necessarily scale to a large number of languages. The HeLI method performs best on test lengths of over 25 characters, reaching an F1-score of 99.5 at just 60 characters.

Comparing Two Language Identification Schemes

Proceedings of the 3rd International Conference on the …, 1995

Here, we compare two techniques for automatic language identification of machine-readable text using easily calculable attributes. One technique uses letter trigrams (sequences of three letters) and was previously described in (Beesley 1988) and (Cavnar 1993). The other technique is based on common short words, such as those given in (Ingle 1976). Variations of these techniques are presented here. Both techniques are applied to the same test suite and their results are evaluated.
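The common-short-words technique the abstract mentions can be sketched in a few lines: each candidate language contributes a list of its most frequent short function words, and the language whose list covers the most tokens of the input wins. The word lists below are invented for illustration, not the ones from (Ingle 1976).

```python
# Sketch of the common-short-words technique. The word lists are
# illustrative examples only, not the lists used in the cited work.

SHORT_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is", "it"},
    "french":  {"le", "la", "de", "et", "les", "des", "un"},
    "german":  {"der", "die", "und", "das", "ein", "ist", "zu"},
}

def identify_by_short_words(text):
    """Pick the language whose short-word list matches the most tokens."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in SHORT_WORDS.items()}
    return max(scores, key=scores.get)
```

With realistic lists this works well on longer texts but degrades on very short inputs, which is exactly the weakness the trigram technique is meant to address.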

Language Identification based on n-gram Frequency Ranking

… Annual Conference of …, 2007

We present a novel approach for language identification based on a text categorization technique, namely an n-gram frequency ranking. We use a parallel phone recognizer, the same as in PPRLM, but instead of the language model we create a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the ranking of the input sentence and each language ranking, based on the difference in relative position for each n-gram. The objective of this ranking is to reliably model a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for a reliable estimation. We demonstrate that this approach outperforms PPRLM (6% relative improvement) thanks to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: a ranking using the absolute number of occurrences and a ranking using discriminative values (11% relative improvement).
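The ranking distance described here is in the spirit of the classic "out-of-place" measure: build a rank table of the most frequent n-grams per language, then sum the rank differences between the input's table and each language's table. The sketch below applies it to plain character n-grams rather than to phone-recognizer output as in the paper, and the parameters (`n_max`, `top_k`, the out-of-place penalty) are illustrative assumptions.

```python
from collections import Counter

def ngram_ranking(text, n_max=3, top_k=100):
    """Rank the most frequent character n-grams (orders 1..n_max) of a text."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common(top_k)]
    return {g: r for r, g in enumerate(ranked)}

def ranking_distance(sample_rank, lang_rank, out_of_place=1000):
    """Sum of the differences in relative position for each n-gram;
    n-grams missing from the language ranking pay a fixed penalty."""
    return sum(abs(r - lang_rank.get(g, out_of_place))
               for g, r in sample_rank.items())

def identify(text, profiles):
    sample = ngram_ranking(text)
    return min(profiles, key=lambda lang: ranking_distance(sample, profiles[lang]))
```

Classification is then a minimum-distance decision over the per-language rank tables.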

A study in language identification

Proceedings of the Seventeenth Australasian Document Computing Symposium on - ADCS '12, 2012

Language identification is the task of automatically determining the language in which a previously unseen document was written. We compared several prior methods on samples from the Wikipedia and EuroParl collections. Most of these methods work well. However, we find that these (and presumably other) document collections are heterogeneous in size, that short documents are systematically different from long ones, and that the techniques that work well on long documents differ from those that work well on short ones. We believe that algorithms will improve if length is taken into account.

Language identification based on a discriminative text categorization technique

In this paper, we describe new results and improvements to a language identification (LID) system based on PPRLM, previously introduced in [1] and [2]. In this case, we use the parallel phone recognizers provided by the Brno University of Technology for the Czech, Hungarian, and Russian languages, and instead of traditional n-gram language models we use a language model created from a ranking of the most frequent and discriminative n-grams. In this language model approach, the distance between the ranking for the input sentence and the ranking for each language is computed, based on the difference in relative position for each n-gram. This approach is able to reliably model longer-span information than traditional language models, yielding more reliable estimations. We also describe the modifications that we have been introducing over time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clustering technique for the ranking scores. Results show that this technique provides a 12.9% relative improvement over PPRLM. Finally, we also describe results where the traditional PPRLM and our ranking technique are combined.

Optimizing n-gram Order of an n-gram Based Language Identification Algorithm for 68 Written Languages

International Journal on Advances in Ict for Emerging Regions (icter), 2009

Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram orders and a mixed n-gram model affect the relative performance and accuracy of language identification. The n-gram based algorithm used in this paper does not depend on n-gram frequency. Instead, the algorithm uses a Boolean method to determine the output of matching target n-grams to training n-grams. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. It is important to identify these three properties because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved up to a 99.59% correct identification rate on selected languages. The results also show that the performance of language identification can be improved by using a mixed n-gram model of bigrams and trigrams. The mixed n-gram model consumed less disk space and computing time than a trigram model.
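The frequency-free, Boolean matching the abstract describes can be sketched as set membership: keep only the *set* of n-grams seen in training, and score an input by the fraction of its n-grams present in that set, mixing bigrams and trigrams. The function names and the presence-ratio scoring rule are assumptions for illustration; the paper does not spell out its exact scoring formula.

```python
# Sketch of Boolean (presence-only) n-gram matching with a mixed
# bigram + trigram model. Scoring by hit ratio is an assumption.

def boolean_ngrams(text, n):
    """The set of character n-grams in a text (no frequencies kept)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def train_profile(corpus, orders=(2, 3)):
    """One n-gram set per order: the mixed bigram/trigram model."""
    return {n: boolean_ngrams(corpus, n) for n in orders}

def boolean_score(text, profile, orders=(2, 3)):
    """Fraction of the text's n-grams found in the training sets."""
    hits = total = 0
    for n in orders:
        grams = boolean_ngrams(text, n)
        hits += len(grams & profile[n])
        total += len(grams)
    return hits / total if total else 0.0
```

Because only sets are stored, the model is small, which is consistent with the reported disk-space savings of the mixed model over a pure trigram model.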

Language identification from text using n-gram based cumulative frequency addition

2004

This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naïve Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-order statistical classifiers. Language classification using N-gram based rank-order statistics has been shown to be highly accurate and insensitive to typographical errors, and, as a result, this method has been extensively researched and documented in the language processing literature. However, classification using rank-order statistics is slower than other methods due to the inherent requirement of frequency counting and sorting of N-grams in the test document profile. Accuracy and speed of classification are crucial for a classifier to be useful in a high-volume categorization environment. Thus, it is important to investigate the performance of the N-gram based classification methods. In particular, if it were possible to eliminate the counting and sorting operations of the rank-order statistics methods, classification speed could be increased substantially. The classifier described here accomplishes that goal by using a new Cumulative Frequency Addition method.
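The key point of cumulative frequency addition is that the test document needs no profile of its own: the classifier simply adds up the training-set normalized frequencies of each n-gram as it streams past, so no counting or sorting of the test document's n-grams is required. A minimal sketch, with illustrative names and trigram order chosen for the example:

```python
from collections import Counter

def ngram_frequencies(corpus, n=3):
    """Normalized character n-gram frequencies from training text."""
    counts = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cumulative_frequency_score(text, freq, n=3):
    """Add up the training frequencies of the text's n-grams as they
    stream by; no per-document counting or sorting is needed."""
    return sum(freq.get(text[i:i + n], 0.0) for i in range(len(text) - n + 1))
```

Classification picks the language whose frequency table yields the largest cumulative sum, which is where the speed advantage over rank-order classifiers comes from.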

Letter Based Text Scoring Method for Language Identification

Lecture Notes in Computer Science, 2004

In recent years, the volume of text documents on the internet, intranets, digital libraries and newsgroups has grown enormously. Obtaining useful information and meaningful patterns from these documents is an important issue, and identifying the language of these text documents is a problem studied by many researchers. This research has generally used words (terms) for language identification, with approaches ranging from linguistic to statistical. In this work, a Letter Based Text Scoring Method is proposed for language identification. The method is based on the letter distributions of texts: each new text document is scored using the letter distributions, and the score identifies its language. Besides its acceptable accuracy, the proposed method is simpler and faster than short-word and n-gram methods.
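Letter-based scoring needs only a unigram table per language, which is why it is cheaper than n-gram or short-word methods. A plausible minimal sketch (the exact scoring formula of the paper is not given in the abstract, so summing per-letter training probabilities is an assumption):

```python
from collections import Counter

def letter_distribution(corpus):
    """Relative frequency of each alphabetic character in training text."""
    letters = [c for c in corpus.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {c: n / total for c, n in counts.items()}

def letter_score(text, dist):
    """Score a text by summing the training probabilities of its letters;
    the language whose distribution gives the highest score wins."""
    return sum(dist.get(c, 0.0) for c in text.lower() if c.isalpha())
```

The example below relies on distinctive letters (Finnish "ä") to separate the languages; with closely related alphabets the method needs longer inputs to be reliable.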

Language Identification Based on N-Gram Feature Extraction Method by Using Classifiers

Iu Journal of Electrical Electronics Engineering, 2013

The rise of communication opportunities has provided us with many documents in many different languages. Language identification plays a key role in making these documents understandable and in studying natural language identification procedures. The increasing number of documents and international communication requirements make new work on language identification necessary. To date, there have been a great number of studies on document-based language identification, using characters, words and n-gram sequences with machine learning techniques. In this study, sequences of n-gram frequencies are used, and the accuracy of five different classification algorithms is analyzed on documents of different sizes belonging to 15 different languages. An n-gram based feature method is used to extract the feature vector for each language. The most appropriate method for the language identification problem is identified by comparing the performance of Support Vector Machines, Multilayer Perceptron, Centroid Classifier, k-Means and Fuzzy C-Means. In the experiments, training and testing data are selected from the ECI multilingual corpus.

Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts

Language identification is the task of classifying a text in an unknown language against pre-defined language models. This paper presents the implementation of a tool for language identification in mono- and multi-lingual documents. The tool includes four algorithms for language identification. An evaluation on eight languages, including Ukrainian and Russian, and various text lengths is presented. The results show that n-gram-based approaches outperform word-based algorithms for short texts; for longer texts, the performance is comparable. The tool can also identify language changes within a single multi-lingual document.

Automatic Language Identification: An Alternative Unsupervised Approach Using a New Hybrid Algorithm

International Journal of Computer Science & Applications, 2010

This paper deals with our research on unsupervised classification for automatic language identification. The study of this new hybrid algorithm shows that combining K-means and artificial ants while taking advantage of an n-gram text representation is promising. We propose an alternative to the standard use of both algorithms, and assess the approach on a multilingual text corpus. Given that this method requires no a priori information (number of classes, initial partition), can quickly process large amounts of data, and produces results that can be visualised, we consider these results very promising, and they open many perspectives.

AUTOMATIC LANGUAGE IDENTIFICATION SYSTEM

2000

This paper presents the language identification (LID) system developed at Speech@FIT. The system consists of two parts. Acoustic LID determines the language directly on the basis of features derived from the speech signal; we have improved existing approaches by adding discriminative training of acoustic models. In phonotactic LID, speech is first transcribed by a phoneme recognizer into strings or ...

An n-gram-based language identification algorithm for variable-length and variable-language texts

The aim of this paper is to describe a new language identification method that uses language models based on character statistics, or more specifically, character n-gram frequency tables or Markov chains. An important advantage of this method is that it uses a very simple and straightforward algorithm, similar to those that have been used for this purpose over the past 20 years. In addition, it can handle input that the traditional approaches inherently classify incorrectly, such as target texts in an unknown language or in more than one language. We systematically compare and contrast our method with others proposed in the literature, and measure its accuracy in a series of experiments. These experiments demonstrate that our solution works not only for whole documents but also delivers usable results for input strings as short as a single word, and the identification rate reaches 99.9% for strings about 100 characters (a short sentence) in length.
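A character Markov chain of the kind mentioned here scores a text by the log-probability of each character given its preceding context, estimated from n-gram frequency tables. The sketch below uses bigrams with add-one smoothing; the smoothing scheme and vocabulary-size constant are assumptions for the example, not details from the paper.

```python
import math
from collections import Counter

def train_markov(corpus, n=2):
    """Character n-gram counts plus (n-1)-gram context counts."""
    grams = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    contexts = Counter(corpus[i:i + n - 1] for i in range(len(corpus) - n + 2))
    return grams, contexts

def log_likelihood(text, model, n=2, vocab=256):
    """Log-probability of the text under the character Markov chain,
    with add-one smoothing (vocab is an assumed alphabet size)."""
    grams, contexts = model
    ll = 0.0
    for i in range(len(text) - n + 1):
        g = text[i:i + n]
        ll += math.log((grams[g] + 1) / (contexts[g[:n - 1]] + vocab))
    return ll
```

The language whose model assigns the highest log-likelihood wins; because the score is a running sum, it degrades gracefully on very short inputs, consistent with the single-word results reported.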

Language identification incorporating lexical information

Proc. ICSLP, 1998

In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to provide linguistic constraints during acoustic decoding. Experiments were carried out on a 4-language telephone speech corpus. Using lexical information achieves a relative error reduction of about 20% on spontaneous and read speech compared to the reference phone-based system. Identification rates of 92%, 96% and 99% are achieved for spontaneous, read and task-specific speech segments respectively, with prior speech detection.

Language Identification Using Minimum Linguistic Information

1998

Automatic spoken language identification is the problem of identifying the language being spoken from a sample of speech by an unknown speaker. Current language identification systems vary in their complexity; those that use higher-level information have the best performance, but that information is hard to collect for each new language. In this work, we present a state-of-the-art language identification system that uses very little linguistic information and is thus easily extendable to new languages. In fact, the presented system needs only one language-specific phone recogniser (in our case the Portuguese one), and is trained with speech from each of the other languages. We studied the problem of language identification in the context of the European languages (including, for the first time, European Portuguese), which allowed us to study the effect of language proximity in Indo-European languages. The results reveal a significant impact on the identification of som...

Language Identification Using Combination of Machine Learning Algorithms and Vectorization Techniques

IEEE Xplore, 2022

Language Identification refers to the process of ascertaining and discerning the language found in a particular text or document. In this work, approaches for language identification using machine learning algorithms and vectorization methods have been compared and contrasted. Three machine learning algorithms and two vectorization techniques have been used. The ML algorithms are Naïve Bayes, Logistic Regression, and SVM (Support Vector Machine), and the vectorization techniques are Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectorizer (Bag of Words (BoW)). This research puts forward a comparison of the above-mentioned classification algorithms and vectorization methods. The work also includes a web-based implementation.
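The Count Vectorizer plus Naïve Bayes pairing from this abstract reduces to counting words and scoring with smoothed multinomial log-probabilities. In practice one would use scikit-learn's `CountVectorizer` and `MultinomialNB`; the dependency-free sketch below shows the same idea with hypothetical function names, using Laplace smoothing:

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words counts: the Count Vectorizer step."""
    return Counter(text.lower().split())

def train_nb(labelled_texts):
    """Multinomial Naive Bayes statistics over bag-of-words counts."""
    word_counts, totals, vocab = {}, {}, set()
    for lang, text in labelled_texts:
        counts = bow(text)
        word_counts.setdefault(lang, Counter()).update(counts)
        vocab.update(counts)
    for lang, counts in word_counts.items():
        totals[lang] = sum(counts.values())
    return word_counts, totals, vocab

def predict_nb(text, model):
    """Pick the language maximizing the smoothed log-likelihood."""
    word_counts, totals, vocab = model
    v = len(vocab)
    best, best_ll = None, -math.inf
    for lang in word_counts:
        ll = sum(cnt * math.log((word_counts[lang][w] + 1) / (totals[lang] + v))
                 for w, cnt in bow(text).items())
        if ll > best_ll:
            best, best_ll = lang, ll
    return best
```

TF-IDF weighting would replace the raw counts in `bow` with weighted values; class priors are omitted here for brevity, which matters only with unbalanced training data.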

Modeling Text Language Identification for Ethiopian Cushitic Languages

In the past decade, successful natural language processing applications such as spelling and grammar correction in word processors, machine translation on the web, email spam detection, automatic question answering, and identification of language in textual or speech form have become part of our everyday experience. Various approaches can be used in text-based language identification. A purely linguistic approach would be the best candidate where high classification accuracy is needed, though it requires a large amount of linguistic expertise. In this research, n-gram frequency rank order and Naïve Bayes were compared as language identifiers for Ethiopian Cushitic languages. A feature set combining an n-gram's frequency with its location in a word, one of the contributions of this research, was compared against a plain n-gram feature set for both models. A higher identification accuracy rate was achieved on both models when the n-gram together with its location in a word was used as the feature set. The corpus for the study was collected from sources such as TV news websites, the Bible, news bulletins, government documents, and documents from the ministry of education, to ensure the corpus spans various domains. The WebCorp tool was used to collect corpus from news websites. Per language, the size of the collected text corpus after data cleaning varied from 71,712 words for Afar to 150,000 words for Somali. A learning curve analysis over various training set sizes against a fixed test set was used to determine the corpus size required for the experiment. Documents of 15, 100, and 300 character windows were used to evaluate the models. For test strings of 15 characters, the Naïve Bayes classifier achieved an accuracy of 99.55% on the character n-gram feature set and 99.78% on the character n-gram-and-location feature set; its accuracy for both feature sets is 100% once the test string size exceeds 100 characters. For test strings of 300 characters, the frequency rank order classifier achieved an accuracy of 63.55% on the character n-gram feature set and 86.78% on the character n-gram-and-location feature set.

Keywords: language identification, feature set, Naïve Bayes, n-gram, corpus, n-gram location
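The "n-gram and its location in a word" feature this abstract credits for the accuracy gain can be encoded in several ways; the abstract does not specify the scheme, so the begin/middle/end tagging below is purely an assumed illustration of the idea:

```python
# Hypothetical encoding of the n-gram-plus-location feature:
# each n-gram is tagged B (word-initial), M (word-internal),
# or E (word-final). The actual encoding in the work may differ.

def ngram_location_features(word, n=2):
    """Character n-grams of a word, each paired with a position tag."""
    feats = []
    length = len(word)
    for i in range(length - n + 1):
        if i == 0:
            pos = "B"
        elif i == length - n:
            pos = "E"
        else:
            pos = "M"
        feats.append((word[i:i + n], pos))
    return feats
```

Feeding `(n-gram, position)` pairs instead of bare n-grams into either classifier enlarges the feature space, which is one plausible reason the location-aware feature set separates closely related languages better.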

Language identification from small text samples

There is an increasing need to deal with multi-lingual documents today. If we could segment multi-lingual documents language-wise, it would be very useful both for exploring linguistic phenomena, such as code-switching and code-mixing, and for computational processing of each segment as appropriate. Identifying the language of a given small piece of text is therefore an important problem. This paper is about language identification from small text samples.