Letter Based Text Scoring Method for Language Identification

Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts

Language identification is a classification task that matches a text in an unknown language against pre-defined language models. This paper presents the implementation of a tool for language identification in mono- and multi-lingual documents. The tool includes four algorithms for language identification. An evaluation covering eight languages, including Ukrainian and Russian, and various text lengths is presented. The evaluation shows that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The tool can also identify language changes within a single multi-lingual document.

An n-gram-based language identification algorithm for variable-length and variable-language texts

The aim of this paper is to describe a new language identification method that uses language models based on character statistics, or more specifically, character n-gram frequency tables or Markov chains. An important advantage of this method is that it uses a very simple and straightforward algorithm, similar to those that have been used for this purpose over the past 20 years. In addition, it can handle input such as target texts in an unknown language or in more than one language, which the traditional approaches inherently classify incorrectly. We systematically compare and contrast our method with others that have been proposed in the literature, and measure its accuracy in a series of experiments. These experiments demonstrate that our solution works not only for whole documents but also delivers usable results for input strings as short as a single word, and that the identification rate reaches 99.9% for strings of about 100 characters, i.e. a short sentence.
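The traditional character-statistics approach that this abstract builds on can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a first-order Markov chain over characters (bigrams) with add-one smoothing, and the names `train_bigram_model`, `log_likelihood`, `identify`, and the toy training sentences are all hypothetical. It does not attempt the paper's handling of unknown or mixed-language input.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Count character bigrams and their left-hand contexts for one language."""
    text = text.lower()
    pair_counts = Counter(zip(text, text[1:]))   # (previous char, next char)
    context_counts = Counter(text[:-1])          # how often each char has a successor
    alphabet_size = len(set(text))
    return pair_counts, context_counts, alphabet_size

def log_likelihood(text, model):
    """Log-probability of the text under the bigram model, with add-one smoothing."""
    pair_counts, context_counts, alphabet_size = model
    text = text.lower()
    score = 0.0
    for prev, nxt in zip(text, text[1:]):
        # The extra +1 in the denominator reserves mass for unseen characters.
        p = (pair_counts[(prev, nxt)] + 1) / (context_counts[prev] + alphabet_size + 1)
        score += math.log(p)
    return score

def identify(text, models):
    """Return the language whose model assigns the text the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(text, models[lang]))

# Toy training data (assumption: far too small for real use).
TOY_SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
}
MODELS = {lang: train_bigram_model(sample) for lang, sample in TOY_SAMPLES.items()}
```

Calling `identify("the lazy dog", MODELS)` scores the string under each model and picks the best one. Note that a plain maximum always returns one of the trained languages, which is exactly the limitation for unknown-language input that the abstract points out.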

Automatic language identification of written texts

2004

The language of a document is one of the most widely used search keys on the Internet. This article describes efficient and easily extensible solutions to the problem of identifying the language of written texts based on closed grammatical classes. An identification tool was developed for recognizing texts written in Portuguese, Spanish, French and English.
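Closed grammatical classes are the small, fixed sets of function words in a language (articles, prepositions, conjunctions, pronouns). A rough sketch of identification by counting such words is shown below. The word lists, the `FUNCTION_WORDS` table, and the `identify_by_function_words` helper are illustrative assumptions and are not taken from the article.

```python
import re

# Hypothetical, deliberately tiny function-word lists for the four languages.
FUNCTION_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is", "that", "it", "with", "for"},
    "french": {"le", "la", "les", "de", "des", "et", "un", "une", "est", "dans", "que", "pour"},
    "spanish": {"el", "la", "los", "las", "de", "y", "en", "que", "un", "una", "es", "por", "con"},
    "portuguese": {"o", "a", "os", "as", "de", "e", "em", "que", "um", "uma", "é", "para", "com", "não"},
}

def identify_by_function_words(text, word_lists=FUNCTION_WORDS):
    """Return the language whose function words occur most often in the text."""
    tokens = re.findall(r"\w+", text.lower())
    hits = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in word_lists.items()
    }
    return max(hits, key=hits.get)
```

Because the lists are short, this kind of lookup is cheap and easy to extend with another language, but it needs enough running text for function words to appear, so very short inputs may produce ties or misses.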

Language identification based on a discriminative text categorization technique

In this paper, we describe new results and improvements to a language identification (LID) system based on PPRLM (parallel phone recognition followed by language modeling), previously introduced in [1] and [2]. As parallel phone recognizers, we use the ones provided by the Brno University of Technology for Czech, Hungarian, and Russian, and instead of traditional n-gram language models we use a language model built from a ranking of the most frequent and discriminative n-grams. In this approach, the distance between the ranking for the input sentence and the ranking for each language is computed from the difference in relative positions of each n-gram. This approach can reliably model longer-span information than traditional language models, yielding more reliable estimates. We also describe the modifications that we have introduced over time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clustering technique for the ranking scores. Results show that this technique provides a 12.9% relative improvement over PPRLM. Finally, we also describe results where the traditional PPRLM and our ranking technique are combined.
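The idea of comparing rankings by the difference in n-gram positions can be sketched with a simple "out-of-place" distance. This is an assumption-laden illustration: it ranks n-grams by raw frequency only (the paper's discriminative ranking formulas are not reproduced), it works on any token sequence rather than on real phone recognizer output, and `ranked_ngrams`, `out_of_place`, and the `top_k` template size are hypothetical names.

```python
from collections import Counter

def ranked_ngrams(tokens, n=2, top_k=300):
    """Map each of the top_k most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Sort by descending count; ties are broken by the n-gram itself for determinism.
    ordered = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ordered[:top_k])}

def out_of_place(sample_ranks, language_ranks, max_penalty=None):
    """Sum of rank differences; n-grams missing from the language get max_penalty."""
    if max_penalty is None:
        max_penalty = len(language_ranks)
    distance = 0
    for gram, rank in sample_ranks.items():
        if gram in language_ranks:
            distance += abs(rank - language_ranks[gram])
        else:
            distance += max_penalty
    return distance
```

The language whose ranking gives the smallest distance to the input ranking would be selected. Tokens here can be characters of a string or phone labels in a list, since only slicing and equality are used.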

Language Identifcation.pdf

The current proliferation of text data on the Internet in languages spoken across the globe calls for intelligent systems that can recognise the language in which a text is written, so that more complex tasks such as language translation can function properly. Language identification, the process of determining the natural language in which a text document is written, has always been a pivotal research topic in text mining and natural language processing. In the literature, several statistical models have been reported for solving the text language identification problem, such as the n-gram model and modified n-gram models. In this project work, a machine learning approach to language identification was used. Two supervised learning algorithms, multinomial Naïve Bayes and k-Nearest Neighbours, were implemented for text language classification. The training and test data were obtained using Wikipedia’s multi-language features and from various other sources on the Internet. The implementation was done in Java. The classifiers were trained to recognize different languages, three of which are local, recognized Nigerian languages (i.e. Hausa, Igbo and Yoruba). A performance comparison of the two algorithms in terms of speed and prediction accuracy under different working conditions was carried out, and it was found that the Naïve Bayes classifier outperforms the k-Nearest Neighbours algorithm.

Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages

International Journal on Advances in ICT for Emerging Regions (ICTer), 2009

Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram orders and a mixed n-gram model affect the relative performance and accuracy of language identification. The n-gram based algorithm used in this paper does not depend on n-gram frequency. Instead, it uses a Boolean method to determine the outcome of matching target n-grams against training n-grams. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. It is important to identify these three properties because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved a correct identification rate of up to 99.59% on selected languages. The results also show that the performance of language identification can be improved by using a mixed n-gram model of bigrams and trigrams. The mixed n-gram model consumed less disk space and computing time than a trigram model.
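A frequency-free, Boolean style of matching can be sketched as a set-membership test: each target n-gram either occurs in a language's training set or it does not, and the language with the highest match rate wins. The sketch below is an assumption, not the paper's algorithm. It operates on Python strings rather than encoded bytes, so it does not detect script or character encoding, and `ngram_set`, `match_rate`, `identify`, and the toy profiles are hypothetical.

```python
def extract_ngrams(text, orders=(2, 3)):
    """List all n-grams of the given orders (mixed bigram and trigram by default)."""
    return [text[i:i + n] for n in orders for i in range(len(text) - n + 1)]

def ngram_set(training_text, orders=(2, 3)):
    """A language profile is just the set of n-grams seen in training; no counts."""
    return set(extract_ngrams(training_text, orders))

def match_rate(target, profile, orders=(2, 3)):
    """Fraction of target n-grams that are present (True/False) in the profile."""
    grams = extract_ngrams(target, orders)
    if not grams:
        return 0.0
    return sum(1 for gram in grams if gram in profile) / len(grams)

def identify(target, profiles, orders=(2, 3)):
    """Return the language whose profile matches the largest share of n-grams."""
    return max(profiles, key=lambda lang: match_rate(target, profiles[lang], orders))

# Toy profiles (assumption: real profiles would come from a large training corpus).
PROFILES = {
    "en": ngram_set("the cat sat on the mat"),
    "es": ngram_set("el gato se sienta en la alfombra"),
}
```

Storing sets instead of frequency tables keeps the profiles small, which is in the spirit of the disk-space observation in the abstract, although the sketch makes no claim about the paper's actual storage format.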

Language Identification Based on N-Gram Feature Extraction Method by Using Classifiers

IU Journal of Electrical Electronics Engineering, 2013

The growing opportunities for communication have given us many documents in many different languages. Language identification plays a key role in making these documents understandable and in studying natural language identification procedures. The increasing number of documents and the requirements of international communication make new work on language identification necessary. To date, a great number of studies have addressed document-based language identification. In these studies, characters, words and n-gram sequences have been used with machine learning techniques. In this study, sequences of n-gram frequencies are used, and the accuracy of five different classification algorithms is analyzed on documents of different sizes in 15 different languages. An n-gram based feature method is used to extract the feature vectors of the languages. The most appropriate method for language identification is determined by comparing the performance of Support Vector Machines, Multilayer Perceptron, Centroid Classifier, k-Means and Fuzzy C-Means. In the experiments, training and testing data are selected from the ECI multilingual corpus.

Language Identification of Web Pages Based on Improved N-gram Algorithm

2011

Language identification of written text in the domain of Latin-script-based languages is a well-studied research field. However, new challenges arise when it is applied to non-Latin-script-based languages, especially for web pages in Asian languages. The objective of this paper is to propose and evaluate the effectiveness of adapting the Universal Declaration of Human Rights and Biblical texts as a training corpus, together with two new heuristics to improve an n-gram based language identification algorithm for Asian languages. Extending the training corpus improved accuracy. Improvement was also achieved by using a byte-sequence based HTML parser and an HTML character entities converter. The performance of the algorithm was evaluated on a written text corpus of 1,660 web pages, spanning 182 languages from Asia, Africa, the Americas, Europe and Oceania. Experimental results showed that the algorithm achieved a language identification accuracy of 94.04%.

Language identification in web pages

2005

This paper discusses the problem of automatically identifying the language of a given Web document. Previous experiments in language guessing focused on analyzing "coherent" text sentences, whereas this work was validated on texts from the Web, often presenting harder problems. Our language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure.

Language identification from text using n-gram based cumulative frequency addition

2004

This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naïve Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-order statistical classifiers. Language classification using N-gram based rank-order statistics has been shown to be highly accurate and insensitive to typographical errors, and, as a result, this method has been extensively researched and documented in the language processing literature. However, classification using rank-order statistics is slower than other methods because it inherently requires frequency counting and sorting of the N-grams in the test document profile. Accuracy and speed of classification are crucial for a classifier to be useful in a high-volume categorization environment. Thus, it is important to investigate the performance of N-gram based classification methods. In particular, if the counting and sorting operations in the rank-order statistics methods can be eliminated, classification speed could be increased substantially. The classifier described here accomplishes that goal by using a new Cumulative Frequency Addition method.
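The core idea, adding up precomputed n-gram frequencies without counting or sorting the n-grams of the test string, can be sketched as below. The abstract does not state how frequencies are normalized, so this sketch assumes each n-gram's frequency is its share of all n-grams in that language's training text. The names `train_profile`, `cfa_score`, `classify`, and the toy training texts are hypothetical.

```python
from collections import Counter

def train_profile(training_text, n=3):
    """Map each training n-gram to its relative frequency within this language."""
    counts = Counter(training_text[i:i + n] for i in range(len(training_text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

def cfa_score(text, profile, n=3):
    """Add up the stored frequency of every n-gram in the text, in a single pass.

    The test string's own n-grams are never counted or sorted; unseen n-grams add 0.
    """
    return sum(profile.get(text[i:i + n], 0.0) for i in range(len(text) - n + 1))

def classify(text, profiles, n=3):
    """Return the language with the largest cumulative frequency sum."""
    return max(profiles, key=lambda lang: cfa_score(text, profiles[lang], n))

# Toy profiles (assumption: real profiles need far more training text).
CFA_PROFILES = {
    "en": train_profile("the cat sat on the mat"),
    "es": train_profile("el gato se sienta en la alfombra"),
}
```

Classification is one dictionary lookup and one addition per n-gram, which is where the speed advantage over rank-order methods would come from: there is no profile to build and sort for the test document.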