Language identification from text using n-gram based cumulative frequency addition

Language Identification based on n-gram Frequency Ranking

… Annual Conference of …, 2007

We present a novel approach to language identification based on a text categorization technique, namely n-gram frequency ranking. We use a parallel phone recognizer, the same as in PPRLM, but instead of a language model we build a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the ranking of the input sentence and the ranking of each language, based on the difference in relative position of each n-gram. The aim of this ranking is to model reliably a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for a reliable estimate. We demonstrate that this approach outperforms PPRLM (6% relative improvement) thanks to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: ranking by the absolute number of occurrences and ranking by discriminative values (11% relative improvement).
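The ranking distance described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `top_k` cutoff of 400, and the penalty applied to n-grams missing from a language ranking are all assumptions made for illustration, and the input is taken to be a list of phone labels already produced by a recognizer.

```python
from collections import Counter

def ngrams(tokens, max_n=5):
    """Yield every n-gram of order 1..max_n as a tuple of phone labels."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def build_ranking(tokens, max_n=5, top_k=400):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(ngrams(tokens, max_n))
    # Sort by descending count, then by the n-gram itself so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ordered[:top_k])}

def rank_distance(sample_rank, language_rank, penalty=None):
    """Mean absolute rank difference between a sample and a language ranking.

    An n-gram absent from the language ranking costs `penalty`
    (assumed here to default to the size of the language ranking).
    """
    if penalty is None:
        penalty = len(language_rank)
    if not sample_rank:
        return float(penalty)
    total = 0
    for gram, rank in sample_rank.items():
        if gram in language_rank:
            total += abs(rank - language_rank[gram])
        else:
            total += penalty
    return total / len(sample_rank)

def identify_by_rank(tokens, language_ranks, max_n=5, top_k=400):
    """Return the language whose ranking is closest to the sample ranking."""
    sample = build_ranking(tokens, max_n, top_k)
    return min(language_ranks,
               key=lambda lang: rank_distance(sample, language_ranks[lang]))
```

Because only ranks are stored, not probabilities, the higher-order n-grams need no smoothing; whether that matches the paper's exact distance formula is not stated in the abstract.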

An n-gram-based language identification algorithm for variable-length and variable-language texts

The aim of this paper is to describe a new language identification method that uses language models based on character statistics, or more specifically, character n-gram frequency tables or Markov chains. An important advantage of this method is that it uses a very simple and straightforward algorithm, similar to those that have been used for this purpose for the past 20 years. In addition, it can handle input such as target texts in an unknown language or in more than one language, which traditional approaches inherently classify incorrectly. We systematically compare and contrast our method with others proposed in the literature, and measure its accuracy in a series of experiments. These experiments demonstrate that our solution works not only for whole documents but also delivers usable results for input strings as short as a single word, and that the identification rate reaches 99.9% for strings of about 100 characters, i.e. the length of a short sentence.

Language identification based on a discriminative text categorization technique

In this paper, we describe new results and improvements to a language identification (LID) system based on PPRLM previously introduced in [1] and [2]. In this case, we use as parallel phone recognizers the ones provided by the Brno University of Technology for Czech, Hungarian, and Russian, and instead of traditional n-gram language models we use a language model built from a ranking of the most frequent and discriminative n-grams. In this approach, the distance between the ranking of the input sentence and the ranking of each language is computed from the difference in relative position of each n-gram. The approach can reliably model longer-span information than traditional language models, yielding more reliable estimates. We also describe the modifications that we have introduced over time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clustering technique for the ranking scores. Results show that this technique provides a 12.9% relative improvement over PPRLM. Finally, we also describe results where the traditional PPRLM and our ranking technique are combined.
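Two of the modifications listed above, suppressing repeated consecutive phones and ranking by a discriminative value, can be sketched as follows. The abstract does not give the discriminative formulas, so the score used here (relative frequency in the target language minus the mean relative frequency in the other languages) is a hypothetical stand-in, as are all function names.

```python
from collections import Counter
from itertools import groupby

def collapse_repeats(phones):
    """Drop consecutive duplicates: a a b b a -> a b a."""
    return [phone for phone, _ in groupby(phones)]

def relative_freqs(phones, n):
    """Relative frequency of each n-gram in a phone sequence."""
    grams = [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}

def discriminative_ranking(corpora, target, n=2, top_k=100):
    """Rank the target language's n-grams by an assumed discriminative score.

    corpora maps a language name to its list of phone labels.
    Score (an assumption, not the paper's formula):
        freq in target  -  mean freq in the other languages.
    """
    freqs = {lang: relative_freqs(collapse_repeats(phones), n)
             for lang, phones in corpora.items()}
    others = [lang for lang in corpora if lang != target]

    def score(gram):
        if not others:
            return freqs[target][gram]
        rival = sum(freqs[lang].get(gram, 0.0) for lang in others) / len(others)
        return freqs[target][gram] - rival

    # Highest score first; ties broken by the n-gram itself for determinism.
    ordered = sorted(freqs[target], key=lambda g: (-score(g), g))
    return ordered[:top_k]
```

With this score, n-grams that are equally common in every language sink to the bottom of the template, which is the general effect a discriminative ranking is meant to have.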

Language Identification Based on N-Gram Feature Extraction Method by Using Classifiers

IU Journal of Electrical Electronics Engineering, 2013

The growing opportunities for communication have given us many documents in many different languages. Language identification plays a key role in making these documents understandable and in studying natural language identification procedures. The increasing number of documents and the demands of international communication make new work on language identification necessary. To date, a great number of studies have addressed document-based language identification. In these studies, characters, words, and n-gram sequences have been used with machine learning techniques. In this study, sequences of n-gram frequencies are used, and the accuracy of five different classification algorithms is analyzed on documents of different sizes belonging to 15 different languages. An n-gram-based feature method is used to extract the feature vector for each language. The most appropriate method for language identification is determined by comparing the performance of Support Vector Machines, Multilayer Perceptron, Centroid Classifier, k-Means, and Fuzzy C-Means. In the experiments, training and testing data are selected from the ECI multilingual corpus.
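Of the five classifiers compared, the centroid classifier is the simplest to illustrate. The sketch below is an assumption-laden toy, not the study's setup: it uses character bigram relative frequencies as the feature vector and cosine similarity to the per-language centroid, and the training sentences are invented examples rather than ECI corpus data.

```python
import math
from collections import Counter

def char_ngram_vector(text, n=2):
    """Sparse vector of relative character n-gram frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1
    return {g: c / total for g, c in Counter(grams).items()}

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    acc = {}
    for vec in vectors:
        for gram, weight in vec.items():
            acc[gram] = acc.get(gram, 0.0) + weight
    return {gram: s / len(vectors) for gram, s in acc.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_centroids(documents, n=2):
    """documents maps a language name to a list of training texts."""
    return {lang: centroid([char_ngram_vector(t, n) for t in texts])
            for lang, texts in documents.items()}

def classify_centroid(text, centroids, n=2):
    """Return the language whose centroid is most similar to the text."""
    vec = char_ngram_vector(text, n)
    return max(centroids, key=lambda lang: cosine(vec, centroids[lang]))
```

For example, `train_centroids({"en": [...], "de": [...]})` followed by `classify_centroid("dem haus", centroids)` picks the language whose average bigram profile points in the most similar direction.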

Modeling Text Language Identification for Ethiopian Cushitic Languages

In the past decade, successful natural language processing applications such as spelling and grammar correction in word processors, machine translation on the web, email spam detection, automatic question answering, and identification of language in textual or speech form have become part of our everyday experience. Various approaches can be used for text-based language identification. A purely linguistic approach would be the best candidate where high classification accuracy is needed, though it requires a large amount of linguistic expertise. In this research, n-gram frequency rank order and Naïve Bayes (NB) were compared as language identifiers for Ethiopian Cushitic languages. Two feature sets were compared for both models: plain character n-grams, and n-grams together with their location in a word, the latter being one of the contributions of this research. Higher identification accuracy was achieved on both models when the n-gram and its location in a word were used as the feature set. The corpus for the study was collected from sources such as TV news websites, the Bible, news bulletins, government documents, and documents from the ministry of education, to ensure that the corpus spans various domains. The WebCorp tool was used to collect the corpus from news websites. After data cleaning, the size of the collected text corpus per language varied from 71,712 words for Afar to 150,000 words for Somali. A learning-curve analysis, using various training set sizes against a fixed test set, was carried out to determine the corpus size required for the experiment. Document windows of 15, 100, and 300 characters were used to evaluate the models. For test strings of 15 characters, the Naïve Bayes classifier achieved an accuracy of 99.55% on the character n-gram feature set and 99.78% on the feature set of character n-grams with their location in a word. For test strings longer than 100 characters, the identification accuracy of NB is 100% for both feature sets. For test strings of 300 characters, using frequency rank order as the classifier, an accuracy of 63.55% was achieved on the character n-gram feature set and 86.78% on the feature set of character n-grams with their location in a word.
Key words: language identification, feature set, Naïve Bayes, n-gram, corpus, n-gram-location
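The "n-gram and its location in a word" feature set can be pictured with a small sketch. The abstract does not say how location is encoded, so the coarse labels used here (start, middle, end, whole) are an assumption, as are the function names, the uniform class prior, and the add-one smoothing in the Naïve Bayes scorer.

```python
import math
from collections import Counter

def located_ngrams(text, n=2):
    """Yield (n-gram, location) pairs for every word in the text.

    Location is a coarse, assumed encoding: "whole" when the n-gram
    is the entire word, otherwise "start", "middle", or "end".
    """
    for word in text.lower().split():
        last = len(word) - n  # index of the final n-gram in this word
        for i in range(last + 1):
            if last == 0:
                loc = "whole"
            elif i == 0:
                loc = "start"
            elif i == last:
                loc = "end"
            else:
                loc = "middle"
            yield (word[i:i + n], loc)

def train_nb(corpora, n=2):
    """corpora maps a language name to one training string."""
    counts = {lang: Counter(located_ngrams(text, n))
              for lang, text in corpora.items()}
    vocab = set().union(*counts.values())
    return counts, len(vocab)

def classify_nb(text, model, n=2):
    """Multinomial Naive Bayes with add-one smoothing and a uniform prior."""
    counts, vocab_size = model

    def log_likelihood(lang):
        c = counts[lang]
        total = sum(c.values())
        return sum(math.log((c[f] + 1) / (total + vocab_size))
                   for f in located_ngrams(text, n))

    return max(counts, key=log_likelihood)
```

Dropping the location label from each pair turns this back into the plain character n-gram feature set, which makes the two feature sets easy to compare on the same classifier.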

Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages

International Journal on Advances in ICT for Emerging Regions (ICTer), 2009

Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram orders and a mixed n-gram model affect the relative performance and accuracy of language identification. The n-gram based algorithm used in this paper does not depend on n-gram frequency. Instead, it uses a Boolean method to determine the outcome of matching target n-grams against training n-grams. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. Identifying these three properties is important because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved a correct identification rate of up to 99.59% on selected languages. The results also show that the performance of language identification can be improved by using a mixed n-gram model of bigrams and trigrams. The mixed n-gram model consumed less disk space and computing time than a trigram model.
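A frequency-free, Boolean matching scheme of the kind described here might look like the sketch below. The details are assumptions rather than the paper's algorithm: n-grams are taken over raw bytes so that the encoding is part of what gets matched, models are keyed by a hypothetical (language, encoding) pair, and the score is simply the share of distinct target n-grams found in the training set.

```python
def byte_ngram_set(data, orders=(2, 3)):
    """Set of distinct byte n-grams for a mixed bigram/trigram model."""
    return {data[i:i + n]
            for n in orders
            for i in range(len(data) - n + 1)}

def train_boolean_models(samples, orders=(2, 3)):
    """samples maps (language, encoding) to training text (a str)."""
    return {key: byte_ngram_set(text.encode(key[1]), orders)
            for key, text in samples.items()}

def match_rate(target, training_set, orders=(2, 3)):
    """Share of the target's distinct n-grams present in the training set.

    Each n-gram contributes a Boolean hit or miss; counts are ignored.
    """
    grams = byte_ngram_set(target, orders)
    if not grams:
        return 0.0
    return sum(g in training_set for g in grams) / len(grams)

def identify_boolean(target, models, orders=(2, 3)):
    """Return the (language, encoding) key with the highest match rate."""
    return max(models, key=lambda key: match_rate(target, models[key], orders))
```

Since the models are plain sets, changing `orders` to `(3,)` gives a trigram-only variant for comparison with the mixed model.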

Language identification of short text segments with n-gram models

2009

There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task in which the test samples have only 5-21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking method by . For the n-gram models, we test several standard smoothing techniques, including the current state of the art, modified Kneser-Ney interpolation. Experiments are conducted with 281 languages using the Universal Declaration of Human Rights. Advanced language model smoothing techniques improve the identification accuracy, and the respective classifiers outperform the ranking method. The higher accuracy is obtained at the cost of larger models and slower classification speed. However, there are several methods to reduce the size of an n-gram model, and our experiments with model pruning show that it provides an easy way to balance model size and identification accuracy. We also compare the results to the language identifier in the Google AJAX Language API, using a subset of 50 languages.
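Scoring a short string with per-language character n-gram models can be sketched as below. This is a deliberately simplified illustration: it uses a bigram model with additive (add-k) smoothing rather than the modified Kneser-Ney interpolation the paper evaluates, assumes a uniform prior over languages, and the names, the boundary symbol, and the value of `k` are all hypothetical choices.

```python
import math
from collections import Counter

BOUNDARY = "\x02"  # assumed marker for the start and end of a string

def train_bigram_lm(text):
    """Count character bigrams and their left contexts in a training text."""
    padded = BOUNDARY + text + BOUNDARY
    bigrams = Counter(zip(padded, padded[1:]))
    contexts = Counter(padded[:-1])
    return bigrams, contexts

def log_prob(text, model, alphabet_size, k=0.1):
    """Log-probability of text under an add-k smoothed bigram model."""
    bigrams, contexts = model
    padded = BOUNDARY + text + BOUNDARY
    return sum(
        math.log((bigrams[(a, b)] + k) / (contexts[a] + k * alphabet_size))
        for a, b in zip(padded, padded[1:])
    )

def identify_short(text, models, k=0.1):
    """Return the language whose model gives the text the highest probability."""
    # Shared alphabet: every character seen in training or in the test string.
    alphabet = {BOUNDARY} | set(text)
    for _, contexts in models.values():
        alphabet |= set(contexts)
    size = len(alphabet)
    return max(models, key=lambda lang: log_prob(text, models[lang], size, k))
```

Swapping `log_prob` for a better-smoothed estimator leaves `identify_short` unchanged, which is roughly how different smoothing techniques can be compared on the same task.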

A comparative study on language identification methods

… ELRA). http://www. lrec-conf. org …, 2008

In this paper we present two experiments conducted to compare different language identification algorithms. Short-word-, frequent-word- and n-gram-based approaches are considered and combined with the Ad-Hoc Ranking classification method. The language identification ...

Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts

Language identification is a classification task between a pre-defined model and a text in an unknown language. This paper presents the implementation of a tool for language identification in mono- and multi-lingual documents. The tool includes four algorithms for language identification. An evaluation is presented for eight languages, including Ukrainian and Russian, and for various text lengths. The results show that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The tool can also identify language changes within a single multi-lingual document.