Language Identification on the Web: Extending the Dictionary Method
Abstract
Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character _n_-grams are in use, mainly with identification based on Markov models or on character _n_-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts and the correct handling of texts of unknown language and of texts composed of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to efficiently and automatically segment the input text into blocks of individual languages in the case of multilingual documents.
Author information
Authors and Affiliations
- Masaryk University in Brno, Czech Republic
Radim Řehůřek
- Seznam.cz, a.s., Czech Republic
Milan Kolkus
Editor information
Editors and Affiliations
- National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, Mexico
Alexander Gelbukh
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Řehůřek, R., Kolkus, M. (2009). Language Identification on the Web: Extending the Dictionary Method. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_29
- DOI: https://doi.org/10.1007/978-3-642-00382-0_29
- Publisher Name: Springer, Berlin, Heidelberg
- Print ISBN: 978-3-642-00381-3
- Online ISBN: 978-3-642-00382-0
- eBook Packages: Computer Science (R0)