Language Identification

Language Identification Based on N-Gram Feature Extraction Method by Using Classifiers

IU Journal of Electrical & Electronics Engineering, 2013

The rising opportunities for communication have provided us with documents in many different languages. Language identification plays a key role in making these documents understandable and in studying natural language identification procedures. The increasing number of documents and the requirements of international communication make new work on language identification necessary. Until today, a great number of studies have addressed the language identification problem for documents. In these studies, characters, words and n-gram sequences have been used with machine learning techniques. In this study, sequences of n-gram frequencies will be used, and the accuracy of five different classification algorithms will be analyzed on documents of different sizes belonging to 15 different languages. An n-gram based feature method will be used to extract the feature vectors of the languages. The most appropriate method for the problem of language identification will be identified by comparing the performances of the Support Vector Machine, Multilayer Perceptron, Centroid Classifier, k-Means and Fuzzy C-Means methods. During the experiments, training and testing data will be selected from the ECI multilingual corpus.
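As an illustration of the character n-gram frequency approach described above, the sketch below builds n-gram count vectors with scikit-learn and fits three of the supervised classifiers named in the abstract. The toy sentences, the 1- to 3-gram range, and the classifier settings are assumptions for demonstration only; they do not reproduce the ECI corpus experiments, and the clustering methods (k-Means, Fuzzy C-Means) are omitted.

```python
# A minimal sketch of n-gram frequency features for language identification,
# assuming a toy corpus; it does not reproduce the paper's ECI corpus setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import NearestCentroid

# Toy training data: one short sample per language (the study used 15 languages).
texts = ["the quick brown fox jumps over the lazy dog",
         "der schnelle braune fuchs springt ueber den faulen hund",
         "le renard brun rapide saute par dessus le chien paresseux"]
labels = ["en", "de", "fr"]

# Character n-gram counts (1- to 3-grams) serve as the feature vector per document.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X = vectorizer.fit_transform(texts)

# Three of the five compared classifiers; k-Means and Fuzzy C-Means are omitted here.
for clf in (LinearSVC(), MLPClassifier(max_iter=500), NearestCentroid()):
    clf.fit(X, labels)
    print(type(clf).__name__,
          clf.predict(vectorizer.transform(["ein brauner hund springt"])))
```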

Modeling Text Language Identification for Ethiopian Cushitic Languages

In the past decade, successful natural language processing applications such as spelling and grammar correction in word processors, machine translation on the web, email spam detection, automatic question answering, and identification of language in textual or speech form have become part of our everyday experience. There are various approaches that can be used in text-based language identification. A pure linguistic approach would be the best candidate where high classification accuracies are needed, though it requires a large amount of linguistic expertise. In this research, n-gram frequency rank order and Naïve Bayes were compared as language identifiers for Ethiopian Cushitic languages. The frequency of an n-gram together with its location in a word, which is one of the contributions of this research, and plain n-gram feature sets were compared for both models. A higher identification accuracy rate was achieved on both models when the n-gram and its location in a word were used as the feature set. The corpus for the study was collected from sources such as TV news websites, the Bible, news bulletins, government documents, and documents from the ministry of education to ensure the corpus spans various domains. The WebCorp tool was used to collect corpus from news websites. Per language, the size of the collected text corpus after data cleaning varied from 71,712 words for Afar to 150,000 words for Somali. A learning curve analysis was made using various training set sizes against a fixed test set to determine the size of corpus required for the experiment. Test windows of 15, 100, and 300 characters were used to evaluate the models. For a test string of 15 characters, accuracy of 99.55% on the character n-gram feature set and 99.78% on the character n-gram and its location in a word feature set was achieved for the Naïve Bayes classifier. The identification accuracy rate of NB for both feature sets is 100% when the test string size is more than 100 characters. For a test string of 300 characters using frequency rank order as a classifier, accuracy of 63.55% on the character n-gram feature set and 86.78% on the character n-gram and its location in a word feature set was achieved. Keywords: language identification, feature set, Naïve Bayes, n-gram, corpus, n-gram-location
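The thesis's "n-gram and its location in a word" feature is not fully specified in the abstract; the sketch below assumes one plausible encoding, emitting a gram@position token for each character n-gram and feeding the counts to multinomial Naive Bayes. The training strings are placeholders, and the encoding, smoothing, and corpora are illustrative assumptions rather than the work's exact setup.

```python
# A minimal sketch, assuming "n-gram and its location in a word" is encoded as
# gram@position tokens for a multinomial naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def ngram_location_features(text, n=3):
    """Emit one token per character n-gram, tagged with its start position in the word."""
    feats = []
    for word in text.lower().split():
        for i in range(max(len(word) - n + 1, 1)):
            feats.append(f"{word[i:i + n]}@{i}")
    return " ".join(feats)

# Placeholders: replace with real Afaan Oromo / Somali training text.
train_texts = ["placeholder oromo training text", "placeholder somali training text"]
train_langs = ["orm", "som"]

vec = CountVectorizer(tokenizer=str.split, lowercase=False)
X = vec.fit_transform(ngram_location_features(t) for t in train_texts)
clf = MultinomialNB().fit(X, train_langs)

print(clf.predict(vec.transform([ngram_location_features("placeholder test string")])))
```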

Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts

Language identification is a classification task that matches a text in an unknown language against pre-defined language models. This paper presents the implementation of a tool for language identification for mono- and multi-lingual documents. The tool includes four algorithms for language identification. An evaluation for eight languages, including Ukrainian and Russian, and various text lengths is presented. It is shown that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The tool can also identify language changes within one multi-lingual document.
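The abstract does not explain how language changes inside a single document are located, so the following is only a minimal sketch of one plausible approach: classify a sliding character window with any monolingual identifier and report the offsets where the predicted language changes. The window and step sizes and the classify callable are assumptions, not the tool's actual segmentation algorithm.

```python
# A minimal sketch of flagging language changes inside one multi-lingual document
# by classifying a sliding character window (assumed approach, not the paper's).
def detect_language_changes(text, classify, window=120, step=60):
    """classify: callable mapping a text chunk to a language label."""
    change_points, previous = [], None
    for start in range(0, max(len(text) - window + 1, 1), step):
        lang = classify(text[start:start + window])
        if lang != previous:
            change_points.append((start, lang))
            previous = lang
    return change_points  # list of (character offset, detected language)
```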

Performance comparison between naive bayes and k-nearest neighbor algorithm for the classification of Indonesian language articles

IAES International Journal of Artificial Intelligence, 2021

The match between the contents of an article and the article theme is the main factor in whether or not an article is accepted. Many people still find it difficult to determine the theme appropriate to the article they have. For that reason, we need a document classification algorithm that can group articles automatically and accurately. Many classification algorithms can be used. The algorithm used in this study is naïve Bayes, and the k-nearest neighbor algorithm is used as the baseline. The naïve Bayes algorithm was chosen because it can produce maximum accuracy with little training data, while the k-nearest neighbor algorithm was chosen because it is robust against data noise. The performance of the two algorithms is compared, so it can be seen which algorithm is better at classifying documents. The results obtained show that the naïve Bayes algorithm has better performance with an accuracy rate of 88%, while the k-nearest neighbor algorithm has ...
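The sketch below illustrates the kind of comparison described above, pairing TF-IDF features with naïve Bayes and k-nearest neighbor. The toy Indonesian-style snippets, theme labels, and parameter choices are assumptions for demonstration; the study's article corpus, preprocessing, and reported 88% accuracy are not reproduced here.

```python
# A minimal sketch comparing naive Bayes and k-NN on toy article snippets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["jaringan saraf tiruan untuk klasifikasi citra",
        "algoritma genetika untuk optimasi jadwal",
        "deteksi objek dengan pembelajaran mendalam",
        "pencarian rute terpendek dengan metode heuristik"]
themes = ["kecerdasan buatan", "optimasi", "kecerdasan buatan", "optimasi"]

for name, clf in [("naive Bayes", MultinomialNB()),
                  ("k-nearest neighbor", KNeighborsClassifier(n_neighbors=1))]:
    model = make_pipeline(TfidfVectorizer(), clf).fit(docs, themes)
    print(name, model.predict(["klasifikasi citra dengan jaringan saraf"]))
```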

Letter Based Text Scoring Method for Language Identification

Lecture Notes in Computer Science, 2004

In recent years, an unexpected amount of growth has been observed in the volume of text documents on the internet, intranets, digital libraries and newsgroups. Obtaining useful information and meaningful patterns from these documents is an important issue. Identification of the languages of these text documents is an important problem that has been studied by many researchers. In these studies, words (terms) have generally been used for language identification. Researchers have investigated different approaches, such as linguistic and statistical ones. In this work, a Letter Based Text Scoring Method is proposed for language identification. This method is based on the letter distributions of texts. Text scoring is performed to identify the language of each text document, and text scores are calculated using the letter distribution of the new text document. Besides its acceptable accuracy, the proposed method is simpler and faster than short-term and n-gram methods.
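The abstract does not give the exact scoring formula, so the sketch below shows one plausible reading of letter-based text scoring: estimate a letter-frequency distribution per language and score a document by the summed log-probabilities of its letters. The training strings, the smoothing floor, and the argmax decision rule are illustrative assumptions, not the proposed method's exact definition.

```python
# A minimal sketch of letter-distribution scoring (assumed formulation).
import math
from collections import Counter

def letter_model(training_text):
    counts = Counter(ch for ch in training_text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def letter_score(text, model, floor=1e-6):
    # Sum of log-probabilities of the document's letters under one language model.
    return sum(math.log(model.get(ch, floor)) for ch in text.lower() if ch.isalpha())

models = {"en": letter_model("the quick brown fox jumps over the lazy dog"),
          "tr": letter_model("pijamalı hasta yağız şoföre çabucak güvendi")}

doc = "which language is this short text written in"
print(max(models, key=lambda lang: letter_score(doc, models[lang])))
```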

Automated Text Categorization with Machine Learning and its Application in Multilingual Text Categorization

The automated categorization (or classification) of texts into predefined categories is one of the booming fields of text mining. Nowadays the availability of digital data is very high, and managing it in predefined categories becomes a challenging task. Machine learning is a technique by which we can build an automated classifier to classify documents with minimal human assistance. The advantages of this approach over the knowledge engineering approach are effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This paper discusses the Naïve Bayes, Rocchio and kNN methods within the machine learning paradigm for automated text categorization of documents into predefined categories. We also discuss multilingual text categorization, which consists in classifying documents in different languages according to the same classification tree.
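As a small illustration of the Rocchio method named above, the sketch below represents each category by the mean TF-IDF vector of its training documents and assigns a new document to the nearest centroid. The toy English snippets and category labels are assumptions for demonstration, not the paper's experimental setup.

```python
# A minimal sketch of a Rocchio-style (centroid-based) text categorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

docs = ["stock markets fell sharply today",
        "the central bank raises interest rates",
        "the team won the championship final",
        "the striker scored twice in the match"]
cats = ["finance", "finance", "sports", "sports"]

# Each category centroid is the mean TF-IDF vector of its documents.
rocchio = make_pipeline(TfidfVectorizer(), NearestCentroid()).fit(docs, cats)
print(rocchio.predict(["the bank cuts rates again", "a late goal won the match"]))
```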

Application of Naïve Bayes, Decision Tree, and K-Nearest Neighbors for Automated Text Classification

Modern Applied Science, 2019

Nowadays, many applications that use large data have been developed due to the existence of the Internet of Things. These applications are translated into different languages and require automated text classification (ATC). The ATC process depends on the content of one or more predefined classes. However, this process is problematic for the Arabic translation of the data. This study aims to solve this issue by investigating the performances of three classification algorithms, namely, k-nearest neighbor (KNN), decision tree (DT), and naïve Bayes (NB) classifiers, on Saudi Press Agency datasets. Results showed that the NB algorithm outperformed DT and KNN algorithms in terms of precision, recall, and F1. In future works, a new algorithm that can improve the handling of the ATC problem will be developed.
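The evaluation described above reports precision, recall, and F1 for the three classifiers. The sketch below shows how such macro-averaged scores can be computed from test labels and each classifier's predictions; the labels and predictions here are invented placeholders, not the Saudi Press Agency results.

```python
# A minimal sketch of comparing classifiers by macro precision, recall, and F1.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["politics", "economy", "sports", "economy", "politics", "sports"]
predictions = {
    "NB":  ["politics", "economy", "sports", "economy", "politics", "sports"],
    "DT":  ["politics", "sports",  "sports", "economy", "economy",  "sports"],
    "KNN": ["economy",  "economy", "sports", "economy", "politics", "politics"],
}
for name, y_pred in predictions.items():
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```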

Language Identification Using Combination of Machine Learning Algorithms and Vectorization Techniques

IEEE Xplore, 2022

Language Identification refers to the process of ascertaining and discerning the language found in a particular text or document. In this work, approaches for language identification using machine learning algorithms and vectorization methods have been compared and contrasted. Three machine learning algorithms, along with two vectorization techniques, have been used. The ML algorithms used are Naïve Bayes, Logistic Regression, and SVM (Support Vector Machine), and the vectorization techniques used are Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectorizer (Bag of Words, BoW). This research puts forward a contrast and comparison of the above-mentioned classification algorithms and vectorization methods. It is also a web development-based work.
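The sketch below illustrates the grid implied by the abstract: each vectorizer (TF-IDF and bag-of-words counts) paired with each classifier (naïve Bayes, logistic regression, SVM). The toy multilingual sentences, the character n-gram analyzer, and all parameter settings are assumptions for demonstration and do not reflect the paper's dataset or web application.

```python
# A minimal sketch of the vectorizer-by-classifier comparison (assumed toy data).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["this is an english sentence", "ceci est une phrase en français",
         "dies ist ein deutscher satz", "esta es una frase en español"]
langs = ["en", "fr", "de", "es"]

for vec in (TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 2)),
            CountVectorizer(analyzer="char_wb", ngram_range=(1, 2))):
    for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
        model = make_pipeline(vec, clf).fit(texts, langs)
        print(type(vec).__name__, type(clf).__name__,
              model.predict(["une autre phrase en français"]))
```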

Optimizing n-gram Order of an n-gram Based Language Identification Algorithm for 68 Written Languages

International Journal on Advances in ICT for Emerging Regions (ICTer), 2009

Language identification technology is widely used in the domains of machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages. However, the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n-gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how n-gram orders and a mixed n-gram model affect the relative performance and accuracy of language identification. The n-gram based algorithm used in this paper does not depend on n-gram frequency. Instead, the algorithm is based on a Boolean method to determine the output of matching target n-grams to training n-grams. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. It is important to identify these three properties because a language can be written in different scripts and encoded with different character encoding schemes. The experimental results show that in one test the algorithm achieved up to a 99.59% correct identification rate on selected languages. The results also show that the performance of language identification can be improved by using a mixed n-gram model of bigrams and trigrams. The mixed n-gram model consumed less disk space and computing time compared to a trigram model.
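To make the Boolean, frequency-independent matching idea concrete, the sketch below assumes the score is simply the fraction of a target text's n-grams (bigrams and trigrams, mirroring the mixed model) that are present in a language's training n-gram set. The exact matching rule, the training texts, and the script/encoding detection described in the paper are not reproduced; this is an illustrative assumption only.

```python
# A minimal sketch of Boolean (presence-based) n-gram matching with a mixed
# bigram+trigram model (assumed scoring rule).
def ngrams(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def boolean_score(target, train_ngrams, orders=(2, 3)):
    target_ngrams = set().union(*(ngrams(target, n) for n in orders))
    if not target_ngrams:
        return 0.0
    # Fraction of the target's n-grams found in the training set; no frequencies used.
    return len(target_ngrams & train_ngrams) / len(target_ngrams)

train = {
    "en": set().union(ngrams("the quick brown fox jumps over the lazy dog", 2),
                      ngrams("the quick brown fox jumps over the lazy dog", 3)),
    "ms": set().union(ngrams("semua manusia dilahirkan bebas dan samarata", 2),
                      ngrams("semua manusia dilahirkan bebas dan samarata", 3)),
}
text = "a lazy brown dog"
print(max(train, key=lambda lang: boolean_score(text, train[lang])))
```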

Intelligent Arabic Text Categorization: Initial Study and Proposed Methodology on Classifying Arabic Text Using Enhanced Naïve Bayes Classification Approach

Journal of Advanced Research in Dynamical and Control Systems, 2018

The number of Arabic-speaking users in the world is increasing rapidly, in depth and breadth, largely due to the growing impact of internet-based resources. Because of this increase in Arabic users on the internet, computational techniques are needed to categorize Arabic text in the same way as English. The Arabic language is complex in nature compared to other spoken and written languages, so it requires detailed research investigation into root extraction and text classification approaches for text datasets with single or multiple labels. The main objective of this paper is to propose a new study to improve the automated processing of Arabic texts. Several machine learning approaches exist; for this research, specifically, an improved naïve Bayes classifier is applied. In the first step, the unclassified documents are pre-processed by removing punctuation and stop words. Second, after pre-processing, each document is represented by a vector of words and frequencies, as required by the naïve Bayes classifier approach. Third, stemming was applied to reduce the feature vector dimensionality. Fourth, classification is applied to categorize the Arabic text. The proposed work is an initial study, and a basic experiment was conducted with an in-house Arabic text collection (i.e., a self-developed Arabic corpus). Based on this initial study using the naïve Bayes approach for Arabic text categorization, the results of the classifier were promising compared to existing classifiers in terms of accuracy, precision, recall and error rates.
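The sketch below follows the four steps listed above: punctuation and stop-word removal, word-frequency vectors, light stemming, and naïve Bayes classification. The tiny stop-word list, the definite-article stripping used as "stemming", and the toy documents are placeholders; the paper's in-house corpus and enhanced naïve Bayes variant are not reproduced here.

```python
# A minimal sketch of the described Arabic text categorization pipeline (assumed details).
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

ARABIC_STOPWORDS = {"في", "من", "على", "إلى"}            # tiny placeholder stop-word list

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text)                  # 1. remove punctuation
    tokens = [t for t in text.split() if t not in ARABIC_STOPWORDS]   # 2. remove stop words
    tokens = [re.sub(r"^ال", "", t) for t in tokens]       # 3. very light "stemming": strip definite article
    return " ".join(tokens)

docs = ["الفريق فاز في المباراة النهائية", "البنك المركزي يرفع أسعار الفائدة"]
labels = ["رياضة", "اقتصاد"]

# 4. word-frequency vectors + naive Bayes classification
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit([preprocess(d) for d in docs], labels)
print(model.predict([preprocess("فوز كبير في المباراة")]))
```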