Issues and Challenges in Developing Statistical POS Taggers for Sambalpuri

Manipuri POS Tagging using CRF and SVM: A Language Independent Approach

Part-of-speech (POS) tagging is an important component of almost all Natural Language Processing (NLP) application areas. Applying machine-learning techniques to less computerized languages requires the development of an appropriately tagged corpus. In this paper, we have developed POS taggers for Manipuri, a less privileged language, using Conditional Random Field (CRF) and Support Vector Machine (SVM). We have manually annotated approximately 63,200 tokens, collected from written texts, with a POS tagset of 26 tags defined for the Indian languages. The POS taggers make use of different contextual and orthographic word-level features. These features are language independent and applicable to other languages as well. The POS taggers have been trained and tested with 39,449 and 8,672 tokens, respectively. Evaluation results demonstrated accuracies of 72.04% and 74.38% for the CRF and SVM, respectively.
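To make the idea of language-independent contextual and orthographic word-level features concrete, here is a minimal sketch in Python. The abstract does not list the actual features, so the specific choices (3-character affixes, a one-word context window, a digit flag) and the function names `word_features` and `sentence_features` are assumptions for illustration, not the authors' feature set.

```python
def word_features(tokens, i):
    """Return surface-level features for tokens[i].

    Assumption: the feature choices below are illustrative only; they use
    no language-specific resources (no lexicon, no morphological analyser).
    """
    word = tokens[i]
    return {
        # Orthographic features taken from the word form itself.
        "word": word,
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "length": len(word),
        "has_digit": any(ch.isdigit() for ch in word),
        # Contextual features from the neighbouring tokens.
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Build one feature dict per token in a sentence."""
    return [word_features(tokens, i) for i in range(len(tokens))]
```

Feature dicts of this kind can be fed to either a CRF or an SVM toolkit after the usual encoding step, which is one reason the same features can be shared across both learners.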

Training & Evaluation of POS Taggers in Indo-Aryan Languages: A Case of Hindi, Odia and Bhojpuri

Proceedings of 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, ISBN No-978-83-932640-8-7, 2015

The present paper discusses the training and evaluation of the CRF and SVM algorithms for three Indo-Aryan languages: Hindi, Odia and Bhojpuri. For annotation of the corpus, we have used the Bureau of Indian Standards (BIS) annotation scheme, which is a common standard of annotation for Indian languages. The main objective of the paper is to provide an idea of the error patterns and suggestions following the same algorithms. The experiment is conducted with 90k tokens of training data and 2k tokens of test data for each language, for ease of comparison among languages. In the evaluation report, we focus on each tool (SVM and CRF++) at the level of accuracy, error analysis of the tools, the error pattern and the common errors of the system. The accuracy of the SVM taggers ranges from 88% to 93.7%, whereas that of the CRF taggers ranges from 82% to 86.7%. CRF performs worse than SVM for Odia and Hindi, which is not true for Bhojpuri. In this study, we have observed that languages having more variations are better suited to CRF than to SVM.

A Hybrid POS Tagger for Khasi, an Under Resourced Language

International Journal of Advanced Computer Science and Applications

Khasi is an Austro-Asiatic language spoken mainly in the state of Meghalaya, India, and can be considered an under-resourced and under-studied language from the natural language processing perspective. Part-of-speech (POS) tagging, in which a part of speech is assigned automatically to each word in a sentence, is one of the major initial requirements in any natural language processing task. Therefore, it is only natural to initiate the development of a POS tagger for Khasi, and this paper presents the construction of a Hybrid POS tagger for Khasi. The tagger is developed to address the tagging errors of a Khasi Hidden Markov Model (HMM) POS tagger by integrating conditional random fields (CRF). This integration incorporates language features which are otherwise not feasible in an HMM POS tagger. The results of the Hybrid Khasi tagger show a significant improvement in the tagger's accuracy as well as a substantial reduction in most of the tagging confusion of the HMM POS tagger.

ODIA PARTS OF SPEECH TAGGING CORPORA: SUITABILITY OF STATISTICAL MODELS

This study focusses on developing statistical POS taggers for Odia using two distinct algorithms: CRF (probability) and SVM (classifier). Approximately 400k tokens have been used to develop both of them, with the training and testing data amounting to 236k and 123k tokens respectively. For annotating the whole ILCI corpus, the BIS annotation scheme has been taken into consideration with some modifications. So far as the experimental setup is concerned, similar features have been selected to train both the models. Evaluation has been conducted on the precision and recall measures for CRF and on known-unknown words accuracy for SVM. A comprehensive error analysis has been conducted to figure out the types of errors committed by both in common, based on which 5-fold manual error correction and final evaluation have been conducted. After identifying and discussing the issues, different solutions have been proposed: formulation of linguistic rules, corpus-driven methods, word sense disambiguation, and application of external tools like NER, WSD and a morph analyser. Finally, the taggers are made available online using JSP and JST technology. Both the taggers, CRF++ (94.39 and 88.87) and SVM (96.85 and 93.59), have outperformed the existing Odia POS taggers in terms of both reliability and accuracy. To ensure the quality of the output, an inter-annotator (IA) agreement study has been conducted.
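The known-unknown words accuracy mentioned above is usually computed by splitting test tokens according to whether they occurred in the training data. The sketch below illustrates that split under stated assumptions: the function name `known_unknown_accuracy` and its inputs (a training vocabulary plus parallel lists of test tokens, gold tags and predicted tags) are hypothetical and are not taken from the study or from any SVM tagging tool.

```python
def known_unknown_accuracy(train_vocab, test_tokens, gold_tags, predicted_tags):
    """Report tagging accuracy separately for known and unknown words.

    Assumption: a word counts as "known" if its surface form appears in
    the training vocabulary, and as "unknown" otherwise.
    """
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for token, gold, pred in zip(test_tokens, gold_tags, predicted_tags):
        bucket = "known" if token in train_vocab else "unknown"
        counts[bucket][1] += 1
        if gold == pred:
            counts[bucket][0] += 1
    # None marks a bucket with no tokens, so no accuracy can be reported.
    return {
        name: (correct / total if total else None)
        for name, (correct, total) in counts.items()
    }
```

Reporting the two figures separately makes it easier to see how much of a tagger's error comes from out-of-vocabulary words rather than from ambiguity among words it has already seen.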

CURRENT STATE OF THE ART POS TAGGING FOR INDIAN LANGUAGES – A STUDY

iaeme

Parts-of-speech (POS) tagging is the basic building block of any Natural Language Processing (NLP) tool. A POS tagger has many applications. For Indian languages especially, POS tagging adds many more dimensions, as most of them are agglutinative, morphologically very rich, highly inflected and sometimes diglossic. Taggers have been developed using linguistic rules, stochastic models or both. This paper is a survey of different POS taggers developed in the recent past for eight Indian languages, namely Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Manipuri and Assamese.

Morphological Richness Offsets Resource Demand - Experiences in Constructing a POS Tagger for Hindi

2006

In this paper we report our work on building a POS tagger for a morphologically rich language, Hindi. The theme of the research is to vindicate the stand that if morphology is strong and harnessable, then lack of training corpora is not debilitating. We establish a methodology of POS tagging which resource-disadvantaged languages (those lacking annotated corpora) can make use of. The methodology makes use of a locally annotated, modestly sized corpus (15,562 words), exhaustive morphological analysis backed by a high-coverage lexicon, and a decision-tree-based learning algorithm (CN2). The evaluation of the system was done with 4-fold cross validation of the corpora in the news domain (www.bbc.co.uk/hindi). The current accuracy of POS tagging is 93.45% and can be further improved.
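The 4-fold cross validation mentioned above can be sketched as follows. This is a generic illustration, not the authors' evaluation script: the names `k_fold_splits` and `cross_validate`, the round-robin assignment of sentences to folds, and the `train_fn` / `accuracy_fn` callbacks are all assumptions introduced for the example.

```python
def k_fold_splits(sentences, k=4):
    """Yield (train, test) pairs, holding out each of k folds once.

    Assumption: sentences are dealt to folds round-robin; a real
    experiment might shuffle or stratify the data first.
    """
    folds = [sentences[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [s for j, fold in enumerate(folds) if j != held_out for s in fold]
        yield train, test


def cross_validate(sentences, train_fn, accuracy_fn, k=4):
    """Average the accuracy over k train/test splits.

    train_fn(train) returns a model; accuracy_fn(model, test) returns
    a float. Both are hypothetical callbacks supplied by the caller.
    """
    scores = []
    for train, test in k_fold_splits(sentences, k):
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / len(scores)
```

With a small corpus such as the one described here, averaging over folds gives a more stable estimate than a single train/test split, since every sentence is used for testing exactly once.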

Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources

sivareddy.in

Indian languages are known to have a large speaker base, yet some of these languages have minimal or non-efficient linguistic resources. For example, Kannada is relatively resource-poor compared to Malayalam, Tamil and Telugu, which in turn are relatively poor compared to Hindi. Many Indian language pairs exhibit high similarities in morphology and syntactic behaviour, e.g. Kannada is highly similar to Telugu. In this paper, we show how to build a cross-language part-of-speech tagger for Kannada exploiting the resources of Telugu. We also build large corpora and a morphological analyser (including lemmatisation) for Kannada. Our experiments reveal that cross-language taggers are as efficient as monolingual taggers. We aim to extend our work to other Indian languages. Our tools are efficient and significantly faster than the existing monolingual tools.

Evaluation Of Hindi & Urdu Pos Tagged Corpus: A Comparative Study

Hindi & Urdu. The system compares automatically annotated corpora with manually annotated corpora (gold standard) and produces both black-box and glass-box evaluation. For the present work, evaluation has been done only for Urdu and Hindi POS annotation, but CorpEvalS can be used for the evaluation of any language. Finally, this paper shows that several measures like accuracy, information retrieval (IR) metrics, the confusion matrix and ambiguous word analysis are essential for the evaluation of POS annotation. Black-box evaluation only sees the final output and its relationship to the original input. It measures a number of parameters related to the quality of the process (speed, reliability) and to the quality of the result (e.g. the accuracy of data annotation).
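The comparison of automatically annotated output against a gold standard, with accuracy, IR-style metrics and a confusion matrix, can be sketched as below. This is not the CorpEvalS implementation, whose internals are not described here; the function names `evaluate` and `precision_recall` and the flat tag-list input format are assumptions made for illustration.

```python
from collections import Counter


def evaluate(gold_tags, predicted_tags):
    """Return (accuracy, confusion) for two parallel tag sequences.

    confusion maps (gold_tag, predicted_tag) pairs to their counts.
    """
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    confusion = Counter(zip(gold_tags, predicted_tags))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    accuracy = correct / len(gold_tags) if gold_tags else 0.0
    return accuracy, confusion


def precision_recall(confusion, tag):
    """Compute precision and recall for one tag from the confusion counts."""
    true_pos = confusion[(tag, tag)]
    predicted = sum(n for (g, p), n in confusion.items() if p == tag)
    actual = sum(n for (g, p), n in confusion.items() if g == tag)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

The off-diagonal entries of the confusion counts (where the gold and predicted tags differ) are the natural starting point for the kind of per-tag error analysis these abstracts describe.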

An Experiment with the CRF++ Parts of Speech (POS) Tagger for Odia

This research work presents a probability-based CRF++ parts of speech (POS) tagger for the Odia language. A corpus of approximately 600k tokens has been annotated manually in the Indian Languages Corpora Initiative (ILCI) project for Odia. The whole Odia corpus has been annotated based on the Bureau of Indian Standards (BIS) tagset developed by the DIT, Govt. of India, with some modifications under the ILCI. The tagger has been trained and tested with 2,36,793 and 1,28,646 tokens respectively. It provides 94.39% accuracy on seen data and 88.87% on the unseen dataset in precision and recall measures. In addition, this study conducts an inter-annotator (IA) agreement study and an error analysis to identify the salient erroneous labels produced by the automatic tagger, and provides various suggestions to improve its efficiency. Furthermore, this study also describes the user-interface architecture and its functionalities.