Training & Evaluation of POS Taggers in Indo-Aryan Languages: A Case of Hindi, Odia and Bhojpuri
Related papers
Manipuri POS Tagging using CRF and SVM: A Language Independent Approach
Part of Speech (POS) tagging is an important component of almost all Natural Language Processing (NLP) application areas. Applying machine-learning techniques to less computerized languages requires the development of an appropriately tagged corpus. In this paper, we have developed POS taggers for Manipuri, a less privileged language, using Conditional Random Field (CRF) and Support Vector Machine (SVM). We have manually annotated approximately 63,200 tokens, collected from written texts, with a POS tagset of 26 tags defined for the Indian languages. The POS taggers make use of different contextual and orthographic word-level features. These features are language independent and applicable to other languages as well. The POS taggers have been trained and tested with 39,449 and 8,672 tokens, respectively. Evaluation results demonstrated accuracies of 72.04% and 74.38% for the CRF and SVM, respectively.
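The abstract does not list the actual features, so the following is only a minimal sketch of what contextual and orthographic word-level features might look like. The function name `word_features`, the feature names, the affix length of 3 and the sample sentence are all illustrative assumptions, not the authors' feature set.

```python
def word_features(tokens, i):
    """Build a feature dictionary for tokens[i].

    Hypothetical feature set: neighbouring words (contextual) plus
    affixes, length and digit checks (orthographic). None of these
    rely on language-specific resources.
    """
    word = tokens[i]
    return {
        "word": word,
        # Contextual features: the surrounding tokens, with boundary markers.
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        # Orthographic features: computed from the surface form only.
        "prefix_3": word[:3],
        "suffix_3": word[-3:],
        "length": len(word),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_first": i == 0,
    }


# Toy sentence, used only to show the shape of the output.
sentence = ["tagger", "accuracy", "is", "74.38", "percent"]
features = [word_features(sentence, i) for i in range(len(sentence))]
print(features[3])
```

Feature dictionaries of this shape can be passed to a CRF or SVM toolkit after whatever encoding that toolkit expects.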
An Experiment with the CRF++ Parts of Speech (POS) Tagger for Odia
This research work presents a probability-based CRF++ parts of speech (POS) tagger for the Odia language. A corpus of approximately 600k tokens has been annotated manually in the Indian Languages Corpora Initiative (ILCI) project for Odia. The whole Odia corpus has been annotated based on the Bureau of Indian Standards (BIS) tagset developed by the DIT, Government of India, with some modifications under the ILCI. The tagger has been trained and tested with 2,36,793 and 1,28,646 tokens respectively. It provides 94.39% accuracy on seen data and 88.87% on the unseen dataset in precision and recall measures. In addition, this study conducts an inter-annotator (IA) agreement study and an error analysis to identify the salient labelling errors made by the automatic tagger, and provides various suggestions to improve its efficiency. Furthermore, this study also describes the user-interface architecture and its functionalities.
Evaluation Of Hindi & Urdu Pos Tagged Corpus: A Comparative Study
Hindi & Urdu. The system compares automatically annotated corpora with manually annotated corpora (gold standard) and produces both black-box and glass-box evaluation. For the present work, evaluation has been done only for Urdu and Hindi POS annotation, but CorpEvalS can be used for the evaluation of any language. Finally, this paper shows that several measures like accuracy, information retrieval (IR) metrics, confusion matrix and ambiguous word analysis are essential for evaluation of POS annotation. Black-box evaluation only sees the final output and its relationship to the original input. It measures a number of parameters related to the quality of the process (speed, reliability) and to the quality of the result (e.g. the accuracy of data annotation).
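As a rough illustration of comparing automatic annotation against a gold standard, the sketch below computes token-level accuracy and a confusion matrix. It is not the CorpEvalS implementation; the function name `evaluate`, the tag labels and the toy sequences are assumptions made for the example.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, confusion) for two aligned tag sequences.

    confusion maps (gold_tag, predicted_tag) pairs to their counts.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot evaluate empty sequences")
    confusion = Counter(zip(gold, predicted))
    correct = sum(n for (g, p), n in confusion.items() if g == p)
    return correct / len(gold), confusion


# Toy data with illustrative tag labels (not taken from the paper).
gold = ["NN", "VM", "NN", "JJ", "PSP"]
auto = ["NN", "VM", "JJ", "JJ", "PSP"]
accuracy, confusion = evaluate(gold, auto)
print(accuracy)
print(confusion[("NN", "JJ")])
```

Off-diagonal entries such as `("NN", "JJ")` show which tag pairs the tagger confuses, which is the kind of detail a single accuracy figure hides.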
Evaluation of SVM-based Automatic Parts of Speech Tagger for Odia
The authors present an SVM-based POS tagger for the Odia language. The tagger has been trained and tested with Indian Languages Corpora Initiative (ILCI) data of 2,36,793 and 1,28,646 tokens respectively, annotated following the Bureau of Indian Standards (BIS) annotation scheme. The evaluation has been undertaken in two parts, statistical and human, guided by two research approaches: quantitative and qualitative. Evaluation results on the precision, recall and F-measure metrics are 93.99%, 92.9971% and 93.49% respectively. As far as the human evaluation is concerned, the agreements are 93.89% (percentage agreement) and 0.87 (Fleiss’ Kappa). Finally, issues and challenges relating to manual annotation and to the statistical tagger are discussed, with a linguistic analysis of errors. On the basis of the evaluation results, it can be stated that the present POS tagger is more efficient than the earlier Odia Neural Network tagger (81%) and the SVM tagger (82%) in terms of both accuracy and reliability of the tagger output data.
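The reported figures can be checked against the standard balanced F-measure, the harmonic mean of precision and recall, assuming that is the variant the authors used. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f_measure(93.99, 92.9971)
print(round(f1, 2))
```

Rounded to two decimals, this matches the 93.49% F-measure quoted in the abstract.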
Automated Error Correction and Validation for POS Tagging of Hindi
2018
The Part-Of-Speech tag of a word can provide crucial information for a large number of tasks, so it is of utmost importance that POS tagged data is accurate. However, manually checking the data is a tedious and time-consuming task. Thus, there is a need for an Automatic Error Correction and Validation model for any POS tagged data. In this paper, we work towards this goal for Hindi POS tagging, using an ensemble model consisting of three POS tagging models. Based on the predictions made by the three models and the POS tag present in the dataset, the ensemble model predicts the presence of an error. The POS tagging models explored were the Hidden Markov Model, Support Vector Machine, Conditional Random Fields, Long Short Term Memory (LSTM) Networks, Bidirectional LSTM Networks, and Logistic Regression. A Fully Connected Neural Network was used to build the ensemble model, and it achieved an accuracy of 94.02%.
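The paper's ensemble is a trained fully connected neural network. The sketch below substitutes a much simpler majority-vote rule purely to illustrate the inputs and output of such an error detector: three model predictions plus the dataset tag go in, and an error flag comes out. The function name `flag_suspect` and the tag labels are assumptions.

```python
from collections import Counter


def flag_suspect(dataset_tag, model_tags):
    """Flag a token when a strict majority of models disagree with its tag.

    A simplified stand-in for a learned ensemble: no training involved.
    """
    majority_tag, votes = Counter(model_tags).most_common(1)[0]
    has_majority = votes > len(model_tags) / 2
    return has_majority and majority_tag != dataset_tag


# Two of three hypothetical taggers predict VM, but the dataset says NN.
print(flag_suspect("NN", ["VM", "VM", "NN"]))
```

A learned ensemble can weight the individual models differently per tag, which a fixed voting rule cannot do.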
CURRENT STATE OF THE ART POS TAGGING FOR INDIAN LANGUAGES – A STUDY
iaeme
Parts-of-speech (POS) tagging is the basic building block of any Natural Language Processing (NLP) tool. A POS tagger has many applications. For Indian languages especially, POS tagging adds many more dimensions, as most of them are agglutinative, morphologically very rich, highly inflected and sometimes diglossic. Taggers have been developed using linguistic rules, stochastic models or both. This paper is a survey of different POS taggers developed in the recent past for eight Indian languages, namely Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Manipuri and Assamese.
Survey of various POS tagging techniques for Indian regional languages
2015
Part of Speech (POS) tagging is an important tool for processing natural languages. It is one of the simplest as well as most stable statistical models for many Natural Language Processing (NLP) applications. It is the process of marking up a word in a corpus as corresponding to a particular part of speech, such as noun, verb, adjective or adverb. There are many challenges in POS tagging, such as foreign words, ambiguities and ungrammatical input. In this paper, various POS tagging techniques for Indian regional languages are compared in detail. Keywords: POS-Part of Speech, Rule Based Approach, Statistical Approach, Hybrid Approach.
EXPERIMENTAL ANALYSIS OF MALAYALAM POS TAGGER USING
In Natural Language Processing (NLP), one of the well-studied problems under constant exploration is part-of-speech tagging, also called POS tagging or grammatical tagging. The task is to assign labels or syntactic categories such as noun, verb, adjective, adverb, preposition etc. to the words in a sentence or in an un-annotated corpus. This paper presents a simple machine learning based experimental study for POS tagging using a new structured prediction framework known as EPIC, developed in the Scala programming language. This paper is the first of its kind to perform POS tagging for an Indian language using the EPIC framework. In this framework, the corpus contains labelled Malayalam sentences in domains like health, tourism and general (news, stories). The EPIC framework uses conditional random fields (CRF) for building tagging models. The framework provides several parameters that can be adjusted to improve accuracy and thereby arrive at a better POS tagger model. The overall accuracy was calculated separately for each domain, with maximum accuracies of 85.48%, 85.39%, and 87.35% obtained on small tagged data in the health, tourism and general domains, respectively.
Issues and Challenges in Developing Statistical POS Taggers for Sambalpuri
Proceedings of 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, ISBN No-978-83-932640-8-7, 2015
Low-density languages are also known as lesser-known, poorly-described, less-resourced, minority or less-computerized languages because they have fewer resources available. Collecting and annotating a voluminous corpus for these languages proves to be quite daunting. For developing any NLP application for a low-density language, one needs to have an annotated corpus and a standard scheme for annotation. Because of their non-standard usage in text and other linguistic nuances, they pose significant challenges that are both linguistic and technical in nature. The present paper highlights some of the underlying issues and challenges in developing statistical POS taggers applying SVM and CRF++ for Sambalpuri, a less-resourced Eastern Indo-Aryan language. A corpus of approximately 121k is collected from the web and converted into Unicode encoding. The whole corpus is annotated under the BIS (Bureau of Indian Standards) annotation scheme devised for Odia under the ILCI (Indian Languages Corpora Initiative) Corpora Project. Both taggers are trained and tested with approximately 80k and 13k respectively. The SVM tagger provides 83% accuracy, while the CRF++ tagger reaches 71.56%, lower than the former.
INSIGHT OF VARIOUS POS TAGGING TECHNIQUES FOR HINDI LANGUAGE
Natural language processing (NLP) is the process of extracting meaningful information from natural language. Part of speech (POS) tagging is considered one of the important tools for natural language processing. Part of speech tagging is the process of assigning a tag to every word in a sentence as a particular part of speech, such as noun, pronoun, adjective, verb, adverb, preposition, conjunction etc. Hindi is a natural language, so there is a need to perform natural language processing on Hindi sentences. This paper discusses a hybrid approach to POS tagging on a Hindi corpus and reviews different techniques for part of speech tagging of the Hindi language. KEYWORDS: Hidden Markov Model, POS Tagging, Hindi Word Net & Hybrid.