A Trigram HMM Model For Solving Parts-of-Speech (PoS) Tagging Problems

Improving part-of-speech tagging using lexicalized HMMs

Natural Language Engineering, 2004

We introduce a simple method to build Lexicalized Hidden Markov Models (L-HMMs) for improving the precision of part-of-speech tagging. This technique enriches the contextual language model by taking into account an empirically selected set of words. The evaluation was conducted with different lexicalization criteria on the Penn Treebank corpus using the TnT tagger. This lexicalization reduced the tagging error by about 6% on unseen test data, without reducing the efficiency of the system. We have also studied how the use of linguistic resources, such as dictionaries and morphological analyzers, improves the tagging performance. Furthermore, we have conducted an exhaustive experimental comparison showing that Lexicalized HMMs yield results that are better than or similar to those of other state-of-the-art part-of-speech tagging approaches. Finally, we have applied Lexicalized HMMs to the Spanish corpus LexEsp.

A second-order Hidden Markov Model for part-of-speech tagging

Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 1999

This paper describes an extension to the hidden Markov model for part-of-speech tagging using second-order approximations for both contextual and lexical probabilities. This model increases the accuracy of the tagger to state of the art levels. These approximations make use of more contextual information than standard statistical systems. New methods of smoothing the estimated probabilities are also introduced to address the sparse data problem.
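As a rough illustration of a second-order contextual model, the sketch below estimates P(t3 | t1, t2) from tag trigram counts and smooths it against the sparse data problem. It is not the paper's method: the abstract does not describe its new smoothing techniques, so plain linear interpolation with fixed weights is swapped in here, and only the contextual (tag transition) probabilities are covered, not the second-order lexical ones. The names `train_counts`, `transition_prob`, `toy_corpus`, the boundary markers, and the weights `(0.1, 0.3, 0.6)` are all hypothetical.

```python
from collections import Counter

START, STOP = "<s>", "</s>"  # hypothetical sentence boundary markers


def train_counts(tagged_sentences):
    """Collect tag n-gram counts from sentences of (word, tag) pairs."""
    c = {name: Counter() for name in ("uni", "bi", "tri", "ctx1", "ctx2")}
    for sent in tagged_sentences:
        tags = [START, START] + [tag for _, tag in sent] + [STOP]
        for i in range(2, len(tags)):
            t1, t2, t3 = tags[i - 2], tags[i - 1], tags[i]
            c["uni"][t3] += 1
            c["bi"][(t2, t3)] += 1
            c["tri"][(t1, t2, t3)] += 1
            c["ctx1"][t2] += 1          # how often t2 appears as a one-tag history
            c["ctx2"][(t1, t2)] += 1    # how often (t1, t2) appears as a two-tag history
    return c


def transition_prob(c, t1, t2, t3, lambdas=(0.1, 0.3, 0.6)):
    """Interpolated estimate of P(t3 | t1, t2); weights are assumed, not tuned."""
    l1, l2, l3 = lambdas
    total = sum(c["uni"].values())
    p1 = c["uni"][t3] / total if total else 0.0
    p2 = c["bi"][(t2, t3)] / c["ctx1"][t2] if c["ctx1"][t2] else 0.0
    p3 = c["tri"][(t1, t2, t3)] / c["ctx2"][(t1, t2)] if c["ctx2"][(t1, t2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3


# Toy data, invented for illustration only.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
counts = train_counts(toy_corpus)
seen = transition_prob(counts, "DET", "NOUN", "VERB")
unseen_context = transition_prob(counts, "VERB", "DET", "NOUN")
```

Because the lower-order terms back up the trigram estimate, a tag sequence whose two-tag history never occurred in training (such as `VERB DET` above) still receives a nonzero probability. In practice the interpolation weights would be estimated from held-out data rather than fixed by hand.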

TOWARDS EFFICIENT PART-OF-SPEECH TAGGING FOR THE KANURI LANGUAGE: A HIDDEN MARKOV MODEL-BASED SOLUTION

Nigerian Journal of Engineering Science and Technology Research, 2024

Kanuri is a Nilo-Saharan language spoken in the Lake Chad basin of West and Central Africa. Effective part-of-speech (POS) tagging is crucial for natural language processing tasks in Kanuri, such as machine translation, information extraction, and text generation. However, the lack of comprehensive linguistic resources and annotated datasets for Kanuri has hindered the development of accurate POS taggers for this language. This study aims to develop a part-of-speech tagger for the Kanuri language using a Hidden Markov Model (HMM) approach. The goal is to create a robust and accurate POS tagging system that can support various natural language processing applications for the Kanuri language. The study involved developing a corpus of Kanuri text collected from various sources. An HMM-based POS tagging model was designed and trained on the annotated Kanuri corpus. The HMM-based POS tagger achieved an overall accuracy of 0.827 (82.7%) on the Kanuri test data. The developed HMM-based POS tagger can be integrated into various natural language processing pipelines for Kanuri, enabling more advanced language analysis and understanding tasks. Additionally, the annotated Kanuri corpus can be used to expand and improve the POS tagging model further and to support the development of other language technologies for the Kanuri language.

HMM based POS Tagger for a Relatively Free Word Order Language

We present an implementation of a part-of-speech tagger based on a hidden Markov model for Tamil, a relatively free word order, morphologically productive and agglutinative language. In an HMM we assume that the probability of an item in a sequence depends on its immediate predecessor. That is, the tag for the current word depends on the previous word and its tag. In the state sequence the tags are considered as states, and the transition from one state to another has a transition probability. The emission probability is the probability of observing a symbol in a particular state. To find the best tag sequence, we use the Viterbi algorithm. The basic tag set, including the inflection tags, contains 53 tags. Because Tamil is an agglutinative language, each word can take different combinations of tags. Compound words are also used very often. As a result, the tagset grows to 350 as the number of combinations increases. The tagger is trained on a corpus of 25,000 words and tested on over 5,000 words. The training corpus is tagged with the combination of basic tags and tags for the inflection of the word. The evaluation gives encouraging results.
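The decoding step described in this abstract can be sketched as follows. This is a minimal first-order Viterbi decoder, assuming the standard formulation in which each tag depends only on the previous tag and each word only on its own tag; it is not the authors' implementation. The function name `viterbi`, the three-tag English tagset, and every probability in the tables are hypothetical toy values, not taken from the Tamil tagger.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely tag sequence for `words` under a first-order HMM."""
    if not words:
        return []

    def lp(p):
        # Work in log space; floor zero probabilities so log() stays defined.
        return math.log(max(p, floor))

    # best[i][t]: best log-probability of any tag path ending in tag t at position i
    best = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0)) for t in tags}]
    back = [{}]  # back[i][t]: the previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model, invented for illustration only.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# prints ['DET', 'NOUN', 'VERB']
```

Each position costs time quadratic in the number of tags, so a tagset that grows from 53 basic tags to 350 combined tags, as reported above, makes decoding noticeably more expensive and also spreads the training counts more thinly.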

Part of Speech Tagging in Manipuri with Hidden Markov Model

Part of Speech tagging in Manipuri is a very complex task, as Manipuri is highly agglutinating in nature. There is not enough tagged corpus data for Manipuri that can be used in any statistical analysis of the language. In this tagging model we use the tagged output of the Manipuri rule-based tagger as the tagged corpus. The present paper expounds Part of Speech tagging in Manipuri by applying a stochastic model called the Hidden Markov Model.

Hidden Markov model with rule based approach for part of speech tagging of Myanmar language

Proceedings of the 3rd International Conference on …, 2009

Part-of-Speech (POS) tagging is the process of assigning to each word the category that best suits the definition of the word as well as the context of the sentence in which it is used. There are different approaches to the problem of POS tagging. In this paper, we use two approaches (rule based, and Hidden Markov Model with rule based approach) and compare the performance of these techniques for tagging the Myanmar language. These approaches use supervised POS tagging, which requires a large annotated training corpus to tag properly. At this initial stage of POS tagging for the Myanmar language, we have very limited annotated corpus resources. We tried to see which technique maximizes performance with these limited resources. In our experiments, the best configuration uses the HMM with rule based approach, with an accuracy of 97.56%. Therefore, this approach performs better than the rule based approach.

HMM BASED POS TAGGER FOR HINDI

Part of Speech tagging in Indian languages is still an open problem. We still lack a clear approach to implementing a POS tagger for Indian languages. In this paper we describe our efforts to build a Hidden Markov Model based Part of Speech tagger. We have used the IL POS tag set for the development of this tagger. We have achieved an accuracy of 92%.

Computational Analysis of Part of Speech Tagging

2012

To make text a suitable input to an automatic method of information extraction, it is usually transformed from an unstructured source of information into a structured format. Part of Speech tagging is one of the preprocessing steps; it assigns one of the parts of speech to each given word. In this paper we discuss various supervised and unsupervised models, compare these techniques based on accuracy, and experimentally compare the results obtained with the Conditional Random Field and Maximum Entropy models. We deployed a part of speech tagger and compared its results with those of other models. The developed tagger is based on the HMM approach and showed good results in terms of efficiency in comparison with other models.

Part of Speech Tagging for Bengali with Hidden Markov Model

Proceeding of the NLPAI Machine Learning …, 2006

This report describes our work on Bengali part-of-speech (POS) tagging for the NLPAI Machine Learning contest 2006. We use a Hidden Markov Model (HMM) based stochastic tagger. The tagger makes use of morphological and contextual information of words. Since only a small labeled training set is provided (41,000 words), an HMM-based approach does not yield very good results. In this work, we have used a morphological analyzer to improve the performance of the tagger. Further, we have made use of semi-supervised learning by augmenting the small labeled training set with a larger unlabeled training set (100,000 words). The tagger has an accuracy of about 89% on the test data provided.

Analysis of Part of Speech Tagging

2012

In the area of text mining, Natural Language Processing is an emerging field. As text is an unstructured source of information, it is usually transformed into a structured format to make it a suitable input to an automatic method of information extraction. Part of Speech tagging is one of the preprocessing steps; it performs semantic analysis by assigning one of the parts of speech to each given word. In this paper we discuss various supervised and unsupervised models, compare these techniques based on accuracy, and experimentally compare the results obtained with the supervised Conditional Random Field and supervised Maximum Entropy models. We deployed a part of speech tagger based on the Hidden Markov Model approach and compared its results with those of other models. We also discuss the problems that occur with supervised part of speech tagging.