A Trigram HMM Model For Solving Parts-of-Speech (PoS) Tagging Problems

Improving part-of-speech tagging using lexicalized HMMs

Natural Language Engineering, 2004

We introduce a simple method to build Lexicalized Hidden Markov Models (L-HMMs) for improving the precision of part-of-speech tagging. This technique enriches the contextual language model by taking into account an empirically selected set of words. The evaluation was conducted with different lexicalization criteria on the Penn Treebank corpus using the TnT tagger. This lexicalization obtained about a 6% reduction of the tagging error on unseen test data, without reducing the efficiency of the system. We have also studied how the use of linguistic resources, such as dictionaries and morphological analyzers, improves the tagging performance. Furthermore, we have conducted an exhaustive experimental comparison showing that Lexicalized HMMs yield results that are better than or similar to those of other state-of-the-art part-of-speech tagging approaches. Finally, we have applied Lexicalized HMMs to the Spanish corpus LexEsp.

A Hidden Markov Model-based Part of Speech Tagger for Shekki’noono Language

International Journal of Computing, 2021

Natural language processing plays a major role in providing an interface for human-computer communication. It enables people to talk with the computer in their formal language rather than machine language. This study presents a part-of-speech (POS) tagger that assigns a word class to each word in a given sentence. Researchers have developed POS taggers for languages such as English, Amharic, Afan Oromo, and Tigrigna. Many other languages, including Shekki'noono, still lack POS taggers. A POS tagger is a basic component of most natural language processing tools, such as machine translation and information extraction. A POS tagger must therefore be developed for a language before advanced natural language applications can be built for it, because those applications enhance machine-to-machine, machine-to-human, and human-to-human communication. However, a POS tagger built for one language cannot be applied directly to another. To develop the Shekki'noono POS tagger, we used the stochastic Hidden Markov Model (HMM). For the study, we used 1500 sentences collected from different sources such as newspapers (covering social, economic, and political aspects), modules, textbooks, radio programs, and bulletins. Language experts labeled each word in the collected sentences with its appropriate part of speech. In the experiments, the tagger was trained on the training sets using the Hidden Markov Model. The HMM-based POS tagger achieved 92.77% accuracy for Shekki'noono. The model is also compared with previous HMM-based experiments in related work. As future work, the proposed approach can be evaluated on a larger corpus.

A second-order Hidden Markov Model for part-of-speech tagging

Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 1999

This paper describes an extension to the hidden Markov model for part-of-speech tagging using second-order approximations for both contextual and lexical probabilities. This model increases the accuracy of the tagger to state of the art levels. These approximations make use of more contextual information than standard statistical systems. New methods of smoothing the estimated probabilities are also introduced to address the sparse data problem.
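For orientation, one common way to write a second-order tagging objective is sketched below. The notation ($w_i$ for words, $t_i$ for tags) is an assumption and does not come from the abstract, and conditioning the lexical term on the previous tag is only one reading of "second-order lexical probabilities". The smoothing methods the paper introduces are not shown.

```latex
\hat{t}_{1:n} \;=\; \arg\max_{t_{1:n}} \prod_{i=1}^{n}
    \underbrace{P(t_i \mid t_{i-2},\, t_{i-1})}_{\text{contextual}}
    \;\underbrace{P(w_i \mid t_{i-1},\, t_i)}_{\text{lexical}}
```

A first-order model would instead use $P(t_i \mid t_{i-1})$ and $P(w_i \mid t_i)$, which is why the second-order form draws on more context and is more exposed to sparse data.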

TOWARDS EFFICIENT PART-OF-SPEECH TAGGING FOR THE KANURI LANGUAGE: A HIDDEN MARKOV MODEL-BASED SOLUTION

Nigerian Journal of Engineering Science and Technology Research, 2024

Kanuri is a Nilo-Saharan language spoken in the Lake Chad basin of West and Central Africa. Effective part-of-speech (POS) tagging is crucial for natural language processing tasks in Kanuri, such as machine translation, information extraction, and text generation. However, the lack of comprehensive linguistic resources and annotated datasets for Kanuri has hindered the development of accurate POS taggers for this language. This study aims to develop a part-of-speech tagger for the Kanuri language using a Hidden Markov Model (HMM) approach. The goal is to create a robust and accurate POS tagging system that can be used to support various natural language processing applications for the Kanuri language. The study involved the development of a corpus of Kanuri text collected from various sources. An HMM-based POS tagging model was designed and trained on the annotated Kanuri corpus. The HMM-based POS tagger achieved an overall accuracy of 0.827 % on the Kanuri test data. The developed HMM-based POS tagger can be integrated into various natural language processing pipelines for Kanuri, enabling more advanced language analysis and understanding tasks. Additionally, the annotated Kanuri corpus can be used to expand and improve the POS tagging model further and support the development of other language technologies for the Kanuri language.

HMM based POS Tagger for a Relatively Free Word Order Language

We present an implementation of a part-of-speech tagger based on a hidden Markov model (HMM) for Tamil, a relatively free word order, morphologically productive, and agglutinative language. In an HMM we assume that the probability of an item in a sequence depends on its immediate predecessor. That is, the tag for the current word depends on the previous word and its tag. In the state sequence, the tags are considered as states, and the transition from one state to another has a transition probability. The emission probability is the probability of observing a symbol in a particular state. To achieve this, we use the Viterbi algorithm. The basic tag set, including inflection, has 53 tags. Because Tamil is an agglutinative language, each word can carry different combinations of tags. Compound words are also used very often. The tagset therefore grows to 350 as the number of combinations increases. The tagger is trained on a corpus of 25,000 words and tested on over 5,000 words. The training corpus is tagged with combinations of basic tags and tags for the inflection of the word. The evaluation gives encouraging results.
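The states, transition probabilities, emission probabilities, and Viterbi decoding described above can be sketched as follows. This is a minimal illustration of a generic first-order HMM tagger, not the authors' implementation: the function name `viterbi`, the three-tag English toy model, and every probability value are hypothetical, and the sketch conditions each tag only on the previous tag.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence for a non-empty word list (first-order HMM)."""
    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[i][t]: best log-probability of any tag path ending in tag t at word i.
    best = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy model; all numbers are made up for illustration.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.6},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# → ['DET', 'NOUN', 'VERB']
```

Working in log space avoids numerical underflow on long sentences. With a tagset of 350 combined tags, the inner loop over previous tags dominates the cost, since each word requires on the order of 350 × 350 transition lookups.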

Part of Speech Tagging for Bengali with Hidden Markov Model, proceedings of the NLPAI ML Contest

2006

This report describes our work on Bengali part-of-speech (POS) tagging for the NLPAI Machine Learning contest 2006. We use a Hidden Markov Model (HMM) based stochastic tagger. The tagger makes use of morphological and contextual information of words. Since only a small labeled training set is provided (41,000 words), an HMM-based approach does not yield very good results. In this work, we have used a morphological analyzer to improve the performance of the tagger. Further, we have made use of semi-supervised learning by augmenting the small labeled training set with a larger unlabeled training set (100,000 words). The tagger has an accuracy of about 89% on the test data provided.

Discarding irrelevant parameters in hidden Markov model based part-of-speech taggers

A binary comparative definition of relevance, suggested by empirical results, gives a performance theory of relevance for hidden Markov models (HMMs) that makes it possible to reduce the total number of parameters in the model while improving overall performance of the model in a specific application domain. Generalizations of this view of relevance are meaningful in many AI subareas. Another view of this result is that there are at least two kinds of relevance. Knowledge of high quality is more relevant to a conclusion than low quality knowledge; specific knowledge is more relevant than general knowledge. This work argues that one can only be had at the expense of the other.

Persian part of speech tagger based on Hidden Markov Model

2008

This paper introduces the Persian Part of Speech (POS) tagger, based on Hidden Markov Models (HMMs). This POS tagger is part of the Persian Text-to-Speech (TTS) system called ParsGooyan. The tagger supports some properties of TTS systems, such as Break Phrase Detection, Homograph Word Disambiguation, and Lexical Stress Search. A POS lexicon with 61,521 entries and 64,003 trigrams is used as the language model. It is implemented in Festival software and makes use of the Viterbi Decoder provided by Edinburgh Speech Tools. The average overall accuracy for this tagger is 95.11%. The accuracy on known and unknown words is 96.136% and 60.25%, respectively.
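The three accuracy figures above can be related by a short calculation. The sketch below assumes that the overall accuracy is a token-weighted mix of the known-word and unknown-word accuracies; the abstract does not state this (the overall figure is described as an average), so the result is only a rough indication of how rare unknown words would be under that assumption.

```python
# Accuracies reported in the abstract, expressed as fractions.
overall, known, unknown = 0.9511, 0.96136, 0.6025

# Assumption (not stated in the abstract):
#   overall = (1 - u) * known + u * unknown
# where u is the share of test tokens that are unknown words. Solve for u.
u = (known - overall) / (known - unknown)

print(f"implied share of unknown words: {u:.1%}")
```

Under this assumption, unknown words would make up a little under 3% of the test tokens, which would explain why the overall accuracy stays close to the known-word accuracy despite the much lower score on unknown words.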

A Hybrid POS Tagger for Khasi, an Under Resourced Language

International Journal of Advanced Computer Science and Applications

Khasi is an Austro-Asiatic language spoken mainly in the state of Meghalaya, India, and can be considered an under-resourced and under-studied language from the natural language processing perspective. Part-of-speech (POS) tagging, in which a part of speech is assigned automatically to each word in a sentence, is one of the major initial requirements of any natural language processing task. Therefore, it is only natural to initiate the development of a POS tagger for Khasi, and this paper presents the construction of a Hybrid POS tagger for Khasi. The tagger is developed to address the tagging errors of a Khasi Hidden Markov Model (HMM) POS tagger by integrating conditional random fields (CRF). This integration incorporates language features which are otherwise not feasible in an HMM POS tagger. The results of the Hybrid Khasi tagger show a significant improvement in the tagger's accuracy as well as a substantial reduction in most of the tagging confusion of the HMM POS tagger.