The design and implementation of a part of speech tagger for English

Computational Analysis of Part of Speech Tagging

2012

In order to make text a suitable input to an automatic method of information extraction, it is usually transformed from an unstructured source of information into a structured format. Part of Speech Tagging is one of the preprocessing steps, which assigns one of the parts of speech to a given word. In this paper we discuss various models of supervised and unsupervised techniques, present a comparison of those techniques based on accuracy, and experimentally compare the results obtained with the Conditional Random Field and Maximum Entropy models. We deployed a part of speech tagger model and compared its results with those of other models. The developed tagger is based on the HMM approach and shows good results in terms of efficiency in comparison with other models.

Tagging Accuracy Analysis on Part-of-Speech Taggers

Journal of Computer and Communications, 2014

Part of Speech (POS) Tagging can be applied with several tools and in several programming languages. This work focuses on the Natural Language Toolkit (NLTK) library in the Python environment and the gold-standard corpora installable with it. The corpora and tagging methods are analyzed and compared using the Python language. Different taggers are analyzed according to their tagging accuracies with data from three different corpora. In this study, we have analyzed the Brown, Penn Treebank and NPS Chat corpora. The taggers we have used for the analysis are the default tagger, the regex tagger, and n-gram taggers. We have applied all taggers to these three corpora and shown that whereas the Unigram tagger does the best tagging in all corpora, a combination of taggers does better if it is correctly ordered. Additionally, we have seen that the NPS Chat corpus gives different accuracy results than the other two corpora.
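The backoff ordering discussed in this abstract, trying a more specific tagger first and falling back to a more general one for unseen words, can be sketched in plain Python (a simplified illustration of the idea; the class names here are ours, not the NLTK API):

```python
from collections import Counter, defaultdict

class DefaultTagger:
    """Assigns a fixed tag to every token (the last resort in a backoff chain)."""
    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag

class UnigramTagger:
    """Tags each word with its most frequent tag in the training data,
    deferring to a backoff tagger for words never seen in training."""
    def __init__(self, tagged_words, backoff=None):
        counts = defaultdict(Counter)
        for word, tag in tagged_words:
            counts[word][tag] += 1
        # Most frequent tag per seen word.
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.best:
            return self.best[word]
        return self.backoff.tag_word(word) if self.backoff else None

train = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"), ("the", "DT")]
tagger = UnigramTagger(train, backoff=DefaultTagger("NN"))
print(tagger.tag_word("runs"))   # seen word -> VBZ
print(tagger.tag_word("zebra"))  # unseen word falls back to the default -> NN
```

The ordering matters exactly as the abstract states: putting the general tagger first would mask the more accurate specific one, whereas placing it last catches only the leftovers.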

Analysis of Part of Speech Tagging

2012

In the area of text mining, Natural Language Processing is an emerging field. As text is an unstructured source of information, to make it a suitable input to an automatic method of information extraction it is usually transformed into a structured format. Part of Speech Tagging is one of the preprocessing steps, which performs semantic analysis by assigning one of the parts of speech to a given word. In this paper we discuss various models of supervised and unsupervised techniques, present a comparison of those techniques based on accuracy, and experimentally compare the results obtained with the Supervised Conditional Random Field and Supervised Maximum Entropy models. We deployed a part of speech tagger model based on the Hidden Markov Model approach and compared its results with other models. We also discuss the problems occurring with supervised part of speech tagging.
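An HMM tagger of the kind compared in these papers picks the tag sequence that maximizes the product of transition and emission probabilities, typically via the Viterbi algorithm. A minimal sketch with toy, hand-set probabilities (illustrative only, not trained values):

```python
# Toy HMM POS tagger: the probabilities below are invented for illustration.
states = ["DT", "NN", "VBZ"]
start = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
trans = {
    "DT":  {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN":  {"DT": 0.1,  "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.4,  "NN": 0.4, "VBZ": 0.2},
}
emit = {
    "DT":  {"the": 0.9},
    "NN":  {"dog": 0.5, "runs": 0.1},
    "VBZ": {"runs": 0.5},
}

def viterbi(words):
    """Return the most likely tag sequence under the toy HMM."""
    # V[t] holds (probability, best path ending in tag t) for the current word;
    # unseen word/tag pairs get a tiny smoothing probability.
    V = {t: (start[t] * emit[t].get(words[0], 1e-6), [t]) for t in states}
    for w in words[1:]:
        V = {
            t: max(
                (p * trans[s][t] * emit[t].get(w, 1e-6), path + [t])
                for s, (p, path) in V.items()
            )
            for t in states
        }
    return max(V.values())[1]

print(viterbi(["the", "dog", "runs"]))  # -> ['DT', 'NN', 'VBZ']
```

With real corpora the probabilities are estimated from tagged training data, and the data-sparseness issues the abstracts mention arise precisely from the many word/tag pairs never observed in training.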

Probabilistic Part Of Speech Tagging for Bahasa Indonesia

2009

In this paper we report our work in developing Part of Speech Tagging for Bahasa Indonesia using probabilistic approaches. We use Conditional Random Fields (CRF) and Maximum Entropy methods for assigning a tag to a word. We use two tagsets containing 37 and 25 part-of-speech tags for Bahasa Indonesia. In this work we compared both methods using two different corpora. The results of the experiments show that the Maximum Entropy method gives the best result.

Markov random field based English part-of-speech tagging system

Proceedings of the 16th conference on Computational linguistics -, 1996

Probabilistic models have been widely used for natural language processing. Part-of-speech tagging, which assigns the most likely tag to each word in a given sentence, is one of the problems which can be solved by a statistical approach. Many researchers have tried to solve the problem with the hidden Markov model (HMM), which is well known as one of the statistical models. But it has many difficulties: integrating heterogeneous information, coping with the data sparseness problem, and adapting to new environments. In this paper, we propose a Markov random field (MRF) model based approach to the tagging problem. The MRF provides the base frame to combine various statistical information with the maximum entropy (ME) method. As the Gibbs distribution can be used to describe the a posteriori probability of a tagging, we use it in maximum a posteriori (MAP) estimation in the optimizing process. Besides, several tagging models are developed to show the effect of adding information. Experimental results show that the performance of the tagger improves as we add more statistical information, and that the MRF-based tagging model is better than the HMM-based tagging model on the data sparseness problem.
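The Gibbs-distribution posterior and MAP decision mentioned above can be written schematically as follows (a generic maximum-entropy form; the feature functions $f_i$, weights $\lambda_i$, and normalizer $Z$ are standard notation assumed here, not taken from the paper):

```latex
P(t \mid w) \;=\; \frac{1}{Z(w)} \exp\!\Big( \sum_i \lambda_i \, f_i(t, w) \Big),
\qquad
\hat{t} \;=\; \arg\max_{t} \, P(t \mid w)
```

Each feature function $f_i$ encodes one statistical information source, which is how the MRF frame combines heterogeneous evidence that a plain HMM cannot easily integrate.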

Performance analysis of a part of speech tagging task

Proceedings of the 4th international conference on …, 2003

In this paper, we attempt a formal analysis of the performance of automatic part of speech tagging. Lower and upper bounds on tagging precision using existing taggers or their combination are provided. Since we show that, with existing taggers, automatic perfect tagging is not possible, we offer two solutions for applications requiring very high precision: (1) a solution involving minimum human intervention for a precision of over 98.7%, and (2) a combination of taggers using a memory-based learning algorithm that succeeds in reducing the error rate by 11.6% with respect to the best tagger involved.

Unsupervised Part-of-Speech Tagging in the Large

Research on Language and Computation, 2009

Syntactic preprocessing is a step that is widely used in NLP applications. Traditionally, rule-based or statistical Part-of-Speech (POS) taggers are employed that either need considerable rule development time or a sufficient amount of manually labeled data. To alleviate this acquisition bottleneck and to enable preprocessing for minority languages and specialized domains, a method is presented that constructs a statistical syntactic tagger model from a large amount of unlabeled text data. The method presented here is called unsupervised POS-tagging, as its application results in corpus annotation comparable to what POS-taggers provide. Nevertheless, its application results in slightly different categories as opposed to what is assumed by a linguistically motivated POS-tagger. These differences hamper evaluation procedures that compare the output of the unsupervised POS-tagger to a tagging with a supervised tagger. To measure the extent to which unsupervised POS-tagging can contribute in application-based settings, the system is evaluated in supervised POS-tagging, word sense disambiguation, named entity recognition and chunking. Unsupervised POS-tagging has been explored since the beginning of the 1990s. Unlike in previous approaches, the kind and number of different tags is here generated by the method itself. Another difference from other methods is that not all words above a certain frequency rank get assigned a tag; the method is allowed to exclude words from the clustering if their distribution does not match closely enough with other words. The lexicon size is considerably larger than in previous approaches, resulting in a lower out-of-vocabulary (OOV) rate and in a more consistent tagging. The system presented here is available for download as open-source software along with tagger models for several languages, so the contributions of this work can be easily incorporated into other applications.

Revisiting the principles behind stochastic part-of-speech tagging from a theoretical linguistic perspective

2018

Part-of-speech (POS) taggers serve as the foundation of almost any NLP technology. Since the beginning of the 1990s, when the Penn Treebank project set the principles behind its annotated corpus, the standard taggers have adopted those principles as the standard. Indeed, a deeper look at the tagging results of the common taggers reveals that despite some minor strengths and weaknesses that each of them presents, all perform quite similarly and tend to make the same mistakes. In attempting to improve the taggers' outcomes, it became clear that some of the fundamental principles behind their operation should be revisited. The current article examines the validity of these principles from a theoretical linguistics perspective and presents a way to adapt POS tagging results to the linguistic reality without modifying the probabilistic algorithms, namely, by applying pre- and post-tagging linguistic-based rules to the original input and to the automatic tagging results, respectively.