D.: Using a morphological analyzer in high precision POS tagging of Hungarian

Using a morphological analyzer in high precision POS tagging of Hungarian

Proceedings of LREC, 2006

The paper presents an evaluation of maxent POS disambiguation systems that incorporate an open-source morphological analyzer to constrain the probabilistic models. The experiments show that the best proposed architecture, which is the first application of the maximum entropy framework in a Hungarian NLP task, outperforms comparable state-of-the-art tagging methods and is able to handle out-of-vocabulary items robustly, allowing for efficient analysis of large (web-based) corpora.
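One way to picture "constraining" a probabilistic tagger with a morphological analyzer is sketched below. This is only an illustration under stated assumptions, not the architecture evaluated in the paper: the function name `constrained_tag`, the toy analyzer, the tag labels, and the scores are all hypothetical.

```python
def constrained_tag(word, model_scores, analyze):
    """Pick the highest-scoring tag among those the analyzer allows for `word`.

    model_scores: dict mapping tag -> score from some probabilistic model (assumed).
    analyze: callable returning the set of tags the analyzer licenses (assumed).
    """
    allowed = analyze(word)
    candidates = {t: s for t, s in model_scores.items() if t in allowed}
    if not candidates:
        # The analyzer offered no usable analysis: fall back to the model alone.
        candidates = model_scores
    return max(candidates, key=candidates.get)


# Hypothetical toy analyzer: knows one Hungarian word form ("házak", "houses").
toy_lexicon = {"házak": {"NOUN.PL"}}


def toy_analyze(word):
    return toy_lexicon.get(word, set())


scores = {"VERB": 0.5, "NOUN.PL": 0.4, "ADJ": 0.1}
print(constrained_tag("házak", scores, toy_analyze))  # analyzer overrules the model
print(constrained_tag("xyz", scores, toy_analyze))    # no analysis: model decides
```

The point of the sketch is that the analyzer narrows the candidate set before the model chooses, so a word never seen in training can still receive a morphologically plausible tag.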

BRILL'S POS TAGGER WITH EXTENDED LEXICAL TEMPLATES FOR HUNGARIAN

Proceedings of the Workshop (W01) on Machine …, 1999

In this paper Brill's rule-based PoS tagger is tested and adapted to Hungarian. It is shown that the present system does not reach as high an accuracy for Hungarian as it does for English because of the structural differences between these languages. Hungarian has rich morphology, is agglutinative with inflectional characteristics and has free word order. The tagger has the greatest difficulties with parts of speech belonging to open classes because of their complicated morphological structure. The accuracy of tagging can be increased from 83% to 97% by changing the rule-generating mechanisms, namely the lexical templates in the lexical training module.

Using a Morphological Database to Increase the Accuracy in POS tagging

We experiment with extending the dictionaries used by three open-source part-of-speech taggers, by using data from a large Icelandic morphological database. We show that the accuracy of the taggers can be improved significantly by using the database. The reason is that the unknown word ratio drops dramatically when data from the database is added to the taggers’ dictionaries. For the best performing tagger, the overall tagging accuracy increases from the base tagging result of 92.73% to 93.32%, when the unknown word ratio decreases from 6.8% to 1.1%. When we add reliable frequency information to the tag profiles for some of the words originating from the database, we are able to increase the accuracy further to 93.48%; this is equivalent to a 10.3% error reduction compared to the base tagger.
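The 10.3% figure is a relative error reduction, which can be checked directly from the two accuracies quoted above. The helper name `relative_error_reduction` is an assumption introduced here for illustration.

```python
def relative_error_reduction(base_acc, new_acc):
    """Relative reduction in error rate (percent), given accuracies in percent."""
    base_err = 100.0 - base_acc  # error rate of the base tagger
    new_err = 100.0 - new_acc    # error rate of the improved tagger
    return 100.0 * (base_err - new_err) / base_err


# Base accuracy 92.73% (error 7.27%), improved accuracy 93.48% (error 6.52%).
print(round(relative_error_reduction(92.73, 93.48), 1))  # 10.3
```

An absolute gain of 0.75 percentage points therefore removes roughly one in ten of the base tagger's errors.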

Efficient stochastic part-of-speech tagging for Hungarian

Proc. of the Third LREC, 2002

Many of the methods developed for Western European languages and widely used to produce annotated language resources cannot readily be applied to Central and Eastern European languages, due to the large number of novel phenomena exhibited in the syntax and morphology of these languages, which these methods have to handle but have not been designed to cope with. The process of morphological tagging, when applied to Hungarian data to produce corpora annotated at least at the morphosyntactic level, is most indicative of this problem: several of the algorithms (either rule-based or statistical) that have been used very successfully in other domains cannot readily be applied to a language exhibiting such a varied morphology and huge number of wordforms as Hungarian. The paper will describe a robust tagging scenario for Hungarian using a relatively simple stochastic system augmented with external morphological processing, which can overcome the two most conspicuous problems: the complexity of morphosyntactic descriptions and, most importantly, the huge number of possible wordforms.

Designing HMM-Based Part-of-Speech Tagger for Lithuanian Language

Informatica

This paper describes a preliminary experiment in designing a Hidden Markov Model (HMM)-based part-of-speech tagger for the Lithuanian language. Part-of-speech tagging is the problem of assigning to each word of a text the proper tag in its context of appearance. It is accomplished in two basic steps: morphological analysis and disambiguation. In this paper, we focus on the problem of disambiguation, i.e., on the problem of choosing the correct tag for each word in the context of a set of possible tags. We constructed a stochastic disambiguation algorithm, based on supervised learning techniques, to learn hidden Markov model's parameters from hand-annotated corpora. The Viterbi algorithm is used to assign the most probable tag to each word in the text.
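The disambiguation step described above can be sketched as a standard Viterbi decoder over a bigram HMM. This is a minimal generic sketch, not the paper's implementation: the toy tag set, the English example sentence, all probabilities, and the `floor` smoothing constant are assumptions made for illustration.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []

    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best tag path ending in tag t at position i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Hypothetical toy model; real parameters would be estimated from an annotated corpus.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit_p = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.6},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
```

Working in log space avoids numerical underflow on long sentences. In a two-step setup like the one the abstract describes, the candidate tags per word would come from the morphological analysis rather than from the full tag set.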

Enriching the knowledge sources used in a maximum entropy part-of-speech tagger

2000

This paper presents results for a maximum-entropy-based part-of-speech tagger, which achieves superior performance principally by enriching the information sources used for tagging. In particular, we get improved results by incorporating these features: (i) more extensive treatment of capitalization for unknown words; (ii) features for the disambiguation of the tense forms of verbs; (iii) features for disambiguating particles from prepositions and adverbs. The best resulting accuracy for the tagger on the Penn Treebank is 96.86% overall, and 86.91% on previously unseen words.
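To make item (i) concrete, the sketch below shows the general shape of binary capitalization features that a maximum entropy tagger might compute for an unknown word. The feature names and the exact set are assumptions for illustration only; they are not the feature templates used in the paper.

```python
def capitalization_features(tokens, i):
    """Binary surface features for tokens[i]; an illustrative set, not the paper's."""
    word = tokens[i]
    init_cap = word[:1].isupper()
    return {
        "init_cap": init_cap,
        "all_caps": word.isupper(),
        # Capitalization away from the sentence start is a stronger proper-noun cue.
        "init_cap_mid_sentence": init_cap and i > 0,
        "has_digit": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
    }


sentence = ["He", "visited", "Budapest"]
print(capitalization_features(sentence, 2))
```

Each feature would be paired with a candidate tag and given a weight during training, so that, for example, mid-sentence capitalization raises the score of proper-noun tags for unseen words.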

Improving part-of-speech tagging accuracy for Croatian by morphological analysis

This paper investigates several methods of combining a second-order hidden Markov model part-of-speech (morphosyntactic) tagger and a high-coverage inflectional lexicon for Croatian. Our primary motivation was to improve tagging accuracy of Croatian texts by using our newly developed tagger CroTag, currently in beta version. We also wanted to compare its tagging results, both standalone and utilizing the morphological lexicon, to the ones previously described, provided by the TnT statistical tagger, which we used as a reference point, bearing in mind that both implement the same tagging procedure. At the beginning we explain the basic idea behind the experiment, its motivation and importance from the perspective of processing the Croatian language. We also describe the tools, namely the tagger and the lexicon, and the language resources used in the experiment, including their implementation method and the input/output format details that were of importance. With the basics presented, we describe in theory four possible methods of combining these resources and tools with respect to their operating paradigm, input and production capabilities, and then put these ideas to the test using the F-measure evaluation framework. Results are then discussed in detail, and conclusions and future work plans are presented.

Using morphological analyzer to statistical POS Tagging on Persian Text

Due to the growing number of textual resources available in digital form, the ability to understand and process them automatically has become critical. The first fundamental step in understanding these resources is to identify the part of speech of each token or word in a sentence in order to disambiguate it. Part-of-speech (POS) tagging is one of the basic tools for understanding and processing natural language, and it is an infrastructural stage in some speech and text processing applications. Several methods have been proposed for POS tagging, each applied in taggers with the aim of achieving high performance and accuracy. Statistical methods have been among the primary techniques and have produced the most successful results in natural language processing in recent years; they have also been adopted widely in other areas of language processing. One of the most important issues in POS tagging systems is identifying unknown words. In this paper, we use a morphological analyzer to identify unknown words. Before tagging, each word is checked morphologically and assigned an appropriate tag, which increases the overall accuracy. We used 5-fold cross-validation to evaluate the proposed tagger. According to the experimental results, the use of text pre-processing and a morphological analyzer in the proposed POS tagger is very effective and improves the performance of the POS tagging system.