An Accurate Persian Part-of-Speech Tagger

Persian part of speech tagger based on Hidden Markov Model

2008

This paper introduces a Persian part-of-speech (POS) tagger based on hidden Markov models (HMMs). The tagger is part of the Persian text-to-speech (TTS) system ParsGooyan and supports TTS-specific tasks such as phrase break detection, homograph disambiguation, and lexical stress search. A POS lexicon with 61,521 entries and 64,003 trigrams serves as the language model. The tagger is implemented in Festival and uses the Viterbi decoder provided by the Edinburgh Speech Tools. Its average overall accuracy is 95.11%; accuracy on known and unknown words is 96.136% and 60.25%, respectively.
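The decoding step named in this abstract can be illustrated with a minimal Viterbi sketch. It is not the ParsGooyan or Edinburgh Speech Tools implementation: it uses a bigram HMM rather than the trigram model the abstract describes, and the tag set, transliterated words, probabilities, and the `unk_p` floor for unseen words are all made-up values for illustration.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, unk_p=1e-6):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    def log_emit(tag, word):
        # Unseen (tag, word) pairs get a small floor probability (assumption).
        return math.log(emit_p[tag].get(word, unk_p))

    # best[i][t] = (log-probability of the best path ending in tag t at
    # position i, tag chosen at position i - 1)
    best = [{t: (math.log(start_p[t]) + log_emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][t]), p) for p in tags
            )
            column[t] = (score + log_emit(t, word), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag to the first word.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]

# Toy model with invented numbers: N = noun, V = verb.
tags = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"ketab": 0.5, "mard": 0.5}, "V": {"did": 0.9, "ketab": 0.1}}

print(viterbi(["mard", "ketab", "did"], tags, start_p, trans_p, emit_p))
```

Working in log space avoids numerical underflow on long sentences; a trigram model would keep the same structure but index each column by pairs of tags.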

Evaluation of statistical part of speech tagging of persian text

2007 9th International Symposium on Signal Processing and Its Applications, 2007

One of the fundamental tasks in natural language processing is part-of-speech (POS) tagging. A POS tagger is a piece of software that reads text in some language and assigns a part-of-speech tag to each word. Our main interest in this research was to see how easily methods used for a language such as English can be applied to a new and different language such as Persian, and how well such approaches perform. This paper presents an evaluation of several part-of-speech tagging methods on Persian text: a statistical tagging method, a memory-based tagging approach, and two versions of Maximum Likelihood Estimation (MLE) tagging. The two MLE versions differ in the way they handle unknown words. We also demonstrate the value of simple heuristics and post-processing in improving the accuracy of these methods. The experiments were conducted on a manually POS-tagged Persian corpus of over two million tagged words. The results are encouraging and comparable with those for other languages such as English, German, or Spanish.
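As a rough illustration of MLE tagging, the sketch below assigns each known word the tag it received most often in training. The abstract does not say how the two MLE versions treat unknown words, so the two strategies shown (a fixed default tag, and the most frequent tag in the training data) are assumptions, as are the function names and the toy data.

```python
from collections import Counter, defaultdict

def train_mle(tagged_sentences):
    """Build a word -> most-frequent-tag lexicon from tagged sentences."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            tag_totals[tag] += 1
    lexicon = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    most_frequent_tag = tag_totals.most_common(1)[0][0]
    return lexicon, most_frequent_tag

def tag_mle(words, lexicon, unknown_tag):
    """Tag known words from the lexicon; give unknown words `unknown_tag`."""
    return [(word, lexicon.get(word, unknown_tag)) for word in words]

# Toy training data with invented tags.
train = [
    [("ketab", "N"), ("did", "V")],
    [("mard", "N"), ("ketab", "N")],
]
lexicon, most_frequent_tag = train_mle(train)

# Strategy A (assumed): a fixed default tag for unknown words.
print(tag_mle(["mard", "xyz", "did"], lexicon, unknown_tag="DEFAULT"))
# Strategy B (assumed): the most frequent tag seen in training.
print(tag_mle(["mard", "xyz", "did"], lexicon, unknown_tag=most_frequent_tag))
```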

Evaluation of part of speech tagging on Persian text

2007

A hidden Markov model for Persian part-of-speech tagging

Procedia Computer Science, 2011

Part-of-speech tagging is one of the important tasks in language processing. Despite this importance, and although numerous models have been presented for other languages, little work has been done for Persian. This paper proposes a part-of-speech tagging system for Persian corpora based on a hidden Markov model. To this end, the main aspects of Persian morphology are introduced and developed. To evaluate its accuracy, the proposed approach is applied in simulations on both homogeneous and heterogeneous Persian corpora. The accuracy of 98.1% obtained in the experiments demonstrates the effectiveness of the proposed approach on Persian corpora.

A Statistical Part-of-Speech Tagger for Persian

This paper presents the statistical part-of-speech tagger HunPoS trained on a Persian corpus. The experimental results show that HunPoS achieves an overall accuracy of 96.9%, which is the best result reported for Persian part-of-speech tagging.

A Farsi part-of-speech tagger based on Markov model

Proceedings of the 2008 ACM symposium on Applied computing - SAC '08, 2008

This paper describes a method for Persian part-of-speech (POS) tagging based on morphological analysis of words. It is a main part of the process of expanding a large Persian corpus called Peykare (Textual Corpus of Persian Language). Peykare is divided into an annotated part and an unannotated part. We use the annotated part to build an automatic morphological analyzer, a main component of the system. The morphosyntactic features of Persian words cause two problems: they increase the number of tags in the corpus (586 tags), and they change the forms of words. The large number of tags prevents taggers from working efficiently. In addition, the variation in word forms reduces the frequency of words sharing the same lemma, as well as the number of words belonging to each tag, which also harms statistical taggers. By removing these problems, the morphological analyzer helps the tagger cover a large number of tags in the corpus. The method is evaluated on the corpus using a Markov tagger, and the experiments show its efficiency for Persian POS tagging.

Using Heuristic Rules to Improve Persian Part of Speech Tagging Accuracy

2008

Processing is determining a word's part of speech (POS) tag. In this research we focus on improving the accuracy of Persian part-of-speech tagging by applying post-processing heuristic rules. To evaluate the effect of these rules we use the Bijankhan tagged corpus; for tagging, we select the Maximum Likelihood Estimation (MLE) approach because of its simplicity and ease of implementation. We also study the effect of training set size on the accuracy of the MLE method. The experimental results show that the heuristic rules improve accuracy, especially for unknown words.
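Several of these abstracts report accuracy separately for known and unknown words. The following sketch shows one way such a breakdown might be computed; the function name and toy data are hypothetical, and "known" is assumed to mean "seen in the training vocabulary".

```python
def accuracy_by_vocabulary(gold, predicted, known_words):
    """Return tagging accuracy for known and unknown words separately.

    `gold` and `predicted` are parallel lists of (word, tag) pairs.
    A bucket with no words gets None instead of a ratio.
    """
    stats = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (word, gold_tag), (_, pred_tag) in zip(gold, predicted):
        bucket = "known" if word in known_words else "unknown"
        stats[bucket][1] += 1
        if gold_tag == pred_tag:
            stats[bucket][0] += 1
    return {
        name: (correct / total if total else None)
        for name, (correct, total) in stats.items()
    }

# Toy example with invented tags.
gold = [("a", "N"), ("b", "V"), ("c", "N"), ("d", "N")]
predicted = [("a", "N"), ("b", "N"), ("c", "N"), ("d", "V")]
result = accuracy_by_vocabulary(gold, predicted, known_words={"a", "b", "c"})
print(result)
```

Running the same function before and after a post-processing step gives a direct view of how much the step helps each group.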

Using morphological analyzer to statistical POS Tagging on Persian Text

With the growing number of textual resources available in digital form, the ability to understand and process them automatically has become critical. The first fundamental step is to identify the part of speech of each token or word in a sentence in order to disambiguate it. Part-of-speech (POS) tagging is one of the tools for understanding and processing natural language, and it is an infrastructural stage in some speech and text processing applications. Several methods have been proposed for POS tagging, each applied in taggers to achieve high performance and accuracy. Statistical methods have been among the primary techniques and have produced the most successful results in natural language processing in recent years; this success has carried over to other areas of natural language processing, where such methods are very popular. One of the most important issues in POS tagging systems is handling unknown words. In this paper, we use a morphological analyzer to identify unknown words: before tagging, words are checked morphologically and assigned an appropriate tag, which increases overall accuracy. We evaluate the proposed tagger using 5-fold cross-validation. The experimental results show that text pre-processing and the morphological analyzer are very effective in the proposed POS tagger.
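The 5-fold cross-validation mentioned in this abstract can be sketched as follows. The abstract does not describe how the folds were built, so the round-robin split, the function name, and the placeholder data are assumptions.

```python
def k_fold_splits(sentences, k=5):
    """Yield (train, test) pairs so that each sentence is tested exactly once."""
    # Round-robin assignment: sentence i goes to fold i % k (assumption).
    folds = [sentences[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Placeholder "sentences"; a real run would use tagged sentences.
data = list(range(10))
splits = list(k_fold_splits(data, k=5))
for train, test in splits:
    print(len(train), len(test))
```

A tagger would be trained on each `train` portion and scored on the matching `test` portion, with the five scores averaged into the reported accuracy.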

Investigation on a Feasible Corpus for Persian POS Tagging

2000

One of the fundamental tasks in natural language processing is creating a feasible corpus for evaluating the effectiveness of different algorithms. In this paper, the authors report the creation of a test corpus for automatic part-of-speech tagging based on the Persian tagged corpus of Prof. Bijankhan. The study includes preprocessing, statistical analysis, and experiments with simple statistical POS tagging.

Related papers