Using morphological analyzer to statistical POS Tagging on Persian Text

Evaluation of statistical part of speech tagging of persian text

2007 9th International Symposium on Signal Processing and Its Applications, 2007

One of the fundamental tasks in natural language processing is part-of-speech (POS) tagging. A POS tagger is a piece of software that reads text in some language and assigns a part-of-speech tag to each word. Our main interest in this research was to see how easily methods developed for a language such as English can be applied to a new and different language such as Persian, and how well such approaches perform. This paper presents an evaluation of several POS tagging methods on Persian text: a statistical tagging method, a memory-based tagging approach, and two versions of Maximum Likelihood Estimation (MLE) tagging. The two MLE versions differ in the way they handle unknown words. We also demonstrate the value of simple heuristics and post-processing in improving the accuracy of these methods. The experiments were conducted on a manually POS-tagged Persian corpus of over two million tagged words. The results are encouraging and comparable with those reported for other languages such as English, German, or Spanish.
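
As a rough illustration of MLE tagging as described in the abstract above, the sketch below assigns each known word the tag it carried most often in training and gives unknown words a fallback tag. The fallback used here (the most frequent tag overall) is an assumption and is only one possible way to handle unknown words; the abstract does not say which strategies the two MLE versions use. The function names, the toy English tokens, and the tag labels are hypothetical stand-ins, not taken from the Persian corpus.

```python
from collections import Counter, defaultdict


def train_mle(tagged_sentences):
    """Return (most frequent tag per word, most frequent tag overall)."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    # MLE choice: the tag that maximizes the relative frequency P(tag | word).
    best_tag = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    # Assumed unknown-word strategy: fall back to the globally most frequent tag.
    default_tag = tag_counts.most_common(1)[0][0]
    return best_tag, default_tag


def tag_mle(words, best_tag, default_tag):
    """Tag each word independently; unknown words receive default_tag."""
    return [(w, best_tag.get(w, default_tag)) for w in words]


# Toy training data (hypothetical; a real run would use the tagged corpus).
toy_corpus = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("the", "DET"), ("cat", "N"), ("sees", "V"), ("the", "DET"), ("dog", "N")],
    [("dogs", "N"), ("run", "V")],
]

best_tag, default_tag = train_mle(toy_corpus)
tagged = tag_mle(["the", "bird", "runs"], best_tag, default_tag)
print(tagged)
```

Because each word is tagged on its own, this baseline ignores context entirely, which is why the choice of unknown-word handling has such a visible effect on its accuracy.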

A Farsi part-of-speech tagger based on Markov model

Proceedings of the 2008 ACM symposium on Applied computing - SAC '08, 2008

This paper describes a method based on morphological analysis of words for a Persian part-of-speech (POS) tagging system. The work is a main part of the process of expanding a large Persian corpus called Peykare (the Textual Corpus of Persian Language). Peykare is arranged into two parts: annotated and unannotated. We use the annotated part to build an automatic morphological analyzer, a main component of the system. The morphosyntactic features of Persian words cause two problems: they increase the number of tags in the corpus (586 tags), and they change the form of the words. The high number of tags prevents any tagger from working efficiently. In addition, the change in word forms reduces the frequency of words with the same lemma, and the number of words belonging to a specific tag decreases as well. This also harms statistical taggers. By removing these problems, the morphological analyzer helps the tagger cover a large number of tags in the corpus. The method is evaluated on the corpus using a Markov tagger. The experiments show the efficiency of the method for Persian POS tagging.

Evaluation of part of speech tagging on Persian text

2007

One of the fundamental tasks in natural language processing is part-of-speech (POS) tagging. A POS tagger is a piece of software that reads text in some language and assigns a part-of-speech tag to each word. Our main interest in this research was to see how easily methods developed for a language such as English can be applied to a new and different language such as Persian, and how well such approaches perform. This paper presents an evaluation of several POS tagging methods on Persian text: a statistical tagging method, a memory-based tagging approach, and two versions of Maximum Likelihood Estimation (MLE) tagging. The two MLE versions differ in the way they handle unknown words. We also demonstrate the value of simple heuristics and post-processing in improving the accuracy of these methods. The experiments were conducted on a manually POS-tagged Persian corpus of over two million tagged words. The results are encouraging and comparable with those reported for other languages such as English, German, or Spanish.

An Accurate Persian Part-of-Speech Tagger

Computer Systems Science and Engineering, 2020

The processing of any natural language requires that the grammatical properties of every word in that language be tagged by a part-of-speech (POS) tagger. To provide a more accurate POS tagger for the Persian language, we propose an improved tagger called IAoM that supports properties of text-to-speech systems such as Lexical Stress Search, Homograph Word Disambiguation, and Break Phrase Detection, as well as the main aspects of Persian morphology. IAoM uses Maximum Likelihood Estimation (MLE) to determine the tags of unknown words. In addition, it uses a few defined rules to achieve high accuracy. For tagging the input corpus, IAoM uses a Hidden Markov Model (HMM) alongside the Viterbi algorithm. To present a fair evaluation, we have performed various experiments on both homogeneous and heterogeneous Persian corpora and studied the effect of the size of the training set on the accuracy of IAoM. Experimental results demonstrate the merit of the proposed tagger, which achieves an overall accuracy of 97.6%.
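
To make the HMM-plus-Viterbi step mentioned in the abstract above concrete, here is a minimal, generic Viterbi decoder for a bigram HMM tagger. It is a sketch under stated assumptions, not IAoM itself: the probability tables are tiny hand-written toy values, the tokens and tag labels are hypothetical, and unknown words simply receive a small constant emission probability (`unk_p`) instead of the MLE-based handling and rules the abstract describes.

```python
import math


def log_p(p):
    """Log of a probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words, tags, start_p, trans_p, emit_p, unk_p=1e-6):
    """Return the most probable tag sequence for words under a bigram HMM."""
    if not words:
        return []
    # scores[i][t]: best log-probability of any tag path ending in tag t at position i.
    scores = [{t: log_p(start_p.get(t, 0)) + log_p(emit_p[t].get(words[0], unk_p))
               for t in tags}]
    backpointers = []
    for word in words[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for t in tags:
            # Pick the previous tag that maximizes path score plus transition score.
            best_prev = max(tags, key=lambda s: prev[s] + log_p(trans_p[s].get(t, 0)))
            current[t] = (prev[best_prev]
                          + log_p(trans_p[best_prev].get(t, 0))
                          + log_p(emit_p[t].get(word, unk_p)))
            pointers[t] = best_prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: scores[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model parameters (hypothetical; normally estimated from a tagged corpus).
tags = ["DET", "N", "V"]
start_p = {"DET": 0.8, "N": 0.1, "V": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "N": 0.9, "V": 0.05},
    "N": {"DET": 0.1, "N": 0.3, "V": 0.6},
    "V": {"DET": 0.5, "N": 0.4, "V": 0.1},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "N": {"dog": 0.5, "cat": 0.4, "walk": 0.1},
    "V": {"walk": 0.6, "runs": 0.4},
}

best_path = viterbi(["the", "dog", "walk"], tags, start_p, trans_p, emit_p)
print(best_path)
```

Working in log space avoids numerical underflow on long sentences. In the toy run, "walk" can be a noun or a verb, and the transition probabilities from the preceding noun decide between the two readings.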

Statistical POS tagging experiments on Persian text

Computational Approaches to Arabic …, 2007

Part-of-speech (POS) tagging is the process of marking up the words in a text with their corresponding parts of speech. It is an essential part of text and natural language processing. There are many models and software tools for POS tagging in English and other ...

A Statistical Part-of-Speech Tagger for Persian

This paper presents the statistical part-of-speech tagger HunPoS trained on a Persian corpus. The results of the experiments show that HunPoS provides an overall accuracy of 96.9%, which is the best result reported for Persian part-of-speech tagging.

A new morphological lexicon and a POS tagger for the Persian Language

2011

In (Sagot and Walther, 2010), the authors introduce an advanced tokenizer and a morphological lexicon for the Persian language named PerLex. In this paper, we describe experiments dedicated to enriching this lexicon and using it for building a POS tagger for Persian. Natural Language Processing (NLP) tasks such as part-of-speech (POS) tagging or parsing as well as most NLP applications require large-scale lexical resources. Yet, such resources rarely are freely available, even though it is the fastest way to building high- ...

Two Run Morphological Analysis for POS Tagging of Untagged Words

Morphological analysis is the process of inspecting the structure of a word with respect to its linguistic rules and semantics. It is also used to identify the part-of-speech tag of a given word. The analysis becomes especially tedious when the language is tightly structured and richly packed, as Tamil is, which makes the analysis procedure a demanding endeavor. The analyzer tools currently in practice mostly rely on a set of rules. These rules, however, are not exhaustive and hence do not cover all tagging conditions. This work addresses one such condition that was lacking in the Analyzer tool for identifying Named Entities in Tamil Biomedicine, our previous work. The main objective here is to tag words that the Analyzer omitted because no rule applied to them completely. Tagging these words involves identifying related words based on region matching, running the Analyzer on the retrieved set of words, and scoring the results by similarity diameter. The 2-run Analyzer gave an accuracy of 85.9% on POS tagging of the untagged words.

Investigation on a Feasible Corpus for Persian POS Tagging

2000

One of the fundamental tasks in natural language processing is creating a feasible corpus for evaluating the effectiveness of different algorithms. In this paper, the authors report the creation of a test corpus for automatic part-of-speech tagging, based on the Persian tagged corpus of Prof. Bijankhan. This study includes preprocessing, statistical analysis, and experiments with simple statistical POS tagging.