An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts

An Experimental Investigation of Part-Of-Speech Taggers for Vietnamese

Part-of-speech (POS) tagging plays an important role in Natural Language Processing (NLP). It is used in many other NLP tasks, such as named entity recognition, syntactic parsing, dependency parsing and text chunking. In this paper, we use the techniques of two widely used toolkits, ClearNLP and Stanford POS Tagger, to develop two new POS taggers for Vietnamese, and then compare them with three well-known Vietnamese taggers, namely JVnTagger, vnTagger and RDRPOSTagger. We make a systematic comparison to find the tagger with the best performance. We also design a new feature set to measure the performance of the statistical taggers. Our new taggers, built from Stanford Tagger and ClearNLP with the new feature set, outperform all other current Vietnamese taggers in terms of tagging accuracy. Moreover, we analyze the effect of some features on the performance of the statistical taggers. Lastly, the experimental results reveal that the transformation-based tagger, RDRPOSTagger, runs significantly faster than any of the statistical taggers.

Enriching the knowledge sources used in a maximum entropy part-of-speech tagger

2000

This paper presents results for a maximum-entropy-based part-of-speech tagger, which achieves superior performance principally by enriching the information sources used for tagging. In particular, we get improved results by incorporating these features: (i) more extensive treatment of capitalization for unknown words; (ii) features for the disambiguation of the tense forms of verbs; (iii) features for disambiguating particles from prepositions and adverbs. The best resulting accuracy for the tagger on the Penn Treebank is 96.86% overall, and 86.91% on previously unseen words.
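As a rough illustration of point (i), the sketch below shows how capitalization cues for unknown words might be turned into features for a maximum entropy classifier. The function name, the feature names and the `known_vocab` argument are all hypothetical; they are not the feature templates used in the paper.

```python
def extract_features(words, i, known_vocab):
    """Build a feature dict for the word at position i.

    Hypothetical feature names; a sketch of the idea, not the paper's templates.
    """
    word = words[i]
    feats = {
        "word": word.lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
    }
    if word.lower() not in known_vocab:
        # Extra cues are only added for words not seen in training.
        feats["unk"] = True
        feats["suffix3"] = word[-3:].lower()
        feats["is_capitalized"] = word[:1].isupper()
        feats["all_caps"] = word.isupper()
        # A capital letter away from the sentence start is a stronger
        # proper-noun signal than one at position 0.
        feats["cap_not_sentence_initial"] = word[:1].isupper() and i > 0
    return feats


sentence = ["The", "Zorblax", "company", "grew"]
vocab = {"the", "company", "grew"}
features = extract_features(sentence, 1, vocab)
```

Feature dicts of this kind can be fed to any log-linear classifier; restricting the capitalization features to unknown words keeps them from competing with the word-identity feature for known words.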

Improving Vietnamese POS tagging by integrating a rich feature set and Support Vector Machines

2008

Word segmentation and POS tagging are two important problems in many NLP tasks. They have not, however, drawn much attention from Vietnamese researchers worldwide. In this paper, we focus on integrating the advantages of several resources to improve the accuracy of Vietnamese word segmentation and POS tagging. For word segmentation, we propose a solution that uses multiple knowledge resources, including a dictionary-based model, an N-gram model, and a named entity recognition model, and integrates them into a Maximum Entropy model. Experiments on a public corpus show its effectiveness in comparison with the best current models: we obtained a 95.30% F1 measure. For POS tagging, motivated by Chinese research and Vietnamese characteristics, we present a new kind of feature based on the idea of word composition, which we call morpheme-based features. Our experiments on two POS-tagged corpora showed that morpheme-based features always give promising results. In the best case, we obtained 89.64% precision on a Vietnamese POS-tagged corpus using a Maximum Entropy model. Key Words: word segmentation, Natural language processing (NLP), dictionary-based model, Named Entity model, N-gram model, morpheme-based feature, word-based feature, POS tagging

A Two-Stage Approach to Chinese Part-of-Speech Tagging

2008

This paper describes a Chinese part-of-speech tagging system based on the maximum entropy model. It presents a novel two-stage approach to using the part-of-speech tags of the words on both sides of the current word in Chinese part-of-speech tagging. The system is evaluated on four corpora at the Fourth SIGHAN Bakeoff in the closed track of the Chinese part-of-speech tagging task.
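One way to read the two-stage idea is sketched below: a first pass assigns provisional tags, and a second pass builds features that include the provisional tags of both the left and the right neighbour. The first stage here is a simple most-frequent-tag lookup chosen only to keep the example short; the abstract does not say what the first stage of the actual system is, and all function names and the default tag are assumptions.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Map each word to its most frequent tag (stand-in first-stage model)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def first_stage(words, lexicon, default_tag="NN"):
    """Assign a provisional tag to every word; default_tag is an assumption."""
    return [lexicon.get(word, default_tag) for word in words]


def second_stage_features(words, provisional_tags, i):
    """Features for the second-stage classifier at position i.

    Unlike left-to-right decoding, the tag to the RIGHT of the current
    word is available here because the first pass already produced it.
    """
    left = provisional_tags[i - 1] if i > 0 else "<s>"
    right = provisional_tags[i + 1] if i < len(words) - 1 else "</s>"
    return {
        "word": words[i],
        "left_tag": left,
        "right_tag": right,
        "own_provisional_tag": provisional_tags[i],
    }


# Toy data with romanized tokens; tags are illustrative only.
training = [[("wo", "PN"), ("ai", "VV"), ("Beijing", "NR")]]
lexicon = train_unigram_tagger(training)
words = ["wo", "ai", "Beijing"]
tags = first_stage(words, lexicon)
feats = second_stage_features(words, tags, 1)
```

The second-stage feature dicts would then be passed to a maximum entropy classifier that outputs the final tags.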

A Semi-supervised Learning Method for Vietnamese Part-of-Speech Tagging

2010

This paper presents a semi-supervised learning method for Vietnamese part-of-speech tagging. We consider two powerful tagging models, Conditional Random Fields (CRFs) and Guided Online-Learning models (GLs), as base learning models. We then propose a semi-supervised tagging model for both the CRF and GL methods. The main idea is to use a word-cluster model as an additional source to enrich the feature space of discriminative learning models in both the training and decoding processes. Experimental results on Vietnamese Treebank data (VTB) show that the proposed method is effective. Our best model achieved an accuracy of 94.10% when tested on VTB, and 92.60% on an independent test.
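A minimal sketch of how word-cluster information could enrich a tagger's feature space follows. It assumes clusters are given as bit strings learned from unlabeled text, in the style of Brown clustering; the abstract does not specify the clustering method, and the `word2cluster` mapping, the function name and the prefix lengths are hypothetical.

```python
def cluster_features(words, i, word2cluster, prefix_lengths=(2, 4)):
    """Add cluster-based features for the word at position i.

    word2cluster maps a word to a bit-string cluster id (an assumption).
    Prefixes of the bit string give coarser clusters, so rare words can
    share features with frequent words from the same region of the
    cluster hierarchy.
    """
    feats = {"word": words[i]}
    cluster = word2cluster.get(words[i])
    if cluster is None:
        feats["cluster"] = "<none>"
        return feats
    feats["cluster"] = cluster
    for n in prefix_lengths:
        feats[f"cluster_prefix{n}"] = cluster[:n]
    return feats


# Toy cluster ids; real ids would come from a model trained on raw text.
word2cluster = {"sinh_vien": "110100", "giao_vien": "110101"}
a = cluster_features(["sinh_vien"], 0, word2cluster)
b = cluster_features(["giao_vien"], 0, word2cluster)
```

In this toy example the two words fall into different fine-grained clusters but share the same 4-bit prefix, so a model that has seen only one of them in labeled data can still generalize to the other.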

POS-Tagging Malay Corpus: A Novel Approach Based on Maximum Entropy

International Journal of Engineering & Technology, 2018

The Malay language is written in both Jawi and Roman scripts. In the past, Jawi writing was widely used by the Malay community and foreigners, as can be seen in old documents. Old documents face the risk of background damage. To preserve this valuable information, there is a significant need to automate the processing of Jawi materials. According to previous literature, POS tagging is known as the first phase in automated text analysis, and the development of language technologies can barely begin without this phase. We highlight the existing POS tagging approaches and suggest the development of Malay Jawi POS tagging using an extended ME-based approach on the NUWT Corpus. Results show that the proposed model yielded higher accuracy than the state-of-the-art model.

Probabilistic Part Of Speech Tagging for Bahasa Indonesia

2009

In this paper we report our work in developing part-of-speech tagging for Bahasa Indonesia using probabilistic approaches. We use Conditional Random Fields (CRF) and Maximum Entropy methods to assign a tag to a word. We use two tagsets containing 37 and 25 part-of-speech tags for Bahasa Indonesia. In this work we compare both methods using two different corpora. The results of the experiments show that the Maximum Entropy method gives the best result.

A comparative study on different techniques for Thai part-of-speech tagging

2013 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2013

Natural language processing (NLP) for the Thai language is rather complicated to apply in real tasks because Thai sentences have a complex sequential structure. POS tagging can improve the accuracy of syntactic analysis, and so can support the improvement of many NLP tasks. We present supervised machine learning suitable for annotating POS types in Thai, comparing the Support Vector Machine (SVM) with Conditional Random Fields (CRFs). The BEST 2012 News and Entertainments corpus is used in our experiments. The sequential characteristic of the Thai language is the point of interest, and we use it as a feature in the training set. Our sequential features comprise forward 3-gram, backward 3-gram and 5-gram. The best accuracy in our experiments is 93.638%, from SVM POS tagging trained on words with the forward 3-gram, when the size of the training data is ten thousand tokens. Moreover, with the same training data, the best accuracy of CRFs is very close to that of the SVM, at 93.254%, when the learning form is the word with POS of the 5-gram.
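The three sequential feature types can be sketched as context windows around the current token. The reading used here is an assumption, since the abstract does not define the terms: "forward 3-gram" is taken as the current token plus the next two, "backward 3-gram" as the current token plus the previous two, and "5-gram" as two tokens on each side. The function names and the padding symbol are hypothetical.

```python
def window(tokens, i, left, right, pad="<pad>"):
    """Return tokens[i-left .. i+right], padding past the sentence edges."""
    return [
        tokens[j] if 0 <= j < len(tokens) else pad
        for j in range(i - left, i + right + 1)
    ]


def sequential_features(tokens, i):
    """Context windows for position i under the assumed definitions."""
    return {
        "forward_3gram": window(tokens, i, 0, 2),   # current + next two
        "backward_3gram": window(tokens, i, 2, 0),  # previous two + current
        "5gram": window(tokens, i, 2, 2),           # two on each side
    }


tokens = ["a", "b", "c", "d", "e"]
feats = sequential_features(tokens, 2)
```

Each window could be supplied to an SVM as sparse indicator features or to a CRF as observation features, optionally paired with the POS tags of the context tokens where those are available.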

A Hybrid Approach for Part-of-Speech Tagging of Burmese Texts

2011 International Conference on Computer and Management (CAMAN), 2011

In a Myanmar-to-English translation system, producing a meaningful sentence in one language from another is a non-trivial task. POS tagging is used as an early stage of linguistic text analysis in many applications. POS tagging is the process of assigning the correct syntactic category to each word. Tagsets and word disambiguation rules are fundamental parts of any POS tagger. This paper presents a new approach for POS tagging of the Myanmar language. First, the user inputs a simple Myanmar sentence, which is segmented into words using segmentation rules. These words are then assigned to appropriate syntactic categories of the Myanmar language using rule-based and probabilistic approaches. The system applies the CRF method to resolve POS ambiguities on words. CRF is a framework for building discriminative probabilistic models for segmenting and labeling sequential data. The tagsets for Myanmar POS, the segmentation rules, the tagging algorithm and the CRF method are designed. The proposed approach uses the UCSM Lexicon. This hybrid approach to POS tagging can therefore give optimal accuracy and robustness for a machine translation system.

Vietnamese treebank construction and entropy-based error detection

Treebanks, especially the Penn treebank for natural language processing (NLP) in English, play an essential role in both research into and the application of NLP. However, many languages still lack treebanks and building a treebank can be very complicated and difficult. This work has a twofold objective. Firstly, to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation including word segmentation, part-of-speech tagging, and syntactic analysis. Major steps in the treebank construction process are described with particular regard to specific Vietnamese properties such as lack of word delimiter and isolation. Those properties make sentences highly syntactically ambiguous, and therefore it is difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were employed not only to define annotations but also to systematically deal with ambiguities. Annotators were supported by automatic labelling tools, which are based on statistical machine learning methods, for sentence pre-processing and a tree editor for supporting manual annotation. As a result, an annotation agreement of around 90 % was achieved. Our second objective is to present our method for automatically finding errors and inconsistencies in