Combining Proper Name-Coreference with Conditional Random Fields for Semi-supervised Named Entity Recognition in Vietnamese Text

Semi-supervised Learning for Vietnamese Named Entity Recognition using Online Conditional Random Fields

Proceedings of the Fifth Named Entity Workshop, 2015

We present preliminary results for the named entity recognition problem in the Vietnamese language. For this task, we build a system based on conditional random fields and address one of its challenges: how to combine labeled and unlabeled data to create a stronger system. We propose a set of features that is useful for the task and conduct experiments with different settings to show that bootstrapping with an online learning algorithm, the Margin Infused Relaxed Algorithm, increases the performance of the models.

Named Entity Recognition for Vietnamese documents using semi-supervised learning method of CRFs with Generalized Expectation Criteria

Named Entity Recognition (NER) is an important task in many natural language processing applications, and much previous work on NER has been done for other languages such as English, Japanese, and Chinese. However, Vietnamese NER is still relatively new and challenging because of the characteristics of Vietnamese and the lack of a large annotated corpus. This paper presents a new approach to Vietnamese NER: a semi-supervised training method for conditional random field (CRF) models that uses generalized expectation criteria to express a preference for parameter settings. We perform several experiments with different feature settings and different training data to show the high performance of this method and compare it with another method.

Named entity recognition in Vietnamese documents

2007

Named Entity Recognition (NER) aims to classify words in a document into pre-defined target entity classes and is now considered fundamental for many natural language processing tasks such as information retrieval, machine translation, information extraction and question answering. This paper presents the results of an experiment in which a Support Vector Machine (SVM) based NER model is applied to the Vietnamese language. Though this state-of-the-art machine learning method has been widely applied to NER in several well-studied languages, this is the first time it has been applied to Vietnamese. In a comparison against Conditional Random Fields (CRFs), the SVM model was shown to outperform the CRF model by optimizing its feature window size, obtaining an overall F-score of 87.75. The paper also presents a detailed discussion of the characteristics of the Vietnamese language and provides an analysis of the factors which influence performance in this task.
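To illustrate what a feature window means in this setting, here is a minimal sketch. It is not the paper's implementation: the function name `window_features`, the padding token, and the two feature types are assumptions made for this example. It collects features from every token within a fixed distance of the target token, so widening `window` adds more context, which is the quantity the abstract says was tuned.

```python
def window_features(tokens, index, window=2):
    """Return features for tokens[index] drawn from a symmetric context window.

    Hypothetical helper for illustration; real NER systems typically add
    many more feature types (affixes, gazetteers, POS tags, and so on).
    """
    features = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            word = tokens[position]
            features[f"word[{offset}]"] = word.lower()
            features[f"is_capitalized[{offset}]"] = word[:1].isupper()
        else:
            # Positions outside the sentence get an assumed padding token.
            features[f"word[{offset}]"] = "<pad>"
            features[f"is_capitalized[{offset}]"] = False
    return features


# Example sentence (already split into tokens for simplicity).
tokens = ["Ông", "Nguyễn", "Văn", "An", "sống", "ở", "Hà", "Nội"]
feats = window_features(tokens, index=1, window=1)
edge_feats = window_features(tokens, index=0, window=1)
```

With `window=1` each token yields six features (two per position); the resulting dictionaries can be passed to any classifier or sequence model that accepts sparse feature maps.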

"Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics"

This paper presents research on Chinese named entity recognition (CNER), covering person, organization, and location names. Unlike conventional approaches, which mostly describe the algorithms used and say little about the CNER problem itself, this paper first studies the characteristics of Chinese and discusses different feature sets; it then reports a promising comparison result obtained with optimized features and a concise model. It also analyzes the differing performance of the features and algorithms used by other researchers. To facilitate further research, the paper provides formal definitions of the issues in CNER along with potential solutions. Following the SIGHAN bakeoffs, the experiments are performed in the closed track, but the problems of the open-track tasks are also discussed.

Iterative Named Entity Recognition with Conditional Random Fields

Applied Sciences

Named entity recognition (NER) constitutes an important step in the processing of unstructured text content for the extraction of information as well as for the computer-supported analysis of large amounts of digital data via machine learning methods. However, NER often relies on domain-specific knowledge, being conducted manually in a time- and human-resource-intensive process. These can be reduced with statistical models performing NER automatically. The current work investigates whether Conditional Random Fields (CRF) can be efficiently trained for NER in German texts, by means of an iterative procedure combining self-learning with a manual annotation–active learning–component. The training dataset increases continuously with the iterative procedure. Whilst self-learning did not markedly improve the performance of the CRF for NER, the manual annotation of sentences with the lowest probability of correct prediction clearly improved the model F1-score and simultaneously reduced the...
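The active learning step described above, sending the sentences with the lowest probability of correct prediction to a human annotator, can be sketched as follows. This is a minimal illustration under assumptions: the function name `select_for_annotation`, the `budget` parameter, and the confidence values are hypothetical, and the model that would produce the confidences is not shown.

```python
def select_for_annotation(sentence_confidences, budget):
    """Pick the `budget` sentences the model is least confident about.

    sentence_confidences maps a sentence id to the model's estimated
    probability that its predicted label sequence is correct.
    """
    ranked = sorted(sentence_confidences.items(), key=lambda item: item[1])
    return [sentence_id for sentence_id, _ in ranked[:budget]]


# Made-up confidences for four unlabeled sentences.
confidences = {"s1": 0.97, "s2": 0.41, "s3": 0.88, "s4": 0.12}
to_annotate = select_for_annotation(confidences, budget=2)
```

In an iterative setup, the selected sentences would be labeled manually, added to the training set, and the model retrained before the next round.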

An approach for named entity recognition in poorly structured data

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012

This paper describes an approach for the task of named entity recognition in structured data containing free text as the values of its elements. We studied the recognition of the entity types of person, location and organization in bibliographic data sets from a concrete wide digital library initiative. Our approach is based on conditional random fields models, using features designed to perform named entity recognition in the absence of strong lexical evidence, and exploiting the semantic context given by the data structure. The evaluation results support that, with the specialized features, named entity recognition can be done in free text within structured data with an acceptable accuracy. Our approach was able to achieve a maximum precision of 0.91 at 0.55 recall and a maximum recall of 0.82 at 0.77 precision. The achieved results were always higher than those obtained with Stanford Named Entity Recognizer, which was developed for grammatically well-formed text. We believe this level of quality in named entity recognition allows the use of this approach to support a wide range of information extraction applications in structured data.
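To compare the two operating points quoted above on a single scale, one can compute the balanced F1 score, the harmonic mean of precision and recall. The abstract does not report F1 itself, so the values computed below are derived here for illustration only, and the function name `f1_score` is an assumption of this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Operating points quoted in the abstract.
high_precision_f1 = f1_score(0.91, 0.55)  # about 0.69
high_recall_f1 = f1_score(0.77, 0.82)     # about 0.79
```

Under this measure the high-recall setting scores higher, because the harmonic mean penalizes the large gap between 0.91 precision and 0.55 recall.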

Named Entity Recognition for Vietnamese

Named Entity Recognition is an important task but is still relatively new for Vietnamese, partly because of the lack of a large annotated corpus. In this paper, we present a systematic approach to building a named entity annotated corpus while at the same time building rules to recognize Vietnamese named entities. The resulting open source system achieves an F-measure of 83%, which is better than existing Vietnamese NER systems.

Improving the scalability of semi-Markov conditional random fields for named entity recognition

Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL - ACL '06, 2006

This paper presents techniques to apply semi-CRFs to Named Entity Recognition tasks with a tractable computational cost. Our framework can handle an NER task that has long named entities and many labels which increase the computational cost. To reduce the computational cost, we propose two techniques: the first is the use of feature forests, which enables us to pack feature-equivalent states, and the second is the introduction of a filtering process which significantly reduces the number of candidate states. This framework allows us to use a rich set of features extracted from the chunk-based representation that can capture informative characteristics of entities. We also introduce a simple trick to transfer information about distant entities by embedding label information into nonentity labels. Experimental results show that our model achieves an F-score of 71.48% on the JNLPBA 2004 shared task without using any external resources or post-processing techniques.

Chinese named entity recognition with conditional probabilistic models

This paper describes the work on Chinese named entity recognition performed by the Yahoo team at the third International Chinese Language Processing Bakeoff. We used two conditional probabilistic models for this task: conditional random fields (CRFs) and maximum entropy models. In particular, we trained two conditional random field recognizers and one maximum entropy recognizer for identifying names of people, places, and organizations in unsegmented Chinese texts. Our best performance is 86.2% F-score on the MSRA dataset and 88.53% on the CITYU dataset.

NAMED ENTITY IDENTIFICATION AND CLASSIFICATION USING MACHINE LEARNING TECHNIQUES

In this paper, we discuss named entity identification and classification using machine learning techniques for the Telugu language. Identifying named entities is also known as Named Entity Recognition (NER). NER is a task in which entities such as proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, place, date, time, and miscellaneous names. We use a hybrid approach, combining a rule-based approach with a machine learning technique (the conditional random fields algorithm), to develop a named entity recognizer. The identification and classification of entities often involve ambiguities. To resolve them, we have to choose the most appropriate tag from the valid tags available for the entities. In our system we try to improve the accuracy using CRFs. A purely rule-based approach can process text very quickly using pre-defined rules but cannot resolve ambiguity, while a purely machine learning approach relies on annotated training data, which is difficult to maintain. So in our proposed system we use a hybrid approach to improve the accuracy of the system. Initially input is given in the form of paragraphs which are mana