Named Entity Recognition in Low-Resource Languages Using Cross-Lingual Distributional Word Representations
Related papers
Improving Low Resource Named Entity Recognition using Cross-lingual Knowledge Transfer
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018
Neural networks have been widely used for named entity recognition (NER) in high-resource languages (e.g. English) and have shown state-of-the-art results. However, for lower-resource languages such as Dutch and Spanish, NER models tend to perform worse because of limited resources and a lack of annotated data. To narrow this gap, we investigate cross-lingual knowledge to enrich the semantic representations of low-resource languages. We first develop neural networks that improve low-resource word representations via knowledge transfer from a high-resource language using bilingual lexicons. Further, a lexicon extension strategy is designed to address the out-of-lexicon problem by automatically learning semantic projections. Finally, we regard word-level entity type distribution features as external, language-independent knowledge and incorporate them into our neural architecture. Experiments on two low-resource languages (Dutch and Spanish) demonstrate the effectiveness of these additional semantic representations (an average 4.8% improvement). Moreover, on the Chinese OntoNotes 4.0 dataset, our approach achieves an F-score of 83.07%, a 2.91% absolute gain over state-of-the-art systems.
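One common way to realize the kind of lexicon-based transfer this abstract describes is to learn a linear projection from the low-resource embedding space into the high-resource one by least squares over translation pairs. The sketch below is a toy illustration of that idea, not the paper's actual model; all data, dimensions, and names are invented.

```python
import numpy as np

# Toy embedding spaces: a 4-dim "low-resource" source space and a
# "high-resource" target space related by a hidden linear map.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 4))     # ground-truth mapping (unknown in practice)
src = rng.normal(size=(50, 4))       # source-language word vectors
tgt = src @ W_true                   # aligned target-language vectors

# Learn the projection from the bilingual lexicon (the 50 translation
# pairs) by ordinary least squares: minimize ||src @ W - tgt||^2.
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# An unseen source word can now be projected into the target space,
# which is the spirit of the out-of-lexicon extension described above.
new_word = rng.normal(size=(4,))
projected = new_word @ W
```

Because the toy target vectors are an exact linear image of the source vectors, the recovered projection reproduces the hidden map; with real embeddings the fit is only approximate.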
Named Entity Recognition Modeling for the Thai Language from a Disjointedly Labeled Corpus
2018
In the Thai language, a named entity can appear with or without a prefix or an indicative word, which may cause confusion between named entities and other types of noun. However, a named entity is likely to occur adjacent to verbs or prepositions, so the verbs or prepositions adjacent to a noun can serve as a good feature for determining the type of named entity. Studies on the named entity recognition (NER) task in other languages, such as Indonesian, have shown that combining word embeddings with part-of-speech (POS) tags can improve the performance of the NER model. In this paper, we investigate the Thai NER task using a Bi-LSTM model with word embeddings and POS embeddings to deal with a relatively small and disjointedly labeled corpus. We compare our model with the same model without POS tags and with a baseline CRF model using a similar set of features. The experimental results show that our proposed model outperforms the other two on all F1-score measures; in particular, for the location entity the F1-score increases by 14 percent.
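The architecture this abstract describes, a Bi-LSTM fed with concatenated word and POS embeddings, can be sketched roughly as follows. This is a minimal illustration in PyTorch; all sizes are placeholders, not the paper's hyperparameters, and the class and argument names are invented.

```python
import torch
import torch.nn as nn

class BiLstmPosTagger(nn.Module):
    """Bi-LSTM NER tagger over concatenated word and POS embeddings
    (a sketch of the general architecture, not the paper's exact model)."""

    def __init__(self, n_words, n_pos, n_tags, w_dim=32, p_dim=8, hidden=16):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.pos_emb = nn.Embedding(n_pos, p_dim)
        self.lstm = nn.LSTM(w_dim + p_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, words, pos):
        # Concatenate the two embedding views per token, then tag.
        x = torch.cat([self.word_emb(words), self.pos_emb(pos)], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)           # per-token tag scores

model = BiLstmPosTagger(n_words=100, n_pos=20, n_tags=9)
words = torch.randint(0, 100, (2, 7))   # batch of 2 sentences, 7 tokens each
pos = torch.randint(0, 20, (2, 7))
scores = model(words, pos)              # shape: (2, 7, 9)
```

In practice the scores would feed a softmax or CRF layer and be trained on the labeled corpus.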
NaijaNER : Comprehensive Named Entity Recognition for 5 Nigerian Languages
ArXiv, 2021
Most common applications of Named Entity Recognition (NER) target English and other widely available languages. In this work, we present our findings on NER for 5 Nigerian languages (Nigerian English, Nigerian Pidgin English, Igbo, Yoruba and Hausa). These languages are considered low-resourced, and very little openly available Natural Language Processing work has been done on most of them. Individual NER models were trained and metrics recorded for each of the languages. We also built a combined model that can handle NER for any of the five languages. The combined model works well on each of the languages, with better performance than the individual NER models trained specifically on annotated data for each language. The aim of this work is to share our learning on how information extraction using NER can be optimized for the listed Nigerian languages.
Named Entity Recognition for a Low Resource Language
International Journal of Recent Technology and Engineering (IJRTE), 2019
Kokborok named entity recognition using a rule-based approach is studied in this paper. Named entity recognition (NER) is an application of natural language processing and is considered a subtask of information extraction; it identifies named entities, such as the names of persons, organizations and locations, for some specific task. We study an NER system for the Kokborok language. Kokborok is the official language of the state of Tripura, in the north-eastern part of India, and is also widely spoken in other parts of north-eastern India and in adjoining areas of Bangladesh. NER is studied using a machine learning approach, a rule-based approach, or a hybrid approach combining the two. Rule-based NER is influenced by the linguistic knowledge of the language. The machine learning approach r...
16th Conference on Computer Science and Intelligence Systems, 2021
Neural Network (NN) models produce state-of-the-art results for natural language processing tasks. Further, NN models are used for sequence tagging tasks on low-resourced languages with good results. However, the findings are not consistent for all low-resourced languages, and many of these languages have not been sufficiently evaluated. Therefore, in this paper, transformer NN models are used to evaluate named-entity recognition for ten low-resourced South African languages. Further, these transformer models are compared to other NN models and a Conditional Random Fields (CRF) Machine Learning (ML) model. The findings show that the transformer models have the highest F-scores with more than a 5% performance difference from the other models. However, the CRF ML model has the highest average F-score. The transformer model’s greater parallelization allows low-resourced languages to be trained and tested with less effort and resource costs. This makes transformer models viable for low-resourced languages. Future research could improve upon these findings by implementing a linear-complexity recurrent transformer variant.
Using Empirically Constructed Lexical Resources for Named Entity Recognition
Biomedical Informatics Insights, 2013
Because of privacy concerns and the expense involved in creating an annotated corpus, existing small annotated corpora may not have sufficient examples for learning to statistically extract all named entities precisely. In this work, we evaluate the value of automatically generated features based on distributional semantics for machine-learning named entity recognition (NER). The features we generated and experimented with include n-nearest words, support vector machine (SVM) regions, and term clustering, all of which are distributional semantic features. Adding the n-nearest words feature increased F-score more than adding a manually constructed lexicon to a baseline system. Although the need for relatively small annotated corpora for retraining is not obviated, lexicons empirically derived from unannotated text can not only supplement manually created lexicons, but also replace them. This phenomenon is observed in ...
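The n-nearest-words feature mentioned above can be illustrated as follows: for each token, look up its nearest neighbors in an embedding space learned from unannotated text and emit them as categorical features for the NER classifier. The vocabulary, vectors, and helper name below are invented for illustration.

```python
import numpy as np

# Toy word vectors standing in for distributional representations
# learned from a large unannotated corpus.
vocab = ["aspirin", "ibuprofen", "tablet", "fever", "patient", "hospital"]
rng = np.random.default_rng(1)
vecs = rng.normal(size=(len(vocab), 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit length

def n_nearest_words(word, n=2):
    """Return the n distributionally nearest words as NER features
    (hypothetical helper sketching the 'n-nearest words' feature)."""
    i = vocab.index(word)
    sims = vecs @ vecs[i]              # cosine similarity (unit-norm vectors)
    order = np.argsort(-sims)          # most similar first
    neighbors = [vocab[j] for j in order if j != i][:n]
    return {f"nearest_{k}": w for k, w in enumerate(neighbors)}
```

Each feature dict would be merged into the token's full feature set before training the statistical tagger.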
A preliminary evaluation of word representations for named-entity recognition
2009
We use different word representations as word features for a named-entity recognition (NER) system with a linear model. This work is part of a larger empirical survey evaluating different word representations on different NLP tasks. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words.
CebuaNER: A New Baseline Cebuano Named Entity Recognition Model
arXiv (Cornell University), 2023
Despite covering one of the most linguistically diverse groups of countries, computational linguistics and language processing research in Southeast Asia has struggled to match the level of countries in the Global North. Initiatives such as open-sourcing corpora and developing baseline models for basic language processing tasks are therefore important stepping stones to encourage the growth of research in the field. To answer this call, we introduce CebuaNER, a new baseline model for named entity recognition (NER) in the Cebuano language. Cebuano is the second most-used native language in the Philippines, with over 20 million speakers. To build the model, we collected and annotated over 4,000 news articles (the largest dataset of any work in the language) retrieved from local online Cebuano platforms, and trained algorithms such as Conditional Random Fields and Bidirectional LSTM. Our findings show promising results for a new baseline model, achieving over 70% precision, recall, and F1 across all entity tags, as well as potential efficacy in a cross-lingual setup with Tagalog.
Arabic Named Entity Recognition using Word Representations
Recent work has shown the effectiveness of word representation features in significantly improving supervised NER for English. In this study, we investigate whether word representations can also boost supervised NER in Arabic. We use word representations as additional features in a Conditional Random Field (CRF) model and systematically compare three popular neural word embedding algorithms (Skip-gram, CBOW and GloVe) and six different approaches for integrating word representations into the NER system. Experimental results show that Brown clustering achieves the best performance among the six approaches. Among the word embedding features, the clustering embedding features outperform the other embedding features, and the distributional prototypes produce the second-best result. Moreover, the combination of Brown clusters and word embedding features provides additional improvement in F1-score over the baseline.
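Integrating Brown clusters into a CRF typically means adding the cluster bit-string path (at several prefix lengths) to each token's feature dict. The sketch below shows that feature construction only; the cluster assignments and feature names are invented, and the real system would pass such dicts to a CRF trainer.

```python
# Hypothetical Brown cluster assignments (bit-string paths), e.g. as
# produced by running Brown clustering on a large unannotated Arabic corpus.
brown_clusters = {"القاهرة": "1100", "دمشق": "1101", "محمد": "0111"}

def token_features(sent, i):
    """CRF feature dict for token i: simple surface features plus the
    Brown cluster path at coarse and fine prefix lengths (a sketch of
    the cluster-feature integration, not the paper's exact feature set)."""
    w = sent[i]
    feats = {"word": w, "is_first": i == 0}
    path = brown_clusters.get(w, "")
    for p in (2, 4):                   # prefix lengths: coarse and fine
        if path:
            feats[f"brown_prefix_{p}"] = path[:p]
    return feats
```

Using several prefix lengths lets the CRF back off from fine-grained clusters to coarser ones for rare words.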
2016
The paper presents results on Russian named entity classification and equivalent named entity retrieval using word and phrase representations. It is shown that the context vector of a word or expression is an efficient feature for predicting the type of a named entity. Distributed word representations are now claimed, on a reasonable basis, to be one of the most promising distributional semantics models. In the described experiment on retrieving similar named entities, the results go beyond retrieving named entities of the same type or named-entity individuals of the same class: it is shown that equivalent variants of a named entity can be extracted. This result contributes to the task of unsupervised clustering of entities and semantic relations, and can be used for paraphrase search and automatic ontology population. The models were trained with word2vec on the Russian segment of parallel corpora used for statistical machine translation. Vector representations ...
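The context-vector idea above can be illustrated by representing each entity mention as the mean of its context-word vectors and comparing mentions by cosine similarity: two surface variants of the same entity occur in similar contexts, so their context vectors are close. The vectors below are hand-crafted toys, not word2vec output.

```python
import numpy as np

# Toy stand-ins for word2vec context-word vectors.
word_vecs = {
    "capital": np.array([1.0, 0.0, 0.0]),
    "city":    np.array([0.9, 0.1, 0.0]),
    "river":   np.array([0.5, 0.5, 0.0]),
    "company": np.array([0.0, 1.0, 0.0]),
    "bank":    np.array([0.0, 0.9, 0.1]),
}

def context_vector(context_words):
    """Mean of the context-word vectors: the entity representation used
    for type prediction and equivalent-variant retrieval (a sketch)."""
    return np.mean([word_vecs[w] for w in context_words], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two variants of the same place name share contexts; a company does not.
v1 = context_vector(["capital", "city"])
v2 = context_vector(["capital", "city", "river"])
v3 = context_vector(["company", "bank"])
```

Ranking candidates by this similarity is what allows equivalent variants of a named entity to be pulled together without supervision.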