Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi

A Survey on Various Methods used in Named Entity Recognition for Hindi Language

Named Entity Recognition (NER) is a fundamental pre-processing step in many natural language processing use cases, and its main value lies in speeding up downstream NLP applications. Much of the earlier work on identifying named entities concentrated on feature engineering. This becomes a limiting factor for resource-scarce languages, where comparable resources are not easily accessible. Researchers have applied rule-based, machine-learning-based and hybrid approaches to the problem, and deep learning algorithms are now being developed at scale with the aim of improving on these results. Building an NER framework for Indian languages is harder than for many other languages of the world. Hindi and numerous other Indian languages pose inherent difficulties for many NLP tasks. Structurally, Indian languages have complex properties such as free word order, which keeps n-gram-based NER approaches from working well. Furthermore, the lack of capitalization prevents hand-engineered approaches from working well. Many published deep learning models for NER still depend on feature engineering rather than feature learning, which is at odds with the original intent of a deep learning model: to learn features from the data. Here we present a state-of-the-art survey covering different NER methodologies and their F-measures, mainly for Hindi.

AMRITA@FIRE-2014: Named Entity Recognition for Indian Languages

This paper describes Named Entity Recognition (NER) systems for English, Hindi, Tamil and Malayalam, and presents our methodology and results on the NER task of FIRE-2014. The English NER system is based on Conditional Random Fields (CRF) and tags text using the model learned from the training corpus and its feature representation. The systems for Hindi, Tamil and Malayalam are based on Support Vector Machines (SVM). In addition to the training corpus, the Tamil and Malayalam systems use gazetteer information.
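
As a rough illustration of what a "feature representation" with gazetteer information can look like for a CRF or SVM tagger, the sketch below builds a feature dictionary for one token. The feature names, the `token_features` helper and the toy gazetteer are hypothetical and are not taken from the paper; the authors' actual feature set is not given in the abstract.

```python
from typing import Dict, List, Set


def token_features(tokens: List[str], i: int, gazetteer: Set[str]) -> Dict[str, object]:
    """Build an illustrative feature dict for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    return {
        "word": word,
        # Character affixes help when capitalization is unavailable as a cue.
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        # Gazetteer lookup: is the token in a list of known entity names?
        "in_gazetteer": word in gazetteer,
        # Context window of one token on each side.
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


# Toy example; the sentence and gazetteer are made up for illustration.
sentence = ["Chennai", "is", "hot"]
places = {"Chennai"}
features = [token_features(sentence, i, places) for i in range(len(sentence))]
```

Each dictionary in `features` would then be passed, one sequence per sentence, to whichever sequence labeller is being trained.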

HiNER: A Large Hindi Named Entity Recognition Dataset

arXiv (Cornell University), 2022

Named Entity Recognition (NER) is a foundational NLP task that aims to provide class labels like Person, Location, Organisation, Time, and Number to words in free text. Named Entities can also be multi-word expressions, where the additional I-O-B annotation information helps label them during the NER annotation process. While English and European languages have considerable annotated data for the NER task, Indian languages lag on that front, both in quantity and in adherence to annotation standards. This paper releases a significantly sized, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags. We discuss the dataset statistics in all their essential detail and provide an in-depth analysis of the NER tag-set used with our data. The statistics of the tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation. Since the proof of resource-effectiveness lies in building models with the resource and testing them on benchmark data and against leader-board entries in shared tasks, we do the same with this data. We use different language models to perform the sequence labelling task for NER and show the efficacy of our data through a comparative evaluation against models trained on another dataset available for the Hindi NER task. Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper. To the best of our knowledge, no available dataset meets the standards of volume (amount) and variability (diversity) as far as Hindi NER is concerned. We fill this gap through this work, which we hope will significantly help NLP for Hindi. We release this dataset with our code and models for further research.
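
To make the I-O-B annotation mentioned above concrete, here is a minimal sketch of how a tag sequence can be decoded into multi-word entity spans. The `iob_to_spans` helper, the tag names and the example sentence are assumptions for illustration; they are not the dataset's actual tag inventory or the authors' code.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index exclusive)


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Decode IOB tags such as B-PERSON / I-PERSON / O into entity spans."""
    spans: List[Span] = []
    label: Optional[str] = None
    start = 0
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        # "B" always opens a span; a stray "I" is treated leniently as "B".
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    return spans


# Hypothetical tokens and tags, for illustration only.
tokens = ["Narendra", "Modi", "visited", "Patna", "today"]
tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION", "O"]
spans = iob_to_spans(tags)
entities = [(label, " ".join(tokens[s:e])) for label, s, e in spans]
```

Here the two-token name is recovered as a single Person span because its second token carries an `I-` tag that continues the preceding `B-` tag.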

AMRITA@ FIRE-2014: Named Entity Recognition for Indian languages (Working notes)

2014

This paper describes Named Entity Recognition (NER) systems for English, Hindi, Tamil and Malayalam, and presents our methodology and results on the NER task of FIRE-2014. The English NER system is based on Conditional Random Fields (CRF) and tags text using the model learned from the training corpus and its feature representation. The systems for Hindi, Tamil and Malayalam are based on Support Vector Machines (SVM). In addition to the training corpus, the Tamil and Malayalam systems use gazetteer information.

Development of a Hindi Named Entity Recognition System without Using Manually Annotated Training Corpus

International Arab Journal of Information Technology, 2018

Machine learning based approaches to named entity recognition (NER) require a sufficiently large annotated corpus to train the classifier. Other NER resources, like gazetteers, are also required to make the classifier more accurate. But in many languages and domains the relevant NER resources are still not available, and creating adequate and relevant resources is costly and time consuming. However, a large amount of resources and several NER systems are available in resource-rich languages like English. Suitable language adaptation techniques, the NER resources of a resource-rich language and minimally supervised learning might help in such scenarios. In this paper we study a few such techniques in order to develop a Hindi NER system. Without using any Hindi NE-annotated corpus, the developed system achieves a reasonable F-measure of 73.87.

Named Entity Recognition in Indian Languages: A Survey.

International Journal of Engineering Sciences & Research Technology, 2013

Named Entity Recognition (NER) is the process of identifying all proper nouns and classifying them into predefined classes such as persons, places, organi[...] but for Indian languages it is a difficult and challenging task, limited by a lack of resources, although work has started to appear recently. In this paper we present a brief summary of [...] We also explain the different techniques used in NER and review the work done on different languages by different researchers. NER plays a big part in various natural language processing tasks such as machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval and question answering. NER comes under the domain of "information extraction", which extracts specific kinds of data from records.

AsNER -- Annotated Dataset and Baseline for Assamese Named Entity recognition

arXiv (Cornell University), 2022

We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens of text from the speech of the Prime Minister of India and Assamese play. It also contains person names, location names and addresses. The proposed NER dataset is likely to be a significant resource for deep neural Assamese language processing. We benchmark the dataset by training and evaluating NER models using state-of-the-art architectures for supervised named entity recognition such as Fasttext, BERT, XLM-R, FLAIR and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The highest F1-score among all baselines is 80.69%, obtained when using MuRIL as the word embedding method. The annotated dataset and the top performing model are made publicly available.

Survey of Named Entity Recognition Systems with respect to Indian and Foreign Languages

International Journal of Computer Applications, 2016

Named Entity Recognition (NER) is a subtask of Information Extraction that identifies named entities and classifies them into named entity classes such as person, location and organization. NER can be used to preprocess textual information and convert it into a structured form that is useful for Information Retrieval, Machine Translation, Question Answering Systems and Text Summarization. This paper presents a survey of NER research done for various Indian and non-Indian languages. It reports the study and observations related to the approaches, techniques and features required to implement NER for various languages, especially Indian languages.

Named Entity Recognition for Indian Languages: A Survey

Named Entity Recognition (NER) is a subtask of Information Extraction (IE) used to identify and classify the names in any given data. Earlier studies were mostly based on handwritten rules, whereas machine learning models such as Hidden Markov Model (HMM), Maximum Entropy (MaxEnt), Maximum Entropy Markov Model (MEMM), Support Vector Machine (SVM) and Conditional Random Fields (CRFs) are now used to develop NER systems. In this paper we present a survey of NER systems for Indian languages that have been developed using the above learning algorithms. Among the 22 official languages, work on NER has been done for only a few, such as Bengali, Hindi, Tamil, Telugu, Oriya, Punjabi, Urdu and Kannada. We discuss the strategies adopted and the performance of these NER systems with respect to recall, precision and F-measure. Key Words: Named Entity Recognition (NER), Information Extraction (IE), Hidden Markov Model (HMM), Maximum Entropy (MaxEnt), Maximum Entropy Markov Model (MEMM), Support Vector Machine (SVM), Conditional Random Fields (CRFs).
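
For readers unfamiliar with the three metrics named above, the following sketch computes precision, recall and F-measure over sets of predicted and gold entities, counting a prediction as correct only on an exact match of label and boundaries. The `precision_recall_f1` helper and the example spans are hypothetical; individual surveyed systems may score matches differently (for example per token rather than per entity).

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index exclusive)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall and balanced F-measure (F1)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans: one exact match, one boundary error, one missed entity.
gold = {("PER", 0, 2), ("LOC", 3, 4), ("ORG", 5, 6)}
predicted = {("PER", 0, 2), ("LOC", 3, 5)}
p, r, f = precision_recall_f1(gold, predicted)
```

In this example one of two predictions is correct (precision 1/2) and one of three gold entities is found (recall 1/3), so the F-measure, their harmonic mean, is 0.4.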

A Survey on Various Approach used in Named Entity Recognition for Indian Languages

International Journal of Computer Applications

Named Entity Recognition (NER) is an application of Natural Language Processing (NLP) and an activity within Information Extraction. It is used for automated text processing in various industries and is a key concept in academia, artificial intelligence, robotics, bioinformatics and many other fields. NER is essential to major NLP activities such as machine translation, question answering and document summarization. Most NER work has been done for European languages. Among the Indian constitutional languages, NER work has been done for only a few. Progress has been limited by challenges such as lack of resources, ambiguity in the language and rich morphology. In this paper, we identify many challenges in NER for Indian languages and compare approaches using the standard evaluation metrics of accuracy, precision, recall and F-measure.