A Practical Incremental Learning Framework For Sparse Entity Extraction (original) (raw)

Accelerating the annotation of sparse named entities by dynamic sentence selection

BMC Bioinformatics, 2008

Background: Previous studies of named entity recognition have shown that a reasonable level of recognition accuracy can be achieved by using machine learning models such as conditional random fields or support vector machines. However, the lack of training data (i.e. annotated corpora) makes it difficult for machine learning-based named entity recognizers to be used in building practical information extraction systems. Results: This paper presents an active learning-like framework for reducing the human effort required to create named entity annotations in a corpus. In this framework, the annotation work is performed as an iterative and interactive process between the human annotator and a probabilistic named entity tagger. Unlike active learning, our framework aims to annotate all occurrences of the target named entities in the given corpus, so that the resulting annotations are free from the sampling bias which is inevitable in active learning approaches. Conclusion: We evaluate our framework by simulating the annotation process using two named entity corpora and show that our approach can reduce the number of sentences which need to be examined by the human annotator. The cost reduction achieved by the framework could be drastic when the target named entities are sparse.

Named Entity Recognition for Partially Annotated Datasets

arXiv (Cornell University), 2022

The most common Named Entity Recognizers are usually sequence taggers trained on fully annotated corpora, i.e. the class of all words for all entities is known. Sequence taggers are fast to train and to make predictions. Partially annotated corpora, i.e. some but not all entities of some types are annotated, are too noisy for training sequence taggers since the same entity may be annotated one time with its true type but not another time, misleading the tagger. Therefore, we are comparing three training strategies for partially annotated datasets and an approach to derive new datasets for new classes of entities from Wikipedia without time-consuming manual data annotation. In order to properly verify that our data acquisition and training approaches are plausible, we manually annotated test datasets for two new classes, namely food and drugs, and report the resulting performance of all trained models on these test datasets.

A Little Annotation does a Lot of Good: A Study in Bootstrapping Low-resource Named Entity Recognizers

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019

Most state-of-the-art models for named entity recognition (NER) rely on the availability of large amounts of labeled data, making them challenging to extend to new, lowerresourced languages. However, there are now several proposed approaches involving either cross-lingual transfer learning, which learns from other highly resourced languages, or active learning, which efficiently selects effective training data based on model predictions. This paper poses the question: given this recent progress, and limited human annotation, what is the most effective method for efficiently creating high-quality entity recognizers in under-resourced languages? Based on extensive experimentation using both simulated and real human annotation, we find a dualstrategy approach best, starting with a crosslingual transferred model, then performing targeted annotation of only uncertain entity spans in the target language, minimizing annotator effort. Results demonstrate that cross-lingual transfer is a powerful tool when very little data can be annotated, but an entity-targeted annotation strategy can achieve competitive accuracy quickly, with just one-tenth of training data. The code is publicly available here. 1

GERBIL} -- General Entity Annotation Benchmark Framework

The need to bridge between the unstructured data on the Document Web and the structured data on the Web of Data has led to the development of a considerable number of annotation tools. However, these tools are currently still hard to compare since the published evaluation results are calculated on diverse datasets and evaluated based on different measures. We present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. By these means, we aim to ensure that both tool developers and end users can derive meaningful insights pertaining to the extension, integration and use of annotation applications. In particular, GERBIL provides comparable results to tool developers so as to allow them to easily discover the strengths and weaknesses of their implementations with respect to the state of the art. With the permanent experiment URIs provided by our framework, we ensure the reproducibility and archiving of evaluation results. Moreover, the framework generates data in machineprocessable format, allowing for the efficient querying and post-processing of evaluation results. Finally, the tool diagnostics provided by GERBIL allows deriving insights pertaining to the areas in which tools should be further refined, thus allowing developers to create an informed agenda for extensions and end users to detect the right tools for their purposes. GERBIL aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable objective evaluation results.

Supervised Entity and Relation Extraction

We present a system for extracting entities and relations from documents: given a natural text document, identify and classify entities mentioned in the document (e.g. people, locations, etc.) and relations between these entities (e.g. person X lives in location Y). We designed separate systems for relation extraction given already-labeled entities, and for entity extraction from plain text, and then combined the two systems in a pipeline. We ran our system on a small set of sports articles and two larger sets containing biomedical and newswire articles. Both entity extraction and relation extraction are trained in a supervised manner using annotations in the datasets. For entity extraction these annotations allow us to train a conditional random field sequence classifier by matching annotated types to part of speech parse trees that are built from the text. For relation extraction we ran logistic regression using a set of syntactic and surface features of the sentence data. We eval...

Entity Extraction within Plain-Text Collections WISE 2013 Challenge - T1: Entity Linking Track

Lecture Notes in Computer Science, 2013

The increasing availability of electronic texts, such as freecontent encyclopedias on the internet, has unveiled vast interesting and important knowledge in Web 2.0. Nevertheless, to identify relations within a myriad of information is still a challenge. For large corpus of data, it is impractical to manually label each text in order to define relations to extract information. The WISE 2013 conference proposed a challenge (T1 Track) in which teams must label entities within plain texts based on a given set of entities. The Wikilinks dataset comprise 40 million mentions over 3 million entities. This paper describe a straightforward twofold unsupervised strategy to extract and tag entities, aiming to achieve accurate results in the identification of proper nouns and concrete concepts, regardless the domain. The proposed solution is based on a pipeline of text processing modules that includes a lexical parser. In order to validate the proposed solution, we statistically evaluate the results using various measurements in the case study supplied by the Challenge.

Improved Pattern Learning for Bootstrapped Entity Extraction

Proceedings of the Eighteenth Conference on Computational Natural Language Learning, 2014

Bootstrapped pattern learning for entity extraction usually starts with seed entities and iteratively learns patterns and entities from unlabeled text. Patterns are scored by their ability to extract more positive entities and less negative entities. A problem is that due to the lack of labeled data, unlabeled entities are either assumed to be negative or are ignored by the existing pattern scoring measures. In this paper, we improve pattern scoring by predicting the labels of unlabeled entities. We use various unsupervised features based on contrasting domain-specific and general text, and exploiting distributional similarity and edit distances to learned entities. Our system outperforms existing pattern scoring algorithms for extracting drug-andtreatment entities from four medical forums.

Reducing class imbalance during active learning for named entity annotation

In lots of natural language processing tasks, the classes to be dealt with often occur heavily imbalanced in the underlying data set and classifiers trained on such skewed data tend to exhibit poor performance for low-frequency classes. We introduce and compare different approaches to reduce class imbalance by design within the context of active learning (AL). Our goal is to compile more balanced data sets up front during annotation time when AL is used as a strategy to acquire training material. We situate our approach in the context of named entity recognition. Our experiments reveal that we can indeed reduce class imbalance and increase the performance of classifiers on minority classes while preserving a good overall performance in terms of macro F-score.

EAGER: Extending Automatically Gazetteers for Entity Recognition

Key to named entity recognition, the manual gazetteering of entity lists is a costly, errorprone process that often yields results that are incomplete and suffer from sampling bias. Exploiting current sources of structured information, we propose a novel method for extending minimal seed lists into complete gazetteers. Like previous approaches, we value WIKIPEDIA as a huge, well-curated, and relatively unbiased source of entities. However, in contrast to previous work, we exploit not only its content, but also its structure, as exposed in DBPEDIA. We extend gazetteers through Wikipedia categories, carefully limiting the impact of noisy categorizations. The resulting gazetteers easily outperform previous approaches on named entity recognition.

FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

This paper presents FAMIE, a comprehensive and efficient active learning (AL) toolkit for multilingual information extraction. FAMIE is designed to address a fundamental problem in existing AL frameworks where annotators need to wait for a long time between annotation batches due to the time-consuming nature of model training and data selection at each AL iteration. This hinders the engagement, productivity, and efficiency of annotators. Based on the idea of using a small proxy network for fast data selection, we introduce a novel knowledge distillation mechanism to synchronize the proxy network with the main large model (i.e., BERT-based) to ensure the appropriateness of the selected annotation examples for the main model. Our AL framework can support multiple languages. The experiments demonstrate the advantages of FAMIE in terms of competitive performance and time efficiency for sequence labeling with AL. We publicly release our code (https://github.com/ nlp-uoregon/famie) and demo website (http://nlp.uoregon.edu:9000/). A demo video for FAMIE is provided at: https://youtu.be/I2i8n\_jAyrY.