Improved Pattern Learning for Bootstrapped Entity Extraction

Content and Context: Two-Pronged Bootstrapped Learning for Regex-Formatted Entity Extraction

2018

Regular expressions are an important building block of rule-based information extraction systems. Regexes can encode rules to recognize instances of simple entities which can then feed into the identification of more complex cross-entity relationships. Manually crafting a regex that recognizes all possible instances of an entity is difficult since an entity can manifest in a variety of different forms. Thus, the problem of automatically generalizing manually crafted seed regexes to improve the recall of IE systems has attracted research attention. In this paper, we propose a bootstrapped approach to improve the recall for extraction of regex-formatted entities, with the only source of supervision being the seed regex. Our approach starts from a manually authored high precision seed regex for the entity of interest, and uses the matches of the seed regex and the context around these matches to identify more instances of the entity. These are then used to identify a set of diverse, hig...

Facilitating information extraction without annotated data using unsupervised and positive-unlabeled learning

AMIA ... Annual Symposium proceedings. AMIA Symposium, 2020

Information extraction (IE), the distillation of specific information from unstructured data, is a core task in natural language processing. For rare entities (<1% prevalence), collection of positive examples required to train a model may require an infeasibly large sample of mostly negative ones. We combined unsupervised learning with biased positive-unlabeled (PU) learning methods to: 1) facilitate positive example collection while maintaining the assumptions needed to 2) learn a binary classifier from the biased positive-unlabeled data alone. We tested the methods on a real-life use case of rare (<0.42%) entity extraction from medical malpractice documents. When tested on a manually reviewed random sample of documents, the PU model achieved an area under the precision-recall curve of 0.283 and F1 of 0.410, outperforming fully supervised learning (0.022 and 0.096, respectively). The results demonstrate our method's potential to reduce the manual effort required for extracting rar...

A Comprehensive Review of Bootstrapping Pattern Learning Techniques in Information Extraction

2021

The automatic extraction of information from unstructured sources has opened up new avenues for querying, organizing, and analyzing data by drawing upon the clean semantics of structured databases and the abundance of unstructured data. The field of information extraction has its genesis in the natural language processing community. The primary impetus came from competitions centered around recognizing named entities, such as the names of people and organizations, in news articles. This paper surveys bootstrapping pattern learning techniques in information extraction and presents a detailed discussion of the algorithms and the relative pros and cons of the techniques used. We also study the problems various authors attempt to solve and analyze when it is appropriate to use a bootstrapping algorithm. Index Terms: Information Extraction, Natural Language Processing, Bootstrapping Pattern Learning.

A New Data Representation Based on Training Data Characteristics to Extract Drug Named-Entity in Medical Text

2016

One essential task in information extraction from the medical corpus is drug name recognition. Compared with text from other domains, medical text has unique characteristics. In addition, medical text mining poses more challenges, e.g., more unstructured text, the rapid addition of new terms, and a wide range of name variations for the same drug. The mining is even more challenging due to the lack of labeled datasets and external knowledge, as well as multiple-token representations for a single drug name, which are more common in real application settings. Although many approaches have been proposed to tackle the task, some problems remain, with poor F-score performance (less than 0.75). This paper presents a new treatment in data representation techniques to overcome some of those challenges. We propose three data representation techniques based on the characteristics of word distribution and word similarities as a result of word emb...

Distributed Representations of Words to Guide Bootstrapped Entity Classifiers

Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015

Bootstrapped classifiers iteratively generalize from a few seed examples or prototypes to other examples of target labels. However, sparseness of language and limited supervision make the task difficult. We address this problem by using distributed vector representations of words to aid the generalization. We use the word vectors to expand entity sets used for training classifiers in a bootstrapped pattern-based entity extraction system. Our experiments show that the classifiers trained with the expanded sets perform better on entity extraction from four online forums, with 30% F1 improvement on one forum. The results suggest that distributed representations can provide good directions for generalization in a bootstrapping system.
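
The idea of expanding an entity set with word vectors can be illustrated with a minimal sketch. It is not the authors' method: the toy three-dimensional vectors, the `expand_seed_set` helper, and the 0.95 cosine threshold are all assumptions made up for this example, and a real system would load pretrained embeddings instead.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
TOY_VECTORS = {
    "ibuprofen": [0.9, 0.1, 0.0],
    "aspirin": [0.8, 0.2, 0.1],
    "naproxen": [0.85, 0.15, 0.05],
    "headache": [0.1, 0.9, 0.2],
    "doctor": [0.0, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seed_set(seeds, vectors, threshold=0.95):
    """Add every word whose vector is close to at least one seed vector.

    Similarity is measured against the original seeds only, so the
    result does not depend on iteration order.
    """
    known_seeds = [s for s in seeds if s in vectors]
    expanded = set(seeds)
    for word, vec in vectors.items():
        if word in expanded:
            continue
        if any(cosine(vec, vectors[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


expanded = expand_seed_set({"ibuprofen"}, TOY_VECTORS)
print(sorted(expanded))
```

With these toy vectors, the two other drug names fall above the threshold while "headache" and "doctor" do not. The expanded set could then serve as additional training examples for a classifier.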

SPOT the Drug! An Unsupervised Pattern Matching Method to Extract Drug Names from Very Large Clinical Corpora

2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology, 2012

Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. "Understanding" these texts has been a focus of natural language processing (NLP) research for many years, with some remarkable successes, yet there is more work to be done. Knowing the drugs patients take is not only critical for understanding patient health (e.g., for drug-drug interactions or drug-enzyme interactions), but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm, FDA's Orange Book, or NCI, with a focus on prescription drugs. Developing these dictionaries is a challenge, but even more challenging is keeping these dictionaries up to date in the face of a rapidly advancing field; it is critical to identify grapefruit as a "drug" for a patient who takes the prescription medicine Lipitor, due to their known adverse interaction. To discover other, new adverse drug interactions, a large number of patient histories often need to be examined, necessitating not only accurate but also fast algorithms to identify pharmacological substances. In this paper we propose a new algorithm, SPOT, which identifies drug names that can be used as new dictionary entries from a large corpus, where a "drug" is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. Measured against a manually annotated reference corpus, we present precision and recall values for SPOT. SPOT is language and syntax independent, can be run efficiently to keep dictionaries up to date, and can also suggest words and phrases which may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text.
SPOT is a generalized bootstrapping algorithm, seeded with a known dictionary and automatically extracting the context within which each drug is mentioned. We define three features of such context: support, confidence and prevalence. Finally, we present the performance tradeoffs depending on the thresholds chosen for these features.
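
A single round of dictionary-seeded context bootstrapping might look like the sketch below. The abstract does not define support, confidence, or prevalence, so the definitions here are assumptions borrowed from association-rule mining: support is the number of times a context surrounds a known dictionary term, and confidence is the fraction of that context's fillers that are known terms. Prevalence is left out because its definition cannot be inferred from the abstract. The function names, thresholds, and sentences are hypothetical.

```python
from collections import defaultdict

# Toy corpus, invented for illustration.
SENTENCES = [
    "patient takes lipitor daily",
    "she takes aspirin daily",
    "he takes warfarin daily",
    "the nurse visits daily",
]


def score_contexts(sentences, dictionary):
    """Collect (left word, right word) contexts and score them.

    Assumed definitions (not taken from the SPOT paper):
      support    = occurrences of the context around a dictionary term
      confidence = support / all occurrences of the context
    """
    fills = defaultdict(list)  # context -> words seen inside it
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens) - 1):
            fills[(tokens[i - 1], tokens[i + 1])].append(tokens[i])
    scores = {}
    for context, words in fills.items():
        support = sum(1 for w in words if w in dictionary)
        if support:
            scores[context] = (support, support / len(words))
    return scores, fills


def propose_candidates(sentences, dictionary, min_support=2, min_confidence=0.6):
    """Return unknown words found in contexts that pass both thresholds."""
    scores, fills = score_contexts(sentences, dictionary)
    candidates = set()
    for context, (support, confidence) in scores.items():
        if support >= min_support and confidence >= min_confidence:
            candidates.update(w for w in fills[context] if w not in dictionary)
    return candidates


print(propose_candidates(SENTENCES, {"lipitor", "aspirin"}))
```

Here the context ("takes", "daily") surrounds two known drugs out of three fillers, so "warfarin" is proposed as a new entry. Raising `min_confidence` above 2/3 would suppress it, which is the kind of threshold tradeoff the abstract mentions.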

Boosting Biomedical Entity Extraction by Using Syntactic Patterns for Semantic Relation Discovery

2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2010

Biomedical entity extraction from unstructured web documents is an important task that needs to be performed in order to discover knowledge in the veterinary medicine domain. In general, this task can be approached by applying domain-specific ontologies, but a review of the literature shows that there is no universal dictionary or ontology for this domain. To address this issue, we manually construct an ontology for extracting entities such as animal disease names, viruses and serotypes. We then use an automated ontology expansion approach to extract semantic relationships between concepts. Such relationships include asserted synonymy, hyponymy and causality. Specifically, these relationships are extracted by using a set of syntactic patterns and part-of-speech tagging. The resulting ontology contains richer semantics compared to the manually constructed ontology. We compare our approach for extracting synonyms, hyponyms and other disease related concepts with an approach where the ontology is expanded using GoogleSets, on the veterinary medicine entity extraction task. Experimental results show that our semantic relationship extraction approach produces a significant increase in precision and recall as compared to the GoogleSets approach.
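
As a rough illustration of pattern-based hyponymy extraction, the sketch below applies one lexical pattern of the form "X such as A, B and C". It is a simplification and not the paper's system: it uses a plain regular expression with no part-of-speech tagging, accepts only a single-word hypernym, and treats everything up to the next period or semicolon as the list. The pattern, the `extract_hyponyms` helper, and the sentence are assumptions for this example.

```python
import re

# One "X such as A, B and C" pattern; the list runs to the next '.' or ';'.
SUCH_AS = re.compile(r"(?P<hypernym>\w+) such as (?P<items>[^.;]+)")

# Splits on commas (optionally followed by and/or) or on a bare and/or.
LIST_SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hypernym")
        for item in LIST_SEPARATOR.split(match.group("items")):
            item = item.strip()
            if item:
                pairs.append((item, hypernym))
    return pairs


sentence = "The site tracks viruses such as rabies, influenza and distemper."
print(extract_hyponyms(sentence))
```

A tagger-backed version would restrict both sides of the pattern to noun phrases, which avoids picking up trailing verbs when the list is followed by more text.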

Supervised Entity and Relation Extraction

We present a system for extracting entities and relations from documents: given a natural text document, identify and classify entities mentioned in the document (e.g. people, locations, etc.) and relations between these entities (e.g. person X lives in location Y). We designed separate systems for relation extraction given already-labeled entities, and for entity extraction from plain text, and then combined the two systems in a pipeline. We ran our system on a small set of sports articles and two larger sets containing biomedical and newswire articles. Both entity extraction and relation extraction are trained in a supervised manner using annotations in the datasets. For entity extraction these annotations allow us to train a conditional random field sequence classifier by matching annotated types to part of speech parse trees that are built from the text. For relation extraction we ran logistic regression using a set of syntactic and surface features of the sentence data. We eval...

YAPPIE—Learning information extraction patterns from unlabeled data

Motivation: A major goal in biomedical text mining is the extraction of biological entities, associations between them, and their respective mapping to database entries. One common and successful approach is to use sets of linguistic patterns that match, for instance, protein-protein interactions or gene-disease associations in a sentence. Pattern engineering is usually done by hand or relies on manually annotated corpora.

QuickUMLS: a fast, unsupervised approach for medical concept extraction

Entity extraction is a fundamental step in many health informatics systems. In recent years, tools such as MetaMap and cTAKES have been widely used for medical concept extraction on medical literature and clinical notes; however, relatively little interest has been placed on their scalability to large datasets. In this work, we present QuickUMLS: a fast, unsupervised, approximate dictionary matching algorithm for medical concept extraction. The proposed method achieves precision and recall similar to those of state-of-the-art systems on two clinical notes corpora, and outperforms MetaMap and cTAKES on a dataset of consumer drug reviews. More importantly, it is up to 135 times faster than both systems.
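
Approximate dictionary matching in general can be sketched with character-trigram Jaccard similarity, as below. This is a generic illustration and does not reproduce the QuickUMLS implementation or API: the `best_match` helper, the padding scheme, the 0.5 threshold, and the concept list are assumptions for this example. It also scans the dictionary linearly, whereas a fast system would use an index.

```python
def trigrams(text):
    """Set of character trigrams, padded so word boundaries count."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(mention, concepts, threshold=0.5):
    """Return the most similar dictionary entry, or None below threshold."""
    if not concepts:
        return None
    grams = trigrams(mention)
    score, concept = max((jaccard(grams, trigrams(c)), c) for c in concepts)
    return concept if score >= threshold else None


CONCEPTS = ["ibuprofen", "insulin", "aspirin"]
print(best_match("ibuprofin", CONCEPTS))  # misspelled mention
print(best_match("xyz", CONCEPTS))
```

The misspelled mention shares 7 of 13 distinct trigrams with "ibuprofen", so it clears the threshold, while "xyz" shares none with any entry and is rejected.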