Assessment of disease named entity recognition on a corpus of annotated sentences
Related papers
NCBI disease corpus: A resource for disease name recognition and concept normalization
2014
Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information; however, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which provided pre-annotations as a starting point for manual annotation. Fourteen annotators were randomly paired, and differing annotations were discussed to reach a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to ensure corpus-wide consistency.
Disease Named Entity Recognition Using NCBI Corpus
2016
Named Entity Recognition (NER) in biomedical literature is a very active research area. NER is a crucial component of biomedical text mining because it enables information retrieval, reasoning and knowledge discovery. Much research has been carried out in this area using semantic type categories such as "DNA", "RNA", "proteins" and "genes". However, disease NER has not yet received the attention it needs, specifically human disease NER. Traditional machine learning approaches lack precision for disease NER due to their dependence on token-level features, sentence-level features and the integration of features such as orthographic, contextual and linguistic features. In this paper, a method for disease NER is proposed that utilizes sentence- and token-level features based on Conditional Random Fields, using the NCBI disease corpus. Our system utilizes rich features including orthographic, contextual, affix, bigram, part-of-speech and stem-based features. Using these feature ...
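The feature set described in this abstract (orthographic, contextual, affix and part-of-speech features) is the standard input to a linear-chain CRF. A minimal sketch of such a per-token feature extractor, with illustrative feature names rather than the paper's exact set:

```python
def token_features(tokens, i):
    """Orthographic, affix and contextual features for token i, in the
    style commonly fed to a linear-chain CRF tagger. Feature names and
    the window size are illustrative assumptions, not the paper's set."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),                        # normalized form
        "is_upper": tok.isupper(),                   # orthographic
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "prefix3": tok[:3],                          # affix features
        "suffix3": tok[-3:],
    }
    # contextual features: neighbouring tokens (window of 1)
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True   # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True   # end of sentence
    return feats

sent = ["Ovarian", "cancer", "is", "heritable", "."]
f = token_features(sent, 1)   # features for "cancer"
```

In practice these dictionaries are passed, one per token, to a CRF toolkit such as CRFsuite, which learns feature weights jointly with label transition scores.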
An improved corpus of disease mentions in PubMed citations
aclweb.org
The latest discoveries on diseases and their diagnosis/treatment are mostly disseminated in the form of scientific publications. However, with the rapid growth of the biomedical literature and a high level of variation and ambiguity in disease names, the task of retrieving disease-related articles becomes increasingly challenging using the traditional keyword-based approach. An important first step for any disease-related information extraction task in the biomedical literature is the disease mention recognition task. However, despite the strong interest, there has not been enough work done on disease name identification, perhaps because of the difficulty in obtaining adequate corpora. Towards this aim, we created a large-scale disease corpus consisting of 6900 disease mentions in 793 PubMed citations, derived from an earlier corpus. Our corpus contains rich annotations, was developed by a team of 12 annotators (two people per annotation) and covers all sentences in a PubMed abstract. Disease mentions are categorized into Specific Disease, Disease Class, Composite Mention and Modifier categories. When used as the gold standard data for a state-of-the-art machine-learning approach, significantly higher performance can be found on our corpus than on the previous one. Such characteristics make this disease name corpus a valuable resource for mining disease-related information from biomedical text. The NCBI corpus is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html.
Recognition and normalization of disease mentions in PubMed abstracts
2015
The rapidly increasing number of available PubMed documents calls for an automatic approach to the identification and normalization of disease mentions in order to increase the precision and effectiveness of information retrieval. We herein describe our team's participation in the Disease Named Entity Recognition and Normalization subtask under the chemical-disease relations track of the BioCreative V shared task. We developed a CRF-based model using the BIESO tagging format to allow automated recognition of disease entities in PubMed abstracts. Recognized disease entities were normalized to MeSH concepts using a dictionary look-up method based on Lucene. Performance is reported using precision, recall and F-measure on three separate runs. Our best run achieved an F-measure of 80.74% on disease mention recognition and 67.85% on disease normalization.
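The BIESO format mentioned above marks each token as Begin, Inside, End, Single or Outside of a mention, giving the CRF explicit boundary labels. A minimal sketch of converting mention token spans into BIESO tags (the half-open span convention and the "Disease" label are assumptions for illustration):

```python
def bieso_tags(n_tokens, mentions):
    """Convert mention token spans (start, end_exclusive) into BIESO tags.
    B = Begin, I = Inside, E = End, S = Single-token mention, O = Outside.
    Assumes mentions do not overlap."""
    tags = ["O"] * n_tokens
    for start, end in mentions:
        if end - start == 1:
            tags[start] = "S-Disease"        # one-token mention
        else:
            tags[start] = "B-Disease"        # first token
            for j in range(start + 1, end - 1):
                tags[j] = "I-Disease"        # interior tokens
            tags[end - 1] = "E-Disease"      # last token

    return tags

# "type 2 diabetes" spans tokens 2..4 (end-exclusive index 5)
tokens = ["Patients", "with", "type", "2", "diabetes", "were", "enrolled"]
tags = bieso_tags(len(tokens), [(2, 5)])
```

Compared with plain BIO, the extra E and S labels let the model learn distinct weights for mention-final and single-token contexts, which is why BIESO is often preferred for mention recognition.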
AMIA Annual Symposium Proceedings, 2008
Concept-specific lexicons (e.g. diseases, drugs, anatomy) are a critical source of background knowledge for many medical language-processing systems. However, the rapid pace of biomedical research and the lack of constraints on usage ensure that such dictionaries are incomplete. Focusing on disease terminology, we have developed an automated, unsupervised, iterative pattern learning approach for constructing a comprehensive medical dictionary of disease terms from randomized clinical trial (RCT) abstracts, and we compared different ranking methods for automatically extracting contextual patterns and concept terms. When used to identify disease concepts from 100 randomly chosen, manually annotated clinical abstracts, our disease dictionary shows significant performance improvement (F1 increased by 35-88%) over available, manually created disease terminologies.
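Iterative pattern learning of this kind alternates between two steps: use known terms to extract contextual patterns, then use those patterns to extract new terms. A heavily simplified sketch of one such iteration, using regexes with a TERM slot; this is a generic bootstrapping loop, not the paper's ranking-based algorithm:

```python
import re

def one_iteration(abstracts, seed_terms, patterns):
    """One round of pattern/term bootstrapping (illustrative sketch only).
    Patterns are lowercase strings with a TERM placeholder; the real
    approach additionally ranks and filters candidates."""
    # Step 1: apply known patterns to harvest candidate terms
    new_terms = set()
    for pat in patterns:
        rx = re.compile(pat.replace("TERM", r"([a-z][a-z ]+?)"))
        for text in abstracts:
            for m in rx.finditer(text.lower()):
                new_terms.add(m.group(1).strip())
    # Step 2: harvest candidate patterns from contexts of known terms
    new_patterns = set()
    for term in seed_terms:
        for text in abstracts:
            for m in re.finditer(r"(\w+ )" + re.escape(term), text.lower()):
                new_patterns.add(m.group(1) + "TERM")
    return new_terms, new_patterns

abstracts = ["Patients diagnosed with asthma were enrolled.",
             "Patients diagnosed with lung cancer improved."]
terms, pats = one_iteration(abstracts, {"asthma"},
                            ["diagnosed with TERM were"])
```

Without the ranking step, such loops quickly drift ("semantic drift"), which is why the paper's comparison of ranking methods for patterns and terms is central to its approach.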
Introduction: named entity recognition in biomedicine
Journal of Biomedical Informatics, 2004
This special issue responds to the increasing interest of the biomedical community in text mining techniques. This is an exciting time for the text processing community, as there is an urgent need for text mining tools and methods in the biomedical domain. The amount of biological literature published daily is growing exponentially. Medline alone contains 14 million abstracts and is a critical source of information for biologists and curators. As these scientists find it essential to search for information in an overabundance of documents, their need for text mining techniques tailored to the biological domain has become apparent. The focus of this special issue is on named entity recognition (NER) in biomedicine, a fundamental challenge for text mining due to the special problems caused by the complex nature of biological entity recognition, classification, and unique identification. This is a key factor for access to the information stored in literature, as it is the biological entities and their relationships that convey knowledge across scientific articles. Textual terms (names of genes, proteins, gene products, organisms, drugs, chemical compounds, etc.) are the primary means of scientific communication because they are used in language to represent the concepts in the domain; it would be impossible to "understand" an article or to extract information from it without the precise identification and association of the terms. Biomedical terminology presents a special challenge. It is constantly changing; new terms are rapidly being introduced for each of the organisms being studied, while old ones are discarded (e.g., withdrawn or made obsolete). Biological names are very complex, as they are created and referenced by many different communities. They include an enormous amount of synonyms and variant forms, such as acronyms, and morphological, derivational, and orthographic variants, all of which are used interchangeably in the literature.
In addition, many biological terms and their variants are ambiguous. They share their lexical representations with common English words (gene names/symbols, such as an, by, can, and
A Re-Evaluation of Biomedical Named Entity–Term Relations
Journal of Bioinformatics and Computational Biology, 2010
Text mining can support the interpretation of the enormous quantity of textual data produced in biomedical field. Recent developments in biomedical text mining include advances in the reliability of the recognition of named entities (NEs) such as specific genes and proteins, as well as movement toward richer representations of the associations of NEs. We argue that this shift in representation should be accompanied by the adoption of a more detailed model of the relations holding between NEs and other relevant domain terms. As a step toward this goal, we study NE–term relations with the aim of defining a detailed, broadly applicable set of relation types based on accepted domain standard concepts for use in corpus annotation and domain information extraction approaches.
Database: The Journal of Biological Databases and Curation, 2014
BioC is a new format and associated code libraries for sharing text and annotations. We have implemented BioC natural language preprocessing pipelines in two popular programming languages: C++ and Java. The current implementations interface with the well-known MedPost and Stanford natural language processing tool sets. The pipeline functionality includes sentence segmentation, tokenization, part-of-speech tagging, lemmatization and sentence parsing. These pipelines can be easily integrated along with other BioC programs into any BioC compliant text mining systems. As an application, we converted the NCBI disease corpus to BioC format, and the pipelines have successfully run on this corpus to demonstrate their functionality. Code and data can be downloaded from http://bioc.sourceforge.net. Database URL: http://bioc.sourceforge.net.
2010
The production of gold standard corpora is time-consuming and costly. We propose an alternative: the 'silver standard corpus' (SSC), a corpus that has been generated by the harmonisation of the annotations delivered by a selection of annotation systems. The systems have to share the type system for the annotations, and the harmonisation solution has to use a suitable similarity measure for the pair-wise comparison of the annotations. The annotation systems have been evaluated against the harmonised set (630,324 sentences, 15,956,841 tokens). We demonstrate that the annotation of proteins and genes shows higher diversity across all used annotation solutions, leading to a lower agreement against the harmonised set in comparison to the annotations of diseases and species. An analysis of the most frequent annotations from all systems shows that a high agreement amongst systems leads to the selection of terms that are suitable to be kept in the harmonised set. This is the first large-scale approach to generate an annotated corpus from automated annotation systems. Further research is required to understand how the annotations from different systems have to be combined to produce the best annotation result for a harmonised corpus.
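The core of a silver-standard harmonisation is agreement-based voting: an annotation survives if enough systems produced it. A minimal sketch, simplified to exact-match voting (the actual work compares annotations with a similarity measure rather than exact string identity):

```python
from collections import Counter

def harmonise(system_annotations, min_votes):
    """Keep an annotation in the silver standard if at least min_votes
    systems produced it. system_annotations is a list with one list of
    annotations per system; exact-match voting is a simplifying
    assumption standing in for a pair-wise similarity measure."""
    counts = Counter(
        ann
        for anns in system_annotations
        for ann in set(anns)          # each system votes at most once
    )
    return {ann for ann, votes in counts.items() if votes >= min_votes}

systems = [["asthma", "fever"],       # annotations from system 1
           ["asthma"],                # system 2
           ["asthma", "cough"]]       # system 3
silver = harmonise(systems, min_votes=2)
```

Raising `min_votes` trades recall for precision in the resulting silver standard, which mirrors the paper's observation that high inter-system agreement selects terms worth keeping.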
2021
Background: The large volume of medical literature makes it difficult for healthcare professionals to keep abreast of the latest studies that support Evidence-Based Medicine. Natural language processing enhances the access to relevant information, and gold standard corpora are required to improve systems. To contribute a new dataset for this domain, we collected the Clinical Trials for Evidence-Based Medicine in Spanish (CT-EBM-SP) corpus. Methods: We annotated 1200 texts about clinical trials with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). We doubly annotated 10% of the corpus and measured inter-annotator agreement (IAA) using F-measure. As a use case, we ran medical entity recognition experiments with neural network models. Results: This resource contains 500 abstracts of journal articles about clinical trials and 700 announcements of trial protocols (292,173 tokens). We annotated 46,699 entities (13.98% are nested entities). Regarding IAA, we obtained an average F-measure of 85.65% (±4.79, strict match) and 93.94% (±3.31, relaxed match). In the use case experiments, we achieved recognition results ranging from 80.28% (±0.99) to 86.74% (±0.19) average F-measure. Conclusions: Our results show that this resource is adequate for experiments with state-of-the-art approaches to biomedical named entity recognition. It is freely distributed at http://www.lllf.uam.es/ESP/nlpmedterm_en.html. The methods are generalizable to other languages with similar available sources.
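Measuring IAA with F-measure, as this abstract does, treats one annotator as the reference and scores the other against it, under a strict (exact-span) or relaxed (overlap) matching criterion. A minimal sketch over (start, end, label) spans; the relaxed definition used here (any overlap with the same label) is one common choice, not necessarily the paper's exact criterion:

```python
def f_measure(ann_a, ann_b, relaxed=False):
    """Pairwise inter-annotator F-measure over (start, end, label) spans,
    treating ann_b as reference. Strict = identical span boundaries;
    relaxed = any character overlap with the same label (an assumption)."""
    def match(x, y):
        if x[2] != y[2]:                      # labels must agree
            return False
        if relaxed:
            return x[0] < y[1] and y[0] < x[1]   # spans overlap
        return x[0] == y[0] and x[1] == y[1]     # exact boundaries

    tp_a = sum(1 for a in ann_a if any(match(a, b) for b in ann_b))
    tp_b = sum(1 for b in ann_b if any(match(b, a) for a in ann_a))
    prec = tp_a / len(ann_a) if ann_a else 0.0
    rec = tp_b / len(ann_b) if ann_b else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# annotator A and B agree exactly on one DISO span,
# and only partially on a PROC span (character offsets)
a = [(0, 5, "DISO"), (10, 15, "PROC")]
b = [(0, 5, "DISO"), (11, 14, "PROC")]
```

The gap between the strict and relaxed scores (85.65% vs 93.94% in the paper) is typical: annotators usually agree on which mentions exist but differ on exact boundaries.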