Chemical named entities recognition: a review on approaches and applications

CheNER: Chemical Named Entity Recognizer

2013

Motivation: Chemical named entity recognition is used to automatically identify mentions of chemical compounds in text, and is the basis for more elaborate information extraction. However, only a small number of freely available applications can identify such mentions. Identifying IUPAC chemical compounds is particularly challenging and useful: because of the complex morphology of IUPAC names, it requires more advanced techniques than identifying brand names does. Results: We present CheNER, a tool for automated identification of systematic IUPAC chemical mentions. We evaluated different systems using an established literature corpus to show that CheNER performs better at identifying IUPAC names specifically, and that it makes better use of computational resources. Availability: http://metres.udl.cat/index.php/9-download/4-chener, http://ubio.bioinfo.cnio.es/biotools/CheNER/ Supplementary information: Both web sites above include the user manual for the software. Supple...

Survey on Information Extraction from Chemical Compound Literatures: Techniques and Challenges

2014

Chemical documents, especially those involving drug information, comprise a variety of types, the most common being journal articles, patents and theses. They typically contain large amounts of chemical information, such as PubMed-ID, activity classes and adverse or side effects. Information extraction techniques pull this information out of large document collections and present it in a structured format; they can be applied to structured, semi-structured and unstructured texts. Numerous information extraction methods and techniques have been proposed and implemented. In principle, there are two main approaches to information extraction: the knowledge engineering approach and the learning approach. In this survey study, we first provide the historical background on information extraction approaches applied to chemical documents and discuss several kinds of information extraction tasks that have emerged in recent years. Then, we discuss the metrics used for evaluating informat...

A dictionary- and grammar-based chemical named entity recognizer

The past decade has seen a massive increase in the number of chemical-related publications. Automatic identification and extraction of compounds and drugs mentioned in these publications can greatly benefit drug discovery research. The BioCreative CHEMDNER task focuses on recognizing and ranking mentions of these compounds in text (CDI) and extracting mention locations (CEM). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. Using an open source indexing system, we assessed the performance of ten different commercial and publicly available lexical resources in combination with three different chemical compound recognizers. The best combination along with a set of regular expressions was used to extract the compounds. To rank the different compounds found in a text, a normalized ratio of frequency of mention of chemical terms in chemical and non-chemical journals was calculated. When tested on the training data, our final system obtained an F-score of 88.5% for the CDI task, and 81.0% for the CEM task.
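
The ensemble described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the lexicon, the formula pattern, the frequency tables and the add-one smoothing in the ranking score are all assumptions introduced here, and the function names are hypothetical.

```python
import re

# Hypothetical toy lexicon; the lexical resources from the abstract are not reproduced.
LEXICON = {"aspirin", "ibuprofen", "sodium chloride"}

# Hypothetical pattern for simple molecular formulas that contain a digit (H2O, C6H12O6).
FORMULA = re.compile(r"\b(?=[A-Za-z]*\d)(?:[A-Z][a-z]?\d*){2,}\b")


def dictionary_mentions(text):
    """Return (start, end, surface) for every lexicon term found in the text."""
    found = []
    for term in LEXICON:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((m.start(), m.end(), m.group()))
    return found


def regex_mentions(text):
    """Return (start, end, surface) for every formula-like string in the text."""
    return [(m.start(), m.end(), m.group()) for m in FORMULA.finditer(text)]


def extract_mentions(text):
    """Union of both recognizers, sorted by offset (a CEM-style output)."""
    return sorted(set(dictionary_mentions(text)) | set(regex_mentions(text)))


def rank_compounds(mentions, chem_freq, other_freq):
    """Rank distinct compounds by how strongly they favor chemical journals
    (a CDI-style output). The smoothed ratio below is an assumption."""
    scores = {}
    for _, _, surface in mentions:
        key = surface.lower()
        c = chem_freq.get(key, 0)
        o = other_freq.get(key, 0)
        scores[key] = (c + 1) / (c + o + 2)
    return sorted(scores, key=lambda k: (-scores[k], k))


text = "Aspirin and ibuprofen were dissolved in H2O."
mentions = extract_mentions(text)
ranking = rank_compounds(
    mentions,
    chem_freq={"aspirin": 90, "h2o": 10},   # made-up counts
    other_freq={"aspirin": 10, "h2o": 90},  # made-up counts
)
```

Taking the union of recognizers favors recall; a real system would also need to resolve overlapping or nested matches, which this sketch does not attempt.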

CheNER: a tool for the identification of chemical entities and their classes in biomedical literature

Journal of cheminformatics, 2015

Small chemical molecules regulate biological processes at the molecular level. Those molecules are often involved in causing or treating pathological states. Automatically identifying such molecules in biomedical text is difficult because of both the diverse morphology of chemical names and the alternative types of nomenclature that are used simultaneously to describe them. To address these issues, the last BioCreAtIvE challenge proposed a CHEMDNER task, which is a Named Entity Recognition (NER) challenge that aims at labelling different types of chemical names in biomedical text. To address this challenge we tested various approaches to recognizing chemical entities in biomedical documents. These approaches range from linear Conditional Random Fields (CRFs) to a combination of CRFs with regular expression and dictionary matching, followed by a post-processing step to tag those chemical names in a corpus of Medline abstracts. We named our best performing systems CheNER. We evaluate the...
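
As a rough illustration of how a linear CRF tagger can be made sensitive to the morphology of chemical names, the sketch below builds one feature dictionary per token. The feature set is an assumption chosen for illustration and is not CheNER's actual feature set; the function names are hypothetical.

```python
def word_shape(token):
    """Map characters to classes (A upper, a lower, 0 digit, others kept)
    and collapse consecutive repeats, e.g. "NaCl" becomes "AaAa"."""
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "0"
        else:
            cls = ch
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def token_features(tokens, i):
    """Features for token i, including its immediate neighbors as context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "suffix3": tok[-3:].lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = ["Patients", "received", "2-acetoxybenzoic", "acid", "daily"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence of such dictionaries, paired with per-token labels, is the usual input to a CRF toolkit that accepts feature dictionaries; dictionary and regular-expression matches could be added as further boolean features.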

The CHEMDNER corpus of chemicals and drugs and its annotation principles

Journal of cheminformatics, 2015

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty...
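
The seven SACEM classes named in this abstract lend themselves to a small typed record. The sketch below is one possible in-memory representation; the field names and the example mentions are assumptions, not the corpus's actual file format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class SACEM(Enum):
    """The seven mention classes listed in the abstract."""
    ABBREVIATION = "abbreviation"
    FAMILY = "family"
    FORMULA = "formula"
    IDENTIFIER = "identifier"
    MULTIPLE = "multiple"
    SYSTEMATIC = "systematic"
    TRIVIAL = "trivial"


@dataclass(frozen=True)
class Mention:
    """One annotated chemical mention (hypothetical field layout)."""
    doc_id: str
    start: int
    end: int
    text: str
    sacem: SACEM


def class_counts(mentions):
    """Count mentions per SACEM class."""
    return Counter(m.sacem for m in mentions)


# Made-up example mentions, not taken from the corpus.
mentions = [
    Mention("doc1", 0, 7, "aspirin", SACEM.TRIVIAL),
    Mention("doc1", 20, 23, "H2O", SACEM.FORMULA),
    Mention("doc2", 5, 14, "ibuprofen", SACEM.TRIVIAL),
]
counts = class_counts(mentions)
```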

NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature

Scientific Data, 2021

Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.

Mining lexico-syntactic patterns to extract chemical entities with their associated properties

2013 International Conference on Computing, Electrical and Electronic Engineering (ICCEEE), 2013

Specific information on newly discovered compounds is often difficult to find in chemical databases. The chemical and drug literature is rich in information resulting from new chemical syntheses. This paper surveys the approaches that have been used to extract information associated with chemical compounds from chemical and drug text. It then describes a novel pattern-based extraction method, to be developed in future work, that targets specific types of compound-associated information not previously explored in automated text extraction. The paper focuses on extracting the properties that influence the bioavailability of drug candidate compounds. The results of this study can help database curators compile drug-related chemical databases and help researchers digest the rapidly growing volume of textual information. Index Terms: information extraction, chemical compounds, chemical databases, pattern-based approach.

NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles

Database, 2022

The automatic recognition of chemical names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The task is even more challenging when considering the identification of these entities in the article's full text and, furthermore, the identification of candidate substances for that article's metadata [Medical Subject Heading (MeSH) article indexing]. The National Library of Medicine (NLM)-Chem track at BioCreative VII aimed to foster the development of algorithms that can accurately predict the chemical entities in the biomedical literature and further identify the chemical substances that are candidates for article indexing. As a result of this challenge, the NLM-Chem track produced two comprehensive, manually curated corpora annotated with chemical entities and indexed with chemical substances: the chemical identification corpus and the chemical indexing corpus. The NLM-Chem BioCreative VII (NLM-Chem-BC7) Chemical Identification corpus consists of 204 full-text PubMed Central (PMC) articles, fully annotated for chemical entities by 12 NLM indexers for both span (i.e. named entity recognition) and normalization (i.e. entity linking) using MeSH. This resource was used for the training and testing of the Chemical Identification task to evaluate the accuracy of algorithms in predicting chemicals mentioned in recently published full-text articles. The NLM-Chem-BC7 Chemical Indexing corpus consists of 1333 recently published PMC articles, manually indexed for chemical substances by experts at the NLM. This resource was used for the evaluation of the Chemical Indexing task, which evaluated the accuracy of algorithms in predicting the chemicals that should be indexed, i.e. appear in the listing of MeSH terms for the document.
This set was further enriched after the challenge in two ways: (i) 11 NLM indexers manually verified each of the candidate terms appearing in the prediction results of the challenge participants, but not in the MeSH indexing, and the chemical indexing terms appearing in the MeSH indexing list, but not in the prediction results, and (ii) the challenge organizers algorithmically merged the chemical entity annotations in the full text for all predicted chemical entities and used a statistical approach to keep those with the highest degree of confidence. As a result, the NLM-Chem-BC7 Chemical Indexing corpus is a gold-standard corpus for chemical indexing of journal articles and a silver-standard corpus for chemical entity identification in full-text journal articles. Together, these resources are currently the most comprehensive resources for chemical entity recognition, and we demonstrate improvements in the chemical entity recognition algorithms. We detail the characteristics of these novel resources and make them available for the community.
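
Evaluating span recognition and indexing predictions against such gold annotations typically comes down to set overlap. The sketch below computes precision, recall and F1 under strict matching; the strict-match criterion, the example spans and the placeholder identifiers are assumptions, not the official scoring procedure of the challenge.

```python
def prf(gold, predicted):
    """Precision, recall and F1 between two collections of hashable items."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Span recognition: items are (doc_id, start, end); only exact boundaries count.
gold_spans = {("doc1", 0, 7), ("doc1", 12, 21), ("doc1", 40, 43)}
pred_spans = {("doc1", 0, 7), ("doc1", 40, 43), ("doc1", 50, 55)}
span_scores = prf(gold_spans, pred_spans)

# Indexing: items are (doc_id, identifier); "ID_A" and "ID_B" are placeholders,
# not real MeSH identifiers.
gold_index = {("doc1", "ID_A"), ("doc1", "ID_B")}
pred_index = {("doc1", "ID_A")}
index_scores = prf(gold_index, pred_index)
```

The same function serves both tasks because each reduces to comparing a predicted set with a gold set; only the definition of an item changes.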