Automated recognition of malignancy mentions in biomedical literature
Related papers
Machine Learning Approach for Cancer Entities Association and Classification
arXiv (Cornell University), 2023
According to the World Health Organization (WHO), cancer is the second leading cause of death globally. Scientific research on different types of cancers grows at an ever-increasing rate, publishing large volumes of research articles every year. Insight into and knowledge of the drugs, diagnostics, risks, symptoms, treatments, etc., related to genes are significant factors that help explore and advance cancer research. Manually screening such a large volume of articles to formulate a hypothesis is laborious and time-consuming. The study uses two non-trivial Natural Language Processing (NLP) functions, named entity recognition and text classification, to discover knowledge from biomedical literature. Named Entity Recognition (NER) recognizes and extracts the predefined entities related to cancer from unstructured text with the support of a user-friendly interface and built-in dictionaries. Text classification helps to explore the insights into the text and simplifies data categorization, querying, and article screening. Machine learning classifiers are also used to build the classification model, and Structured Query Language (SQL) is used to identify hidden relations that may lead to significant predictions.
Identifying and extracting malignancy types in cancer literature
2005
Summary: MTag is an application for identifying and extracting clinical descriptions of malignancy presented in text. The application uses the machine learning technique Conditional Random Fields and incorporates domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Our experiments resulted in 0.85 precision, 0.82 recall, and 0.83 F-measure on the evaluation set. Availability: The software is available at http://bioie.ldc.upenn.edu/index.
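The reported scores can be cross-checked with a few lines of Python. This is a minimal sketch that assumes the abstract's "F-measure" is the balanced F1 score (the harmonic mean of precision and recall); the function name `f_measure` and the `beta` parameter are illustrative and do not come from MTag.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        # Avoid dividing by zero when nothing was predicted correctly.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract above.
mtag_f1 = f_measure(0.85, 0.82)
print(round(mtag_f1, 2))  # 0.83
```

Under that assumption, the rounded result agrees with the 0.83 F-measure stated in the abstract.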
Text mining of cancer-related information: Review of current status and future directions
International Journal of Medical Informatics, 2014
This paper reviews the research literature on text mining (TM) with the aim to find out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and suggest future directions in TM development to support cancer research. Methods: A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar. Results: A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve the performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM. Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. 
This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports.
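The dictionary-lookup style of NER that the review describes can be illustrated roughly as follows. This is only a sketch under stated assumptions: the lexicon, its labels, and the function names `build_pattern` and `tag` are hypothetical, and none of the reviewed systems is claimed to work exactly this way.

```python
import re

# Hypothetical toy lexicon; a real system would load terms from a curated
# vocabulary rather than hard-code them.
LEXICON = {
    "carcinoma": "MALIGNANCY",
    "non-small cell lung cancer": "MALIGNANCY",
    "lung cancer": "MALIGNANCY",
    "biopsy": "PROCEDURE",
}


def build_pattern(terms):
    # Put longer terms first so the alternation prefers the longest match.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # The lookarounds stop matches that start or end inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


PATTERN = build_pattern(LEXICON)


def tag(text):
    """Return (start, end, surface form, label) for each lexicon hit."""
    return [
        (m.start(), m.end(), m.group(0), LEXICON[m.group(0).lower()])
        for m in PATTERN.finditer(text)
    ]


mentions = tag("Biopsy confirmed non-small cell lung cancer.")
for start, end, surface, label in mentions:
    print(start, end, surface, label)
```

Exact matching of this kind misses any variant that is absent from the lexicon: `tag("carcinomas")` returns an empty list here. Spelling errors and non-standard abbreviations, which the review names as obstacles, fail in the same way.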
An improved corpus of disease mentions in PubMed citations
aclweb.org
The latest discoveries on diseases and their diagnosis/treatment are mostly disseminated in the form of scientific publications. However, with the rapid growth of the biomedical literature and a high level of variation and ambiguity in disease names, the task of retrieving disease-related articles becomes increasingly challenging using the traditional keyword-based approach. An important first step for any disease-related information extraction task in the biomedical literature is the disease mention recognition task. However, despite the strong interest, there has not been enough work done on disease name identification, perhaps because of the difficulty in obtaining adequate corpora. Towards this aim, we created a large-scale disease corpus consisting of 6900 disease mentions in 793 PubMed citations, derived from an earlier corpus. Our corpus contains rich annotations, was developed by a team of 12 annotators (two people per annotation) and covers all sentences in a PubMed abstract. Disease mentions are categorized into Specific Disease, Disease Class, Composite Mention and Modifier categories. When used as the gold standard data for a state-of-the-art machine-learning approach, significantly higher performance can be found on our corpus than the previous one. Such characteristics make this disease name corpus a valuable resource for mining disease-related information from biomedical text. The NCBI corpus is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html.
Large-Scale Application of Named Entity Recognition to Biomedicine and Epidemiology
Background: Despite significant advancements in biomedical named entity recognition methods, the clinical application of these systems continues to face many challenges: (1) most of the methods are trained on a limited set of clinical entities; (2) these methods are heavily reliant on a large amount of data for both pretraining and prediction, making their use in production impractical; (3) they do not consider non-clinical entities, which are also related to a patient’s health, such as social, economic or demographic factors. Methods: In this paper, we develop Bio-Epidemiology-NER (https://pypi.org/project/Bio-Epidemiology-NER/), an open-source Python package for detecting biomedical named entities from text. The approach is Transformer-based and trained on a dataset that is annotated with many named entities (medical, clinical, biomedical and epidemiological). This approach improves on previous efforts in three ways: (1) it recognizes many clinical entity types, suc...
International Journal of Business Intelligence and Data Mining, 2016
The paper presents a novel application to extract biomedical entities automatically from large volumes of biomedical text using machine learning techniques. Data are accumulating in large quantities day by day and require automatic extraction of information. Data mining is the science of extracting information from large data. Biomedical named entity recognition (BioNER) is the data mining task that extracts named entities from biological texts. In this paper, we focus on developing a BioNER system for extraction of biological target, disease and chemical entities from biomedical texts. We developed the system using a graphical-model-based machine learning technique, Conditional Random Fields (CRFs). We applied a diverse set of features containing standard lexical, syntactic and orthographic features combined with novel, biologically inspired features: action terms and process verbs. The system was evaluated on three widely recognised datasets. The results demonstrated the portability and the potency of the system.
IberLEF@SEPLN, 2020
Cancer still represents one of the leading causes of death worldwide, resulting in a considerable healthcare impact. Recent research efforts from the clinical and molecular oncology scientific communities were able to considerably increase the life expectancy of patients for some cancer types. Most of the current cancer diagnoses are primarily determined by pathology laboratories, providing an essential source of information to guide the treatment of patients with cancer. Pathology observations essentially characterize the results of microscopic or macroscopic studies of cells or tissues following a biopsy or surgery. Clinicians and researchers alike require systems that automatically detect, read and generate structured data representations from pathology examinations. The resulting structured or coded clinical information, normalized using controlled vocabularies like the ICD-O or SNOMED-CT, is critical for large-scale analysis of specific tumor types or to determine response to specific treatments or prognosis. Text mining and NLP approaches are showing promising results in transforming medical text into useful clinical information, bridging the gap between free text and structured representation of clinical information. Nonetheless, in the case of cancer text mining approaches, most efforts were exclusively focused on medical records in English. Moreover, due to the lack of high-quality manually labeled clinical texts annotated by oncology experts, most previous efforts, even for English, relied mainly on customized dictionaries of names or rules to recognize clinical concept mentions, despite the promising results of advanced deep learning technologies. To address these issues we have organized the Cantemist (CANcer TExt Mining Shared Task) track at IberLEF 2020.
It represents the first community effort to evaluate and promote the development of resources for named entity recognition, concept normalization and clinical coding specifically focusing on cancer data in Spanish. Evaluation of participating systems was done using the Cantemist corpus, a publicly accessible dataset (together with annotation consistency analysis and guidelines) of manually annotated mentions of tumor morphology entities and their mappings to the Spanish version of ICD-O. We received a total of 121 systems or runs from 25 teams for one of the three Cantemist sub-tasks, obtaining very competitive results. Most participants implemented sophisticated AI approaches, mainly deep learning algorithms based on Long Short-Term Memory units and language models (BERT, BETO, RoBERTa, etc.) with a classifier layer such as a Conditional Random Field. In addition to using pre-trained language models, word and character embeddings were also explored.
Assessment of disease named entity recognition on a corpus of annotated sentences
BMC Bioinformatics, 2008
In recent years, the recognition of semantic types from the biomedical scientific literature has been focused on named entities like protein and gene names (PGNs) and gene ontology terms (GO terms). Other semantic types like diseases have not received the same level of attention. Different solutions have been proposed to identify disease named entities in the scientific literature. While matching the terminology with language patterns suffers from low recall (e.g., Whatizit), other solutions make use of morpho-syntactic features to better cover the full scope of terminological variability (e.g., MetaMap). Currently, MetaMap, which is provided by the National Library of Medicine (NLM), is the state-of-the-art solution for the annotation of concepts from UMLS (Unified Medical Language System) in the literature. Nonetheless, its performance has not yet been assessed on an annotated corpus. In addition, little effort has been invested so far to generate an annotated dataset that links disease entities in text to disease entries in a database, thesaurus or ontology and that could serve as a gold standard to benchmark text mining solutions.
Disease Named Entity Recognition Using NCBI Corpus
2016
Named Entity Recognition (NER) in biomedical literature is a very active research area. NER is a crucial component of biomedical text mining because it allows for information retrieval, reasoning and knowledge discovery. Much research has been carried out in this area using semantic type categories, such as “DNA”, “RNA”, “proteins” and “genes”. However, disease NER has not received its needed attention yet, specifically human disease NER. Traditional machine learning approaches lack the precision for disease NER, due to their dependence on token level features, sentence level features and the integration of features, such as orthographic, contextual and linguistic features. In this paper a method for disease NER is proposed which utilizes sentence and token level features based on Conditional Random Fields using the NCBI disease corpus. Our system utilizes rich features including orthographic, contextual, affixes, bigrams, part of speech and stem based features. Using these feature ...
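Token-level features of the kinds this abstract lists (orthographic, affix, bigram, contextual) are commonly computed as one dictionary per token before being handed to a CRF trainer. The sketch below is illustrative only: the feature names, the three-character affix length, and the `<BOS>`/`<EOS>` sentinels are assumptions rather than the paper's actual feature set. Part-of-speech and stem features are left out because they would need an external tagger and stemmer.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    tok = tokens[i]
    low = tok.lower()
    # Sentinels mark the sentence boundaries for the context features.
    prev_low = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_low = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        # Orthographic features
        "lower": low,
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "has_hyphen": "-" in tok,
        # Affix features (assumed length of three characters)
        "prefix3": low[:3],
        "suffix3": low[-3:],
        # Contextual and bigram features
        "prev_lower": prev_low,
        "next_lower": next_low,
        "bigram_prev": prev_low + " " + low,
        "bigram_next": low + " " + next_low,
    }


def sentence_features(tokens):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


sentence = "Patients with BRCA1 mutations develop breast carcinoma".split()
features = sentence_features(sentence)
print(features[2])
```

A list of such dictionaries per sentence is the input shape that several CRF toolkits accept, so a sketch like this could be paired with a sequence labeler of the reader's choice.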
Mining biomedical information from scientific literature
2013
The rapid evolution and proliferation of a world-wide computerized network, the Internet, resulted in an overwhelming and constantly growing amount of publicly available data and information, a fact that was also verified in biomedicine. However, the lack of structure of textual data inhibits its direct processing by computational solutions. Information extraction is the task of text mining that intends to automatically collect information from unstructured text data sources. The goal of the work described in this thesis was to build innovative solutions for biomedical information extraction from scientific literature, through the development of simple software artifacts for developers and biocurators, delivering more accurate, usable and faster results. We started by tackling named entity recognition, a crucial initial task, with the development of Gimli, a machine-learning-based solution that follows an incremental approach to optimize extracted linguistic characteristics for each c...