Terminology spectrum analysis of natural-language chemical documents: term-like phrases retrieval routine

Identifying and classifying terms in the life sciences: The case of chemical terminology

Facing the huge amount of textual and terminological data in the life sciences, we present a theoretical basis for the linguistic analysis of chemical terms. Starting with organic compound names, we conduct a morpho-semantic deconstruction into morphemes and yield a semantic representation of the terms' functional and structural properties. These semantic representations imply both the molecular structure of the named molecules and their class membership. A crucial feature of this analysis, which distinguishes it from all similar existing systems, is its ability to deal with terms that do not fully specify a structure as well as terms for generic classes of chemical compounds. Such 'underspecified' terms occur very frequently in scientific literature. Our approach will serve for the support of manual database curation and as a basis for text processing applications.

Analysis of biomedical text for chemical names: a comparison of three methods

Proceedings / AMIA ... Annual Symposium. AMIA Symposium, 1999

At the National Library of Medicine (NLM), a variety of biomedical vocabularies are found in data pertinent to its mission. In addition to standard medical terminology, there are specialized vocabularies including that of chemical nomenclature. Normal language tools including the lexically based ones used by the Unified Medical Language System (UMLS) to manipulate and normalize text do not work well on chemical nomenclature. In order to improve NLM's capabilities in chemical text processing, two approaches to the problem of recognizing chemical nomenclature were explored. The first approach was a lexical one and consisted of analyzing text for the presence of a fixed set of chemical segments. The approach was extended with general chemical patterns and also with terms from NLM's indexing vocabulary, MeSH, and the NLM SPECIALIST lexicon. The second approach applied Bayesian classification to n-grams of text via two different methods. The single lexical method and two statisti...
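The lexical approach described in this abstract, scanning text for a fixed set of chemical segments, can be illustrated with a small sketch. The segment list, the character-coverage score, and the 0.6 threshold below are all assumptions chosen for illustration; the abstract does not state the actual segment set or decision rule used at NLM.

```python
# Hypothetical, illustrative segment list; NOT the set used in the paper.
CHEMICAL_SEGMENTS = (
    "meth", "eth", "prop", "chlor", "fluor", "hydr",
    "oxy", "amin", "benz", "phen", "yl", "ol", "ide",
)


def segment_coverage(token: str) -> float:
    """Return the fraction of characters in `token` covered by known segments."""
    word = token.lower()
    if not word:
        return 0.0
    covered = [False] * len(word)
    for seg in CHEMICAL_SEGMENTS:
        start = word.find(seg)
        while start != -1:
            # Mark every character of this occurrence as covered.
            for i in range(start, start + len(seg)):
                covered[i] = True
            start = word.find(seg, start + 1)
    return sum(covered) / len(word)


def looks_chemical(token: str, threshold: float = 0.6) -> bool:
    """Flag a token as chemical-like when enough of it is made of segments.

    The threshold is an assumed value, not one taken from the paper.
    """
    return segment_coverage(token) >= threshold
```

With these assumed segments, a call such as looks_chemical("methanol") flags the token, while an ordinary word such as "patient" is not flagged. A realistic system would need a far larger segment inventory and, as the abstract notes, additional patterns and vocabulary sources.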

Information Retrieval and Text Mining Technologies for Chemistry

Chemical Reviews, 2017

Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

Factors affecting the effectiveness of biomedical document indexing and retrieval based on terminologies

Artificial Intelligence in Medicine, 2013

The aim of this work is to evaluate a set of indexing and retrieval strategies based on the integration of several biomedical terminologies on the available TREC Genomics collections for an ad hoc information retrieval (IR) task. Materials and methods: We propose a multi-terminology based concept extraction approach that selects the best concepts from free text by means of voting techniques. We instantiate this general approach on four terminologies (MeSH, SNOMED, ICD-10 and GO). We particularly focus on the effect of integrating terminologies into a biomedical IR process, and on the utility of using voting techniques for combining the concepts extracted from each document in order to provide a list of unique concepts. Results: Experimental studies conducted on the TREC Genomics collections show that our multi-terminology IR approach based on voting techniques yields statistically significant improvements over the baseline. For example, tested on the 2005 TREC Genomics collection, our multi-terminology based IR approach provides an improvement rate of +6.98% in terms of MAP (mean average precision) (p < 0.05) compared to the baseline. In addition, our experimental results show that document expansion using preferred terms, in combination with query expansion using terms from top-ranked expanded documents, improves biomedical IR effectiveness. Conclusion: We have evaluated several voting models for combining concepts drawn from multiple terminologies. Through this study, we presented many factors affecting the effectiveness of biomedical IR systems, including term weighting, query expansion, and document expansion models. An appropriate combination of those factors could be useful to improve IR performance.
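The idea of merging concepts extracted with several terminologies into one list of unique concepts can be sketched as follows. The abstract does not name the voting models evaluated, so this sketch assumes a CombMNZ-style rule (summed score multiplied by the number of terminologies that proposed the concept). The function name, the input layout, and the case-insensitive label matching are hypothetical choices made for illustration.

```python
from collections import defaultdict


def combine_by_voting(extractions):
    """Merge per-terminology concept lists into one ranked list of unique concepts.

    `extractions` maps a terminology name to a list of (concept_label, score)
    pairs. Labels are matched case-insensitively, which is an assumption;
    a real system would map concepts through identifiers instead.
    """
    totals = defaultdict(float)  # summed score per concept
    votes = defaultdict(int)     # number of terminologies voting for it
    for concepts in extractions.values():
        for label, score in concepts:
            key = label.strip().lower()
            totals[key] += score
            votes[key] += 1
    # CombMNZ-style fusion: summed score times number of votes.
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(totals, key=lambda k: (-totals[k] * votes[k], k))


if __name__ == "__main__":
    example = {
        "MeSH": [("Neoplasms", 0.9), ("Apoptosis", 0.5)],
        "SNOMED": [("neoplasms", 0.7)],
        "GO": [("apoptosis", 0.6), ("cell cycle", 0.8)],
    }
    print(combine_by_voting(example))
```

Concepts proposed by more than one terminology rise in the ranking, which is the intuition behind using votes rather than a single terminology's scores.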

Towards the enrichment of terminological resources by scientific corpora analysis

2015

The research presented in this paper explores the possibility of enriching terminological databases through the analysis of recent scientific publications. Our main concern is to evaluate how useful automatic term extraction can be to a human expert. To carry out our experiment, we constructed two corpora of recent scientific papers in two different sub-domains of the bio-medical sciences. Then we proceeded with three steps: automatic term extraction and ranking from a set of corpora of scientific papers; evaluation of the overlap of the candidate terms (CTs) extracted from the corpora and those present in the multidisciplinary terminology portal TermSciences; and evaluation by domain experts of the three sets of the top 200 CTs extracted from the different corpora. To extract terms we used the Sensunique Platform, a web based platform for building terminological resources. Our results show that only about 10% of the extracted CTs are present in the TermSciences resource, which mean...

Towards a terminological resource for biomedical text mining

2006

One of the main challenges in biomedical text mining is the identification of terminology, which is a key factor for accessing and integrating the information stored in literature. Manual creation of biomedical terminologies cannot keep pace with the data that becomes available. Still, many of them have been used in attempts to recognise terms in literature, but their suitability for text mining has been questioned as substantial re-engineering is needed to tailor the resources for automatic processing. Several approaches have been suggested to automatically integrate and map between resources, but the problems of extensive variability of lexical representations and ambiguity have been revealed. In this paper we present a methodology to automatically maintain a biomedical terminological database, which contains automatically extracted terms, their mutual relationships, features and possible annotations that can be useful in text processing. In addition to TermDB, a database used for terminology management and storage, we present the following modules that are used to populate the database: TerMine (recognition, extraction and normalisation of terms from literature), AcroTerMine (extraction and clustering of acronyms and their long forms), AnnoTerm (annotation and classification of terms), and ClusTerm (extraction of term associations and clustering of terms).

Term extraction and correlation analysis based on massive scientific and technical literature

International Journal of Computational Science and Engineering, 2017

Scientific and technical terms are the basic units of knowledge discovery and knowledge organisation. Correlation analysis is one of the important technologies for deep data mining of massive, heterogeneous scientific and technical literature. Based on freely available digital library resources, this study uses natural language processing to analyse the linguistic characteristics of terms and combines this with statistical analysis to extract terms from scientific and technical literature. Using the results of term extraction, the paper proposes an improved VSM algorithm for calculating the correlation between different scientific and technical documents. According to the experimental results, this offers a new way to automatically extract terms and perform correlation analysis across massive scientific and technical literature. Our method is superior to a method that does not adopt linguistic rules and MI calculation. The accuracy of term extraction is about 73.5%. Compared with the traditional term-based VSM, the correct rate of correlation calculation is increased by 12%.
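For orientation, the traditional term-based VSM that this abstract uses as its point of comparison can be sketched as cosine similarity between term-frequency vectors. This is the plain baseline only; the abstract does not describe what the "improved VSM" changes, so nothing here should be read as the authors' algorithm. The function name and the use of raw term counts as weights are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two documents given as lists of extracted terms.

    Each document becomes a vector of raw term frequencies (an assumed
    weighting; TF-IDF or other schemes are equally possible).
    """
    vec_a, vec_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)
```

Two documents sharing one of their two terms, for example ["term extraction", "vsm"] and ["term extraction", "ontology"], score 0.5 under this baseline.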

Extraction and search of chemical formulae in text documents on the web

Proceedings of the 16th international conference on World Wide Web - WWW '07, 2007

Often scientists seek to search for articles on the Web related to a particular chemical. When a scientist searches for a chemical formula using a search engine today, she gets articles where the exact keyword string expressing the chemical formula is found. Searching for the ...
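To make the formula-extraction problem concrete, here is a minimal sketch that picks out formula-like tokens with a regular expression and an element-symbol check. The visible part of the abstract does not describe the paper's extraction method, so this is only an assumed, simplistic baseline. The element subset, the pattern, and the function name are hypothetical.

```python
import re

# Small illustrative subset of element symbols; a real system would use all of them.
ELEMENTS = {"H", "C", "N", "O", "S", "P", "F", "I", "K",
            "Cl", "Na", "Ca", "Fe", "Br", "Mg"}

# A candidate is a whole word made of two or more "Symbol + optional count" parts.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")
PART = re.compile(r"([A-Z][a-z]?)(\d*)")


def find_formulae(text):
    """Return formula-like tokens whose parts are all known element symbols."""
    found = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(0)
        symbols = [symbol for symbol, _count in PART.findall(candidate)]
        if all(symbol in ELEMENTS for symbol in symbols):
            found.append(candidate)
    return found


if __name__ == "__main__":
    print(find_formulae("Dissolve NaCl in H2O, then add CH3OH."))
```

A purely pattern-based check like this is ambiguous: a token such as "OK" also parses as two element symbols, so deciding whether a string really denotes a chemical needs context beyond the string itself.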

INFORMATION EXTRACTION METHODS AND EXTRACTION TECHNIQUES IN THE CHEMICAL DOCUMENT'S CONTENTS: SURVEY

2020

The volume of electronic documents has rapidly increased and the scientific literature has increased too. These huge documents contain considerable information, but it has to be retrieved and managed in a constructive and useful way. Information Extraction (IE) is the field of extracting useful information using different methods and approaches by means of Natural Language Processing (NLP). Researchers still continue to try to identify proper methods to extract information from texts, such as opinions on the internet, medical data, clinical reports, medical reports, notes, papers, patents, etc. Recently a new trend to expand working in IE is taking place by enriching the extraction process to include the extraction of information from images and videos. In this paper, the classification of IE tasks is discussed, as well as the proposed methods and techniques of IE from chemical documents. A more focused approach is then taken into consideration regarding biomedical language processi...

Mining Biomedical Abstracts: What’s in a Term?

Lecture Notes in Computer Science, 2005

In this paper we present a study of the usage of terminology in biomedical literature, with the main aim to indicate phenomena that can be helpful for automatic term recognition in the domain. Our comparative analysis is based on the terminology used in the Genia corpus. We analyse the usage of ordinary biomedical terms as well as their variants (namely inflectional and orthographic alternatives, terms with prepositions, coordinated terms, etc.), showing the variability and dynamic nature of terms used in biomedical abstracts. Term coordination and terms containing prepositions are analysed in detail. We show that there is a discrepancy between terms used in literature and terms listed in controlled dictionaries. We also evaluate the effectiveness of incorporating different types of term variation into an automatic term recognition system.