Biomedical Text Mining Applied To Document Retrieval and Semantic Indexing

BioDR: Semantic indexing networks for biomedical document retrieval

Expert Systems With Applications, 2010

In biomedical research, retrieving documents that match a query of interest is a frequent task. Typically, the result set is extensive, contains many irrelevant documents, and is presented as a flat list, i.e., not organized or indexed in any way. This work proposes BioDR, a novel approach that semantically indexes the results of a query by identifying relevant terms in the documents. These terms emerge from a Named Entity Recognition process that annotates occurrences of biological terms (e.g., genes or proteins) in abstracts or full texts. The system is based on a learning process that builds an Enhanced Instance Retrieval Network (EIRN) from a set of documents manually classified by their relevance to a given problem. The resulting EIRN implements the semantic indexing of documents and terms, enabling enhanced navigation and visualization tools as well as the assessment of relevance for new documents.

Indexing Biomedical Documents With a Possibilistic Network

In this article, we propose a new approach for indexing biomedical documents based on a possibilistic network that carries out partial matching between documents and biomedical vocabulary. The main contribution of our approach is to deal with the imprecision and uncertainty of the indexing task using possibility theory. We enhance the estimation of the similarity between a document and a given concept using the two measures of possibility and necessity. Possibility estimates the extent to which a document is not similar to the concept. The second measure, necessity, can provide confirmation that the document is similar to the concept. Our contribution also reduces the limitation of partial matching. Although partial matching allows extracting from the document term variants other than those in dictionaries, it also generates irrelevant information. Our objective is to filter the index using the knowledge provided by the Unified Medical Language System®. Experiments were carried out on different corpora, showing encouraging results (an improvement of +26.37% in mean average precision compared with the baseline).
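For orientation, the two measures are linked by a duality relation in standard possibility theory. The notation below is an assumption made for illustration ($\pi$ is a normalized possibility distribution over a set of outcomes and $A$ is an event); the abstract does not give the formulas used in the network itself.

```latex
\Pi(A) = \sup_{x \in A} \pi(x), \qquad
N(A) = 1 - \Pi(\bar{A}), \qquad
N(A) \le \Pi(A)
```

Read this way, a low possibility rules an event out, whereas a positive necessity confirms it, which is consistent with the roles the abstract assigns to the two measures.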

A survey of current work in biomedical text mining

The volume of published biomedical research, and therefore the underlying biomedical knowledge base, is expanding at an increasing rate. Among the tools that can aid researchers in coping with this information overload are text mining and knowledge extraction. Significant progress has been made in applying text mining to named entity recognition, text classification, terminology extraction, relationship extraction and hypothesis generation. Several research groups are constructing integrated flexible text-mining systems intended for multiple uses. The major challenge of biomedical text mining over the next 5-10 years is to make these systems useful to biomedical researchers. This will require enhanced access to full text, better understanding of the feature space of biomedical literature, better methods for measuring the usefulness of systems to users, and continued cooperation with the biomedical research community to ensure that their needs are addressed.

Mining the biomedical literature using semantic analysis and natural language processing techniques

BIOSILICO, 2003

The information age has made the electronic storage of large amounts of data effortless. The proliferation of documents available on the Internet, corporate intranets, news wires and elsewhere is overwhelming. Search engines only exacerbate this overload problem by making ever more documents available in only a few keystrokes. This information overload also exists in the biomedical field, where scientific publications and other forms of text-based data are produced at an unprecedented rate. Text mining is the combined, automated process of analyzing unstructured, natural language text to discover information and knowledge that are typically difficult to retrieve. Here, we focus on text mining as applied to the biomedical literature. We focus in particular on finding relationships among genes, proteins, drugs and diseases, to facilitate an understanding and prediction of complex biological processes. The LitMiner™ system, developed specifically for this purpose, is described in relation to the Knowledge Discovery and Data Mining Cup 2002, which serves as a formal evaluation of the system.

Information retrieval and knowledge discovery in biomedical text : papers from the AAAI Fall Symposium

AAAI Press eBooks, 2012

Before undertaking new biomedical research, identifying concepts that have already been patented is essential. A traditional keyword-based search on patent databases may not be sufficient to retrieve all the relevant information, especially for the biomedical domain. This paper presents BioPatentMiner, a system that facilitates information retrieval and knowledge discovery from biomedical patents. The system first identifies biological terms and relations from the patents and then integrates the information from the patents with knowledge from biomedical ontologies to create a Semantic Web. Besides keyword search and queries linking the properties specified by one or more RDF triples, the system can discover semantic associations between the Web resources. The system also determines the importance of the resources to rank the results of a search and prevent information overload while determining the semantic associations.

Five Steps to Text Mining in Biomedical Literature

In this paper we discuss our plans and progress on analysing and integrating various methods of text mining in biomedical literature for a Ph.D. project. The framework is a client-based search engine that integrates different machine learning and text mining techniques. For convenience, all text mining procedures have been split into five steps, so that each step can be optimised individually while the rest of the procedure stays untouched. Because the steps share common interfaces, they are interchangeable, which simplifies an objective comparison between them.
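The idea of interchangeable steps behind a common interface can be sketched as follows. This is a minimal illustration, not the framework described in the abstract: the abstract does not name the five steps, so the step functions here (`lowercase`, `keep_gene_mentions`, `keep_protein_mentions`) and the `run_pipeline` helper are hypothetical.

```python
from typing import Callable, List, Sequence

# Assumed common interface: every step maps a list of documents
# to a list of documents, so any step can be swapped for another.
Step = Callable[[List[str]], List[str]]


def run_pipeline(steps: Sequence[Step], documents: List[str]) -> List[str]:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        documents = step(documents)
    return documents


def lowercase(docs: List[str]) -> List[str]:
    """Hypothetical normalisation step."""
    return [d.lower() for d in docs]


def keep_gene_mentions(docs: List[str]) -> List[str]:
    """Hypothetical filtering step: keep documents mentioning 'gene'."""
    return [d for d in docs if "gene" in d]


def keep_protein_mentions(docs: List[str]) -> List[str]:
    """Hypothetical alternative filter with the same interface."""
    return [d for d in docs if "protein" in d]


if __name__ == "__main__":
    corpus = ["The GENE is expressed", "A protein binds DNA"]
    # Two pipeline variants that differ in a single step.
    print(run_pipeline([lowercase, keep_gene_mentions], corpus))
    print(run_pipeline([lowercase, keep_protein_mentions], corpus))
```

Because both filters accept and return the same type, the two variants can be compared on the same corpus without touching the other steps.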

Construction of Gene Correlation Networks and Text Classification via Biomedical Literature Mining

Automatic extraction of information from biomedical texts has become a necessity given the massive and growing amount of relevant scientific literature. A special feature that makes this task more challenging is the over-abundance and heterogeneity of the relevant gene/protein terminology. In this paper we introduce a novel term-identification process and propose an effective data structure based on TRIE trees. It enables the storage of millions of biomedical terms and reflects their semantic relations in a compressed and memory-efficient way. Gene-gene and gene-disease correlations are induced using the entropic Mutual Information Measure. Moreover, we introduce a novel text classification process that utilizes the term-identification process and a novel similarity matching metric. The induced correlation networks reveal valuable biomedical information. Text classification results exhibit high accuracy, in the range of 90 to 97.5%, indicating the reliability of the whole approach.
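A rough sketch of the two ingredients named above, a trie for term identification and mutual information for term correlation, might look like this. It is an assumption-laden illustration rather than the paper's method: `TermTrie`, `find_terms`, and `mutual_information` are hypothetical names, matching is done on whitespace tokens without punctuation handling, and the mutual information is the textbook definition over binary term presence per document, which may differ from the paper's exact measure.

```python
import math
from collections import Counter

_END = object()  # sentinel key marking the end of a complete term


class TermTrie:
    """Token-level trie for multi-word terms (hypothetical helper)."""

    def __init__(self):
        self.root = {}

    def add(self, term, label):
        node = self.root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[_END] = label

    def find_terms(self, text):
        """Return labels of the longest dictionary terms found in text."""
        tokens = text.lower().split()
        found, i = [], 0
        while i < len(tokens):
            node, j, match = self.root, i, None
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if _END in node:
                    match = (node[_END], j)  # remember the longest match so far
            if match:
                found.append(match[0])
                i = match[1]  # continue after the matched term
            else:
                i += 1
        return found


def mutual_information(doc_terms, a, b):
    """Mutual information (in bits) between the presence of a and of b."""
    n = len(doc_terms)
    joint = Counter((a in terms, b in terms) for terms in doc_terms)
    mi = 0.0
    for (xa, xb), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (ka, _), c in joint.items() if ka == xa) / n
        p_y = sum(c for (_, kb), c in joint.items() if kb == xb) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


if __name__ == "__main__":
    trie = TermTrie()
    trie.add("BRCA1", "GENE:BRCA1")
    trie.add("breast cancer", "DISEASE:breast cancer")
    docs = ["BRCA1 mutations raise breast cancer risk", "no known terms here"]
    doc_terms = [set(trie.find_terms(d)) for d in docs]
    print(doc_terms)
    print(mutual_information(doc_terms, "GENE:BRCA1", "DISEASE:breast cancer"))
```

Shared prefixes such as "breast cancer" and "breast cancer type 1" occupy a single path in the trie, which is the kind of compression the abstract alludes to.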

Text-mining approaches in molecular biology and biomedicine

Drug Discovery Today, 2005

Biomedical articles provide functional descriptions of bioentities such as chemical compounds and proteins. To extract relevant information using automatic techniques, text-mining and information-extraction approaches have been developed. These technologies have a key role in integrating biomedical information through analysis of scientific literature. In this article, important applications such as the identification of biologically relevant entities in free text and the construction of literature-based networks of protein-protein interactions will be introduced. Also, the use of text mining to aid the interpretation of microarray data and the analysis of pathology reports will be discussed. Finally, we will consider the recent evolution of this field and the efforts for community-based evaluations.

Intelligent Agent System for Bio-medical Literature Mining

2007 International Conference on Information and Communication Technology, 2007

Email: mislam@micros.com, dbollina@bio.mq.edu.au, abhay@ics.mq.edu.au, shoba@els.mq.edu.au

Abstract: With the advances of World Wide Web technology, advanced research in the bioinformatics and systems biology domains has highlighted the increasing need for automatic Information Extraction (IE) systems to extract information from scientific literature databases. Extraction of scientific information from biomedical articles is a central task for supporting biomarker discovery efforts. In this paper, we propose an algorithm capable of extracting scientific information on biomarkers such as gene, genome, disease, allele, cell, etc. from text by finding the focal topic of the document and extracting the most relevant properties of that topic. The topic and its properties are represented as semantic networks and then stored in a database. This IE algorithm extracts the most important biological terms and relations using statistical and pattern-matching NLP techniques. The IE tool is expected to help researchers get the latest information on biomarker discovery and other biomedical research advances. We show preliminary results demonstrating that the method has strong potential for biomarker discovery.

[...] obstacles to develop clinically useful biomarker tests, including technical challenges associated with validating potential markers, and challenges associated with developing, evaluating, and incorporating the screening and diagnostics that make use of those markers into clinical practice. Over the past few decades there has been remarkable growth in the amount of biomedical data. In particular, the sequencing of the human genome and of quite a few other organisms has generated complete genomic sequences of unprecedented number and size. This development is accompanied by much data of various kinds, including protein sequences, results from large-scale genomic and proteomic experiments, and a lot of published literature [10]. This literature is a potential source of knowledge discovery and can help scientists gather recent research outcomes on biomedical concepts such as genes, proteins, diseases, drug discovery and many other topics [11]. The number of articles published each year in the biomedical domain is increasing rapidly, which makes it no longer possible for a researcher to read all the relevant articles manually. Figure-1 shows the

@Note: a workbench for biomedical text mining

Journal of biomedical informatics, 2009

Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM that aims at the effective translation of advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows annotations to be corrected; and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves interoperability, modularity and flexibility when integrating in-house and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although development is still ongoing, it has already enabled applications that are currently in use.
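The tokenisation and stopword-removal stage mentioned above can be illustrated with a short sketch. It is not @Note's implementation: the `preprocess` function, the token pattern, and the tiny stopword list are all assumptions chosen for the example.

```python
import re

# Illustrative stopword list (an assumption, far smaller than a real one).
STOPWORDS = frozenset(
    {"the", "of", "and", "in", "to", "a", "is", "that", "for", "with"}
)

# Alphanumeric tokens, keeping internal hyphens so names like "IL-2" survive.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def preprocess(text):
    """Tokenise text and drop stopwords, preserving original casing."""
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t.lower() not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("The expression of BRCA1 in IL-2 treated cells"))
```

Casing is preserved on purpose in this sketch, since gene and protein symbols are often case-sensitive; a lexicon-based annotator downstream could then match them directly.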