Bridging Concept Identification for Constructing Information Networks from Text Documents (original) (raw)

Empirical study using network of semantically related associations in bridging the knowledge gap

Journal of translational medicine, 2014

BackgroundThe data overload has created a new set of challenges in finding meaningful and relevant information with minimal cognitive effort. However designing robust and scalable knowledge discovery systems remains a challenge. Recent innovations in the (biological) literature mining tools have opened new avenues to understand the confluence of various diseases, genes, risk factors as well as biological processes in bridging the gaps between the massive amounts of scientific data and harvesting useful knowledge.MethodsIn this paper, we highlight some of the findings using a text analytics tool, called ARIANA - Adaptive Robust and Integrative Analysis for finding Novel Associations.ResultsEmpirical study using ARIANA reveals knowledge discovery instances that illustrate the efficacy of such tool. For example, ARIANA can capture the connection between the drug hexamethonium and pulmonary inflammation and fibrosis that caused the tragic death of a healthy volunteer in a 2001 John Hopk...

Context-Driven Automatic Subgraph Creation for Literature-Based Discovery

Journal of biomedical informatics, 2015

Literature-based discovery (LBD) is characterized by uncovering hidden associations in non-interacting scientific literature. Prior approaches to LBD include use of: 1) domain expertise and structured background knowledge to manually filter and explore the literature, 2) distributional statistics and graph-theoretic measures to rank interesting connections, and 3) heuristics to help eliminate spurious connections. However, manual approaches to LBD are not scalable and purely distributional approaches may not be sufficient to obtain insights into the meaning of poorly understood associations. While several graph-based approaches have the potential to elucidate associations, their effectiveness has not been fully demonstrated. A considerable degree of a priori knowledge, heuristics, and manual filtering is still required. In this paper we implement and evaluate a context-driven, automatic subgraph creation method that captures multifaceted complex associations between biomedical conce...

Constructing an associative concept space for literature-based discovery

Journal of the American Society for Information Science and Technology, 2004

only be answered by combining information from various articles. In this paper, a new algorithm is proposed for finding associations between related concepts present in literature. To this end, concepts are mapped to a multi-dimensional space by a Hebbian type of learning algorithm using co-occurrence data as input. The resulting concept space allows exploration of the neighborhood of a concept and finding potentially novel relationships between concepts. The obtained information retrieval system is useful for finding literature supporting hypotheses and for discovering hitherto unknown relationships between concepts. Tests on artificial data show the potential of the proposed methodology. In addition, preliminary tests on a set of Medline abstracts yield promising results.

A Semantic Approach for Mining Hidden Links from Complementary and Non-interactive Biomedical Literature

Proceedings of the 2006 SIAM International Conference on Data Mining, 2006

Two complementary and non-interactive literature sets of articles, when they are considered together, can reveal useful information of scientific interest not apparent in either of the two sets alone. Swanson called the existence of such hidden links as undiscovered public knowledge (UPK). The novel connection between Raynaud disease and fish oils was uncovered from complementary and non-interactive biomedical literature by Swanson in 1986. Since then, there have been many approaches to uncover UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections. This paper proposes a semantic-based mining model for undiscovered public knowledge using the biomedical literature. Our method replaces manual ad-hoc pruning by using semantic knowledge from t he biomedical ontologies. Using the semantic types and semantic relationships of the biomedical concepts, our prototype system can identify the relevant concepts collected from Medline and generate the novel hypothesis between these concepts. The system successfully replicates Swanson's two famous discoveries: Raynaud disease/fish oils and migraine/magnesium. Compared with previous approaches such as LSI-based and traditional association rule-based methods, our method generates much fewer but more relevant novel hypotheses, and requires much less human intervention in the discovery procedure

Discovering Relations between Indirectly Connected Biomedical Concepts

The complexity and scale of the knowledge in the biomedical domain has motivated research work towards mining heterogeneous data from structured and unstructured knowledge bases. Towards this direction, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. In this work we attempt to address this problem by using indirect knowledge connecting two concepts in a graph to identify hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (text) data. In this graph we attempt to mine path patterns which potentially characterize a biomedical relation. For our experimental evaluation we focus on two frequent relations, namely "has target", and "may treat". Our results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8. Finally, analysis of the results indicates tha...

Hypotheses Generation as Supervised Link Discovery with Automated Class Lebeling on Large-scale Biomedical Concept Networks

Computational approaches to generate hypotheses from biomedical literature have been studied intensively in recent years. Nevertheless, it still remains a challenge to automatically discover novel, cross-silo biomedical hypotheses from large-scale literature repositories. In order to address this challenge, we first model a biomedical literature repository as a comprehensive network of biomedical concepts and formulate hypotheses generation as a process of link discovery on the concept network. We extract the relevant information from the biomedical literature corpus and generate a concept network and concept-author map on a cluster using Map-Reduce framework. We extract a set of heterogeneous features such as random walk based features, neighborhood features and common author features. The potential number of links to consider for the possibility of link discovery is large in our concept network and to address the scalability problem, the features from a concept network are extracted using a cluster with Map-Reduce framework. We further model link discovery as a classification problem carried out on a training data set automatically extracted from two network snapshots taken in two consecutive time duration. A set of heterogeneous features, which cover both topological and semantic features derived from the concept network, have been studied with respect to their impacts on the accuracy of the proposed supervised link discovery process. A case study of hypotheses generation based on the proposed method has been presented in the paper.

An integrative computational approach to identify disease-specific networks from PubMed literature information

2013 IEEE International Conference on Bioinformatics and Biomedicine, 2013

A huge amount of association relationships among biological entities (e.g., diseases, drugs, and genes) are scattered in biomedical literature. How to extract and analyze such heterogeneous data still remains a challenging task for most researchers in the biomedical field. Natural language processing (NLP) has the potential in extracting associations among biological entities from literature. However, association information extracted through NLP can be large, noisy, and redundant which poses significant challenges to biomedical researchers to use such information. To address this challenge, we propose a computational framework to facilitate the use of NLP results. We apply Latent Dirichlet Allocation (LDA) to discover topics based on associations. The networks extracted from each topic provide a disease-specific network for downstream bioinformatics analysis of associations for each topic. We illustrated the framework through the construction of disease-specific networks from Semantic MEDLINE, an NLP-generated association database, followed by the analysis of network properties, such as hub nodes and degree distribution. The results demonstrate that (1) LDA-based approach can group related diseases into the same disease topic; (2) the disease-specific association network follows the scale-free network property, in which hub nodes are enriched in related diseases, genes and drugs.

HiPub: Translating PubMed and PMC Texts to Networks for Knowledge Discovery

Bioinformatics, 2016

We introduce HiPub, a seamless Chrome browser plug-in that automatically recognizes, annotates and translates biomedical entities from texts into networks for knowledge discovery. Using a combination of two different named-entity recognition resources, HiPub can recognize genes, proteins, diseases, drugs, mutations and cell lines in texts, and achieve high precision and recall. HiPub extracts biomedical entity-relationships from texts to construct context-specific networks, and integrates existing network data from external databases for knowledge discovery. It allows users to add additional entities from related articles, as well as user-defined entities for discovering new and unexpected entity-relationships. HiPub provides functional enrichment analysis on the biomedical entity network, and link-outs to external resources to assist users in learning new entities and relations.

Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks

BMC Genomics, 2012

Computational approaches to generate hypotheses from biomedical literature have been studied intensively in recent years. Nevertheless, it still remains a challenge to automatically discover novel, cross-silo biomedical hypotheses from large-scale literature repositories. In order to address this challenge, we first model a biomedical literature repository as a comprehensive network of biomedical concepts and formulate hypotheses generation as a process of link discovery on the concept network. We extract the relevant information from the biomedical literature corpus and generate a concept network and concept-author map on a cluster using Map-Reduce framework. We extract a set of heterogeneous features such as random walk based features, neighborhood features and common author features. The potential number of links to consider for the possibility of link discovery is large in our concept network and to address the scalability problem, the features from a concept network are extracted using a cluster with Map-Reduce framework. We further model link discovery as a classification problem carried out on a training data set automatically extracted from two network snapshots taken in two consecutive time duration. A set of heterogeneous features, which cover both topological and semantic features derived from the concept network, have been studied with respect to their impacts on the accuracy of the proposed supervised link discovery process. A case study of hypotheses generation based on the proposed method has been presented in the paper.