Adaptive Integration of Structured and Unstructured Data from Many Sources in a Biological Domain
Description
Our goal in this research is to construct a knowledge-based (KB) system that learns to integrate, with increasing accuracy, the many heterogeneous sources of information relevant to a single scientist's research needs. The system, called Querendipity, works by loosely integrating data of many sorts (including unstructured text) into a single typed directed graph, and then querying the graph using a query language that allows "schema-free similarity queries". These queries specify a set of query terms (e.g., keywords or entities in the KB) and constraints on the desired output (e.g., a target data type). The result of a query is a ranked list of KB entities, ordered by similarity to the query terms.
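As a rough illustration of what a schema-free similarity query over a typed directed graph might look like, the sketch below ranks entities of a target type by their proximity to the query terms. It assumes similarity is measured by a random walk with restart, which this page does not specify; the function name, the node names, the type labels, and the toy graph are all hypothetical and are not taken from Querendipity or GHIRL.

```python
from collections import defaultdict

def similarity_query(edges, node_types, query_nodes, target_type,
                     restart=0.5, steps=20):
    """Rank nodes of `target_type` by random-walk-with-restart
    proximity to `query_nodes` (an assumed similarity measure)."""
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    # The walk starts, and restarts, uniformly over the query terms.
    start = {q: 1.0 / len(query_nodes) for q in query_nodes}
    prob = dict(start)

    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in prob.items():
            neighbours = out.get(node)
            if not neighbours:
                # Dangling node: send all of its mass back to the query.
                for q, w in start.items():
                    nxt[q] += p * w
                continue
            share = (1.0 - restart) * p / len(neighbours)
            for dst in neighbours:
                nxt[dst] += share
            for q, w in start.items():
                nxt[q] += restart * p * w
        prob = dict(nxt)

    # Output constraint: keep only entities of the requested type.
    ranked = [(n, p) for n, p in prob.items()
              if node_types.get(n) == target_type]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked

# Hypothetical toy graph: a keyword links to papers, papers link to genes.
edges = [
    ("term:ribosome", "paper:P1"),
    ("term:ribosome", "paper:P2"),
    ("paper:P1", "gene:RPL5"),
    ("paper:P1", "gene:RPS3"),
    ("paper:P2", "gene:RPL5"),
    ("paper:P3", "gene:ACT1"),
]
node_types = {
    "term:ribosome": "term",
    "paper:P1": "paper", "paper:P2": "paper", "paper:P3": "paper",
    "gene:RPL5": "gene", "gene:RPS3": "gene", "gene:ACT1": "gene",
}

ranked = similarity_query(edges, node_types, ["term:ribosome"], "gene")
for name, score in ranked:
    print(name, round(score, 4))
```

In this toy graph the gene reached through two papers ranks above the gene reached through one, and a gene with no path from the query term is not returned at all. The query names no schema path; it only gives the starting terms and the desired output type.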
After a query, a user can optionally label any subset of the ranked list of suggested answers as "relevant" or "non-relevant". These labels drive a learning phase, the goal of which is to produce a better ranking. Types of learning currently being investigated include EM-based parameter tuning, learning to discriminatively re-rank, and learning to restructure the graph (by adding or deleting edges or vertices). Queries collected in the laboratories of working biologists are used to evaluate these learning methods.
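To make the feedback loop concrete, here is a minimal sketch of one possible form of discriminative re-ranking: a pairwise perceptron that adjusts feature weights until every entity labeled relevant scores above every entity labeled non-relevant. This is an illustrative stand-in, not the project's method; the candidate names, the two-dimensional feature vectors, and both function names are hypothetical.

```python
def train_pairwise_perceptron(features, labels, epochs=10, lr=1.0):
    """Learn weights so that each relevant candidate outscores
    each non-relevant one (pairwise perceptron updates)."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    relevant = [c for c, y in labels.items() if y]
    non_relevant = [c for c, y in labels.items() if not y]
    for _ in range(epochs):
        for pos in relevant:
            for neg in non_relevant:
                diff = [a - b for a, b in zip(features[pos], features[neg])]
                margin = sum(wi * di for wi, di in zip(w, diff))
                if margin <= 0:
                    # Misordered pair: move the weights toward the difference.
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def rerank(features, w):
    """Return all candidates sorted by learned score, best first."""
    def score(c):
        return sum(wi * xi for wi, xi in zip(w, features[c]))
    return sorted(features, key=lambda c: (-score(c), c))

# Hypothetical features per candidate, e.g. similarity via two path types.
features = {
    "gene:A": [0.9, 0.1],
    "gene:B": [0.2, 0.8],
    "gene:C": [0.8, 0.2],
    "gene:D": [0.1, 0.6],
}
# The user labels only two of the four suggested answers.
labels = {"gene:B": True, "gene:A": False}

weights = train_pairwise_perceptron(features, labels)
new_order = rerank(features, weights)
print(new_order)
```

Only two candidates carry labels, yet the learned weights also reorder the unlabeled ones: the candidate whose features resemble the relevant example moves ahead of the one resembling the non-relevant example.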
The broadest impact of this project is on the problem of learning to integrate heterogeneous data sources (including free text and structured data). If successful, the KB system will also have broad impact in the biological research community; in particular, we believe that adaptive personal KB systems of this sort will be a valuable complement to existing biological KBs.
Acknowledgements
This project is funded by the NSF's Division of Information & Intelligent Systems as award 0811562, from September 1, 2008 through August 31, 2011.
Project Members
Participants include
- William W. Cohen, of the Lane Center for Computational Biology and the Machine Learning Department, PI
- John Woolford, of the Department of Biology, coPI
- Ramnath Balasubramanyan, LTI PhD student
- Ni Lao, LTI PhD student
- Frank Lin, LTI PhD student
- Dana Movshovitz-Attias, CSD PhD student
- Katie Rivard, research programmer/analyst
- Maryam Aly, undergraduate research assistant (during fall semester 2009)
- Andrew Arnold (former MLD PhD student, now at WorldQuant)
Relevant publications
Below are some of the publications most relevant to the research behind Querendipity.
Completed
- Ramnath Balasubramanyan and William W. Cohen (2011): Block-LDA: Jointly modeling entity-annotated text and entity-entity links in SDM-2011.
- Ni Lao and William W. Cohen (2010): Relational Retrieval Using a Combination of Path-Constrained Random Walks in ECML-2010.
- Ni Lao and William W. Cohen (2010): Fast Query Execution for Retrieval Models based on Path Constrained Random Walks in KDD-2010.
- Ramnath Balasubramanyan and William W. Cohen (2010): Block-LDA: Jointly modeling entity-annotated text and entity-entity links in ICML-2010 Workshop on Topic Modeling.
- Frank Lin and William W. Cohen (2010): Semi-Supervised Classification of Network Data Using Very Few Labels in ASONAM-2010.
- Frank Lin and William W. Cohen (2010): Power Iteration Clustering in ICML-2010.
- Frank Lin and William W. Cohen (2010): A Very Fast Method for Clustering Big Text Datasets in ECAI-2010.
- Andrew Arnold and William W. Cohen (2009): Information Extraction as Link Prediction: Using Curated Citation Networks to Improve Gene Detection in SNAS-2009.
- Andrew Arnold and William W. Cohen (2009): Information Extraction as Link Prediction: Using Curated Citation Networks to Improve Gene Detection in ICWSM-2009 (poster).
- Joanna Bresee, Hajin Choi, Daniel Lee, Ellen Wu (2009): Adaptive Personalized Information Management for Biologists: Final Report (report)
Integration Software
- SourceForge page for GHIRL, the underlying graph-query system.
- Web-based query-system running a version of Querendipity configured for John Woolford's lab
- Earlier version of Querendipity.
Integration Datasets
- Yeast datasets:
- Fly datasets:
System Snapshots
Snapshots of the system, with code and data, as of a particular date.
Querendipity stands for Query-based User-guided Exploration of Relations and ENtities in Data Integrated Probabilistically or Identified in Text about Yeast. Development of a model-organism-independent acronym is a subject for further research.
Last modified: Thu Jul 14 14:36:58 Eastern Daylight Time 2011