Diagen: A Model-Driven Framework for Integrating Bioinformatic Tools

Using conceptual modeling to improve genome data management

Briefings in Bioinformatics, 2020

With advances in genomic sequencing technology, a large amount of data is publicly available for the research community to extract meaningful and reliable associations among risk genes and the mechanisms of disease. However, this exponentially growing data is spread across more than a thousand heterogeneous repositories, represented in multiple formats and with different levels of quality, which hinders the differentiation of clinically valid relationships from those that are less well-sustained and that could lead to a wrong diagnosis. This paper presents how conceptual models can play a key role in efficiently managing genomic data. These data must be accessible, informative and reliable enough to extract valuable knowledge in the context of the identification of evidence supporting the relationship between DNA variants and disease. The approach presented in this paper provides a solution that helps researchers to organize, store and process information focusing only on the data that are relevant a...

Interoperability between Bioinformatics Tools: A Logic Programming Approach

Lecture Notes in Computer Science, 2001

The goal of this project is to develop solutions to enhance interoperability between bioinformatics applications. Most existing applications adopt different data formats, forcing biologists into tedious translation work. We propose the use of Nexus as an intermediate representation language. We develop a complete grammar for Nexus and use this grammar to build a parser. The construction relies heavily on the distinctive features of Prolog to easily derive effective parsing and translating procedures. We also develop a general parse tree format suitable for interconversion between Nexus and other formats.

Database integration and querying in the bioinformatics domain

2005

Given the exponential growth in the amount of genetic data being produced, it is more important than ever for researchers to have effective tools to help them manage this data. This paper describes a system that enables users, generally biologists, to construct components to answer specific questions in their field. The system allows the creation of modules and submodules via top-down decomposition. Concepts and terms can be defined through conversation. These are then used when composing base-level functions to produce code for modules and for interfacing modules.

An agile model-driven approach for simplifying the development of genetic analysis tools

2012 Sixth International Conference on Research Challenges in Information Science (RCIS), 2012

In the last few years, genetic researchers have started to assemble their own genetic analysis tools by reusing and combining available software. Because software development environments are not widely accepted in the Genetics community, geneticists become software developers, and they are forced to integrate different solutions and to face programming issues without the required knowledge. A solution to this issue lies in simplifying the development of tailored tools. Geneticists demand development environments where: 1) the required data can be expressed according to their knowledge, and 2) the most common functionality can be easily integrated without programming skills. This PhD work proposes the use of the model-driven paradigm for addressing both concerns and presents an agile way of developing genetic analysis tools.

Facing the Challenges of Genome Information Systems: A Variation Analysis Prototype

Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 2011

In Bioinformatics there is a lack of software tools that meet the requirements of biologists. For instance, when a DNA sample is sequenced, a lot of work has to be performed manually and several tools are used. The application of Information Systems (IS) principles to the development of bioinformatics tools opens an interesting new research path. One of the most promising approaches is the use of conceptual models to precisely define how genomic data is represented in an IS. This work introduces how to build a Genome Information System (GIS) using these principles. As a first step towards this goal, a conceptual model to formally describe genomic mutations is presented. In addition, as a proof of concept of this approach, a variation analysis prototype has been implemented using this conceptual model as a development core.

An object-oriented genetics information system

Proceedings of the 1993 ACM/SIGAPP symposium on Applied computing states of the art and practice - SAC '93, 1993

Sequence data is being produced by genomic sequencing laboratories at ever-increasing rates, making it impossible for individual researchers to keep track of all the new data that might affect their research. Computer systems are needed so that researchers can access this data. The systems must support high-level interfaces that communicate in the language of the researchers, database systems that guarantee availability and consistency of the data, and powerful search systems that rapidly scan for similarities between sequences. We have developed a prototype system that includes a graphical user interface, an object-oriented database management system, and high-performance similarity search algorithms. The prototype has the potential to increase researchers' productivity by automating entry of annotated sequence fragments as they are produced by sequencing machines, storing the fragments in the database, and automatically producing and displaying similarity search results of new sequences against the large public sequence databases GenBank and PIR. This paper describes the prototype, discusses the benefits of object-oriented databases for complex and changing sequence data, and presents an object-oriented schema for genetic information. Graphical tools for annotating sequences, storing them in the database, automating similarity searches, and viewing similarity search results are presented. A new suffix tree-based data structure that supports rapid similarity searches on sequence data is introduced. Finally, future plans for the system are discussed.

A framework for molecular biology data integration

Procs. Workshop on Information …, 2001

Molecular biology data are stored in different databases, repositories and flat files, usually distributed over the web. These heterogeneous data sources are implemented with distinct data models whose schemas change often. It is very important to gather information about these data sources, including schemas and ontology. The usual approach to this information integration problem is to use a single model that captures all the needed data and related methods. Instead, this work proposes the use of a domain-specific framework for molecular biology data access and applications. This way we can capture multiple schemas and preexisting data sources, besides having a tool for schema evolution maintenance and database instantiation.

Building integrated systems for data representation and analysis in molecular biology

Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences HICSS-94, 1994

of eukaryotic genomes centered on the relations between genomic sequences and their chromosomal localization [8]. In their early stages of development, ColiGene and MultiMap integrated a few sequence analysis methods. But, little by little, we found it useful to provide access from one knowledge base to the methods associated with the other. Moreover, it was obvious that the methods integrated in ColiGene and MultiMap lacked standardization and that their integration in the knowledge bases was a bit artificial (i.e. they were more associated pieces of software than really integrated tools). This is why we decided to separate methods from the biological objects, while maintaining communication possibilities between these two kinds of knowledge. To this end, we have developed two complementary packages: Digit and Misa. The first was primarily designed as a graphical layer for Slot, a knowledge base devoted to the methodology of multivariate analysis. The genericity of the tools available in Digit allows their use with any knowledge base developed with the Shirka system. After Digit, we developed Misa, a more specific system which, owing to its modular design, is able to integrate virtually any available sequence analysis method. The modules that are part of the Misa system are now our basis for building a more advanced system in which methods, defined as "tasks", are managed by an "intelligent" system that guides the user in the choices and chainings needed to perform complex analyses.

Conceptual modelling of genomic information

…, 2000

Motivation: Genome sequencing projects are making available complete records of the genetic make-up of organisms. These core data sets are themselves complex, and present challenges to those who seek to store, analyse and present the information. However, in addition to the sequence data, high throughput experiments are making available distinctive new data sets on protein interactions, the phenotypic consequences of gene deletions, and on the transcriptome, proteome, and metabolome. The effective description and management of such data is of considerable importance to bioinformatics in the post-genomic era. The provision of clear and intuitive models of complex information is surprisingly challenging, and this paper presents conceptual models for a range of important emerging information resources in bioinformatics. It is hoped that these can be of benefit to bioinformaticians as they attempt to integrate genetic and phenotypic data with that from genomic sequences, in order to both assign gene functions and elucidate the different pathways of gene action and interaction. Results: This paper presents a collection of conceptual (i.e. implementation-independent) data models for genomic data. These conceptual models are amenable to (more or less direct) implementation on different computing platforms. Availability: Most of the information models presented here have been implemented by the authors using an object database. The implementation of a public interface to this database is in progress. We hope to have a public release in the autumn of 2000, available from http://img.cs.man.ac.uk/gims.

ARMEDA II: supporting genomic medicine through the integration of medical and genetic databases

Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering, 2004

In this paper we present ARMEDA II, a project designed to integrate distributed heterogeneous medical and genetic databases in support of genomic medicine. In this project, we have followed a "virtual repository" or VR approach. Although VRs contain no data, only metadata, they give users the impression of working with local repositories that integrate data from different and remote sources. Our approach is based on two basic operators employed to connect new databases to the system: mapping and unification. The mapping process produces what is called the "virtual conceptual schema" of the newly created VR, while the unification process provides tools to create an integrated virtual schema for at least two pre-existing VRs. We tested the current implementation of ARMEDA II using two tumor databases, one containing information from a hospital and the other containing genetic data associated with the tumor samples. The performance of the system was also evaluated using a pre-created set of 30 queries. For all queries the test yielded promising results, since the system successfully retrieved the correct information. ARMEDA II is the current version of an ongoing project developed in the framework of a European Commission funded project.
