The RASH Process

The RASH process describes a methodology for resolving semantic heterogeneities between bioinformatics resources. The process starts with choosing resources, followed by making implicit schema explicit (where necessary), developing a unifying schema (if desired), comparing the schema, and then resolving the conflicts found. Information about each resource's schema, together with the equivalences and conflicts found, is stored in RASHdb, whose schema is a model of the information that appears in a database schema. In effect, the RASH process is a methodology for populating RASHdb. It is this database that is the core of the BioCOMPASS.

The case study using the SWISS-PROT and PIR protein sequence databanks will be used to illustrate the RASH process. It might be more realistic, in terms of everyday bioinformatics tasks, to have as an example the reconciliation of several schema fragments. This particular case study was, however, chosen for its simplicity, whilst revealing some common features of the RASH process.

This complete reconciliation and integration of PIR and SWISS-PROT will take place under the following scenario: TAMBIS allows queries to be formed over multiple resources, but only one resource of each type may be queried. For example, only one protein sequence database can be used to answer queries concerning the concept `protein'. To relax this single-source assumption, a common view (schema) is needed for the multiplicity of protein sequence databases. In addition, it would be useful to remove any redundancy in the answers to queries -- the same protein appearing more than once.

Both these resources exhibit a common feature of bioinformatics resources -- they are, or appear to be, flat-file resources. In flat-file databanks, the schema is implicit. One major task within the RASH process is to make this implicit schema manifest.

The BioCOMPASS is used to manage the RASH process. Several assumptions and requirements have been declared for the RASH process -- these are stated to help ensure that the RASH process is appropriate to its task. At each stage of the process, data is entered into the RASHdb that sits behind the BioCOMPASS. The process outlined below is deceptively simple -- the devil lies in the detail. The principal points of each stage are given, and links are provided to further detail and illustrations from the case study.

  1. Resource identification
    Before anything else can take place, the resources that will participate in the RASH process must be identified. The BioCOMPASS is one avenue for resource identification -- through its biology topic queries. Otherwise, web searches can yield the bioinformatics resource you require. Three useful web resources are: dbCAT, Amos' Web Links and the Molecular biology database index. Expasy's BioHunt offers a route to search for molecular biology resources. Finally, the January issue of each year's Nucleic Acids Research is dedicated to short articles on bioinformatics resources.
    • Write a scenario describing why the RASH process is being employed. For example, see the TAMBIS-based scenario for using SWISS-PROT and PIR given above;
    • Identify the resources that contain the information required;
    • Record the name, location and version of the resource in RASHdb;
    • Record the author of this RASH process and the date upon which it commenced in RASHdb;
    • Collect any documentation for the resource -- user guide, any available schema (including those from a third party) -- and record details of the documentation in RASHdb;
    • Identify which portions of the resources are required for the purpose of performing the RASH process.
      The BioCOMPASS will guide the user to submit these data to the RASH management part of RASHdb. The principal task of the BioCOMPASS is to populate the RASHdb, and in doing so it supports the RASH process.
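
As a rough illustration of the management information gathered in this stage, the sketch below models one run of the process and the resources registered for it. The class names, field names and placeholder values are assumptions made for this example; they are not taken from the actual RASHdb schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class ResourceRecord:
    """One resource taking part in a run (hypothetical layout)."""
    name: str
    location: str                      # placeholder, e.g. a URL or file path
    version: str
    documentation: list[str] = field(default_factory=list)
    required_portions: list[str] = field(default_factory=list)


@dataclass
class RashRun:
    """Management information for one run of the process (hypothetical)."""
    scenario: str
    author: str
    started: date
    resources: list[ResourceRecord] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of resources that still lack a version or documentation."""
        return [r.name for r in self.resources
                if not r.version or not r.documentation]


run = RashRun(
    scenario="Relax the single-source assumption for protein sequence databases",
    author="example author",           # placeholder value
    started=date(2000, 1, 1),          # placeholder date
)
run.resources.append(ResourceRecord(
    name="SWISS-PROT", location="<location>", version="<version>",
    documentation=["user guide"], required_portions=["sequence entries"]))
run.resources.append(ResourceRecord(
    name="PIR", location="<location>", version="<version>"))

# PIR has no documentation recorded yet, so it is reported as incomplete.
print(run.missing_fields())
```

A check such as `missing_fields` is one way a management tool could remind the user which of the items listed above have not yet been recorded.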
  2. Schema manifestation
    Many bioinformatics resources have no, or appear to have no, schema. This is usually because the resource is available, or appears to be available, as a flat-file. Otherwise, the resource may be available via a web-based user interface. In such cases, it will be necessary to develop an explicit schema for the resource.
    If a schema is available, it could be in any of the following forms:
    • ER or EER schema;
    • A collection of relational tables;
    • An object-oriented database schema;
    • An ACEdb schema.
      The BioCOMPASS can accommodate all of these forms. It may well be easier, however, to transform these schema representations into RASH's preferred representation, EXPRESS, as submission to RASHdb can then be semi-automatic.
    • Develop a schema for each of the resources identified;
    • Identify the portion of the schema required;
    • Record, if necessary, where in the original resource elements of the schema originated;
    • Record, if appropriate, the documentation associated with each schema element;
    • Record the data and biological intention for the schema elements.
      Two schema in EXPRESS for the primary case study can be found for SWISS-PROT and PIR. These schema were made explicit using the documentation available for these resources.
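
To make the idea of schema manifestation concrete, here is a minimal sketch of how elements of an implicit flat-file schema might be recorded along with their origin and intention. It relies on the fact that SWISS-PROT entries are built from lines beginning with a two-letter code (ID, AC, DE, OS and so on). The element names, the wording of the intentions and the sample entry are all invented for illustration and do not reproduce the EXPRESS schema mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaElement:
    name: str                  # hypothetical element name
    line_code: str             # where in the flat file the element originates
    data_intention: str        # what kind of value the element holds
    biological_intention: str  # what the value means biologically


# A small, simplified selection of SWISS-PROT line types.
SWISSPROT_ELEMENTS = [
    SchemaElement("entry_name", "ID", "string", "name identifying the entry"),
    SchemaElement("accession_number", "AC", "string", "stable identifier for the entry"),
    SchemaElement("description", "DE", "free text", "descriptive name of the protein"),
    SchemaElement("organism", "OS", "free text", "species the protein comes from"),
]


def elements_in_entry(entry_text, elements):
    """Return the schema elements whose originating line code occurs in the entry."""
    codes = {line[:2] for line in entry_text.splitlines() if line.strip()}
    return [e for e in elements if e.line_code in codes]


# Invented, heavily simplified entry fragment (not real data).
sample = """ID   EXAMPLE_ENTRY
AC   X00000;
DE   Example protein.
"""

found = [e.name for e in elements_in_entry(sample, SWISSPROT_ELEMENTS)]
print(found)
```

Keeping the line code next to each element is one simple way to record where in the original resource the element originated.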
  3. Development of unifying schema
    There is necessarily a target schema for the reconciliation -- a schema to which the source schema must conform. One of the source schema can be promoted to be the unifying schema, e.g., SWISS-PROT is the unifying schema and PIR must be reconciled to that schema. Otherwise, either some intermediate form or synthesis of the source schema, or a novel schema, will be developed. A unifying schema for SWISS-PROT and PIR in EXPRESS can be viewed here. A unifying schema is not mandatory -- it is possible to reconcile each of the source schema with each other. This may, however, be costly in effort.
    • Develop unifying schema;
    • Identify, if appropriate, which of the source schema each element arose from;
    • Add the schema to the RASHdb, including intention and management information.
  4. Schema comparison
    It is at this stage that semantic heterogeneities in the schema are identified and resolved. At this stage of the process, RASHdb contains separate entries for two or more schema. Inter-schema relationships need to be made that identify equivalent schema elements. Properties of these inter-schema relationships will describe the type of heterogeneity and the mechanism by which it may be resolved. It is possible, however, for equivalent entities to be irreconcilable in one or both directions.
    • Using the BioCOMPASS, form inter-schema relationships between equivalent schema elements;
    • The type of semantic heterogeneity found is recorded according to Won Kim's classification, which forms a part of the inter-schema equivalence portion of the RASHdb schema;
    • A description is added for the semantic heterogeneity;
    • The mechanism for resolving the heterogeneity is recorded;
    • For each inter-schema relationship, the classification, description and method are recorded for each direction of the relationship;
    • If the equivalence is irreconcilable in one or both directions, then it should be recorded via the BioCOMPASS.
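
The bullets above suggest a record in which each direction of an inter-schema relationship carries its own classification, description and mechanism. A minimal sketch of such a record follows. The schema and element names are invented, and the classification label is a placeholder rather than a term quoted from Won Kim's classification. Treating a missing mechanism as "irreconcilable" is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DirectedResolution:
    classification: str               # placeholder label
    description: str
    mechanism: Optional[str] = None   # None: irreconcilable in this direction

    @property
    def reconcilable(self) -> bool:
        return self.mechanism is not None


@dataclass
class InterSchemaRelationship:
    left: str                         # "schema.element" (hypothetical names)
    right: str
    left_to_right: DirectedResolution
    right_to_left: DirectedResolution

    def irreconcilable_directions(self) -> List[str]:
        """List the directions in which no resolution mechanism is recorded."""
        directions = []
        if not self.left_to_right.reconcilable:
            directions.append(f"{self.left} -> {self.right}")
        if not self.right_to_left.reconcilable:
            directions.append(f"{self.right} -> {self.left}")
        return directions


rel = InterSchemaRelationship(
    left="schema_a.name_parts",
    right="schema_b.full_name",
    left_to_right=DirectedResolution(
        classification="different representation",
        description="schema_a holds the name in parts; schema_b holds one string",
        mechanism="join the parts with single spaces",
    ),
    right_to_left=DirectedResolution(
        classification="different representation",
        description="a single string cannot be split into parts reliably",
    ),
)

print(rel.irreconcilable_directions())
```

Storing the two directions separately reflects the point made above that an equivalence may be resolvable one way but not the other.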
  5. Instance conflict resolution
  6. Querying and presentation
    Once all the information from this run of the RASH process has been gathered and entered, the BioCOMPASS can be used to answer RASH queries. It is possible, for instance, to recover a resource schema, together with a list of `instructions' on how to place data from another, equivalent, resource into that schema. The BioCOMPASS user interface and mode of action are described fully here.
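
One way to picture the kind of query described here is a function that, given recorded mappings, lists how each element of a target schema could be filled from an equivalent resource. The sketch below is only an illustration under assumed names: the `Mapping` tuple, the `instructions_for` function and the schema and element names are hypothetical and do not describe the BioCOMPASS interface.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    source: str                  # "schema.element" the data comes from
    target: str                  # "schema.element" the data is placed into
    mechanism: Optional[str]     # None: no resolution recorded for this direction


def instructions_for(target_schema: str, mappings: List[Mapping]) -> List[str]:
    """Build one instruction line per mapping that targets the given schema."""
    prefix = target_schema + "."
    lines = []
    for m in mappings:
        if not m.target.startswith(prefix):
            continue
        if m.mechanism is None:
            lines.append(f"{m.target}: cannot be filled from {m.source}")
        else:
            lines.append(f"{m.target}: take {m.source}, then {m.mechanism}")
    return lines


# Invented mappings between two hypothetical schema.
MAPPINGS = [
    Mapping("schema_b.title", "schema_a.description", "copy the text unchanged"),
    Mapping("schema_b.full_name", "schema_a.name_parts", None),
    Mapping("schema_a.description", "schema_b.title", "copy the text unchanged"),
]

steps = instructions_for("schema_a", MAPPINGS)
for step in steps:
    print(step)
```

Only mappings whose target lies in the requested schema are reported, so the same stored relationships can answer the query in either direction.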