MIPS: a database for protein sequences, homology data and yeast genome information

Abstract

The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database (1,2). MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program (3) are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure (4) developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome (5), the functional classification of yeast genes (FunCat) and its graphical display, the ‘Genome Browser’ (6). A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request.

Description

Data collection and processing of protein sequences

MIPS is responsible for collecting protein sequence data from European sources for the common Protein Sequence Database of PIR-International (1,2) by scanning major European journals and by translation of nucleic acid sequence data received on a day-by-day basis from the EBI (7). In addition, sequence data generated in the European sequencing projects for Saccharomyces cerevisiae and Arabidopsis thaliana are processed and analyzed. The resulting protein sequences are processed for PIR-International and the nucleic acid sequence data are forwarded to the EBI. As soon as they are captured, protein sequences are compared with the complete set of published proteins and the results are incorporated into the FASTA database (see below). After data verification sequences are rapidly added to the PIR-International Protein Sequence Database (1,2). In a second step, sequences are annotated. During this process they are scrutinized for overlaps with existing database entries and merged if identity with the matching database object has been confirmed. The established nomenclature of PIR-International is rigorously applied for the annotation of protein names, species, keywords and features including posttranslational protein modifications and homology domains.

To allow for strong data typing, we have introduced the CO2 format (8) for data handling, supported by a commercial object-oriented database management system. Data forwarding to keep copies of the database synchronized across a wide area network is currently under investigation.

As a supplement to the Protein Sequence Database, we have created PATCHX, a set of unique, unverified protein sequences built from external sources, e.g. automatic translations of nucleic acid sequences from the EBI Data Library (7), translations contained in GenBank (9), and sequences from SwissProt (10). Sequences that occur in the PIR-International Protein Sequence Database are excluded from PATCHX. A large fraction of PATCHX consists of sequences that differ only slightly from entries in the Protein Sequence Database. These differences are largely due to inconsistencies between the published and the submitted version of a sequence. Current efforts aim to reduce the number of entries in PATCHX to a minimum and to parse the different entry formats into a homogeneous database.
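The exclusion rule for the supplement can be illustrated with a minimal Python sketch. The function name `build_supplement`, the identifiers and the toy sequences are hypothetical, and exact string matching is an assumption made for illustration; the actual PATCHX procedure is not specified in this text.

```python
def build_supplement(verified, external):
    """Return unique external sequences that are absent from the verified set.

    verified: iterable of sequences already in the main database.
    external: iterable of (source_id, sequence) pairs from outside sources.
    Matching is by exact, case-insensitive string identity (an assumption).
    """
    seen = {seq.upper() for seq in verified}
    supplement = []
    for source_id, seq in external:
        key = seq.upper()
        if key in seen:
            continue  # already in the main database or already collected
        seen.add(key)
        supplement.append((source_id, seq))
    return supplement


verified = ["MKTAYIAK", "MSLLTEVE"]
external = [
    ("embl:1", "MKTAYIAK"),  # identical to a verified entry: dropped
    ("gb:2", "MKTAYIAR"),    # differs by one residue: kept
    ("sp:3", "MKTAYIAR"),    # duplicate within the external set: dropped
]
patchx_like = build_supplement(verified, external)
```

With exact matching, a sequence that differs from a verified entry by a single residue is retained, which is consistent with the observation that many supplement entries are near-duplicates of database entries.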

Sequence homology data: FASTA database, PROT-FAM and HPT

Sequence similarity is the most powerful tool for sequence data analysis. To support the annotation efforts of MIPS, an up-to-date database of sequence similarities has been built based on the FASTA algorithm for sequence database searches (3). FASTA results can be retrieved within milliseconds. The FASTA database, introduced in 1991, is updated reciprocally with every new sequence added to the database. Owing to the symmetric relation of the one-to-one sequence comparison, all matches of the new sequence are incorporated into the existing FASTA results of older entries. Thus, the database is always up-to-date. To allow for a compact representation, the entry title is stored separately. The output for queries to the FASTA database can be customized according to the user's needs (e.g. cutoff, maximum number of hits, database, etc.). FASTA results, however, are only raw sequence similarity data. The superfamily classification effort of PIR-International provides access to validated sequence homology information (11,12) at the level of the entire protein and at the level of the homology domain. Classification results are displayed as multiple sequence alignments. PROT-FAM permits access to nearly 10 000 multiple alignments through the World Wide Web (see below).
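The reciprocal update of stored similarity results can be sketched as follows. This is a minimal illustration in Python, not the MIPS implementation: the class `SimilarityStore`, its methods and the scores are hypothetical, and the comparison itself is assumed to have been run beforehand.

```python
from collections import defaultdict


class SimilarityStore:
    """Toy store of precomputed pairwise similarity scores."""

    def __init__(self):
        self.hits = defaultdict(dict)  # entry id -> {other id: score}

    def add_entry(self, new_id, matches):
        """Register a new entry; matches maps existing ids to scores."""
        self.hits[new_id]  # make sure the entry exists even without hits
        for other_id, score in matches.items():
            self.hits[new_id][other_id] = score
            # Reciprocal step: the older entry learns about the new one.
            self.hits[other_id][new_id] = score

    def query(self, entry_id, cutoff=0.0, max_hits=None):
        """Return (id, score) pairs, best first, filtered by cutoff."""
        ranked = sorted(self.hits.get(entry_id, {}).items(),
                        key=lambda item: item[1], reverse=True)
        ranked = [(i, s) for i, s in ranked if s >= cutoff]
        return ranked if max_hits is None else ranked[:max_hits]


store = SimilarityStore()
store.add_entry("A", {})
store.add_entry("B", {"A": 120.0})
store.add_entry("C", {"A": 300.0, "B": 80.0})
all_hits_for_a = store.query("A")
strong_hits_for_a = store.query("A", cutoff=200.0)
best_hit_for_b = store.query("B", max_hits=1)
```

Because each comparison is written to both entries, a query is a lookup rather than a new search, which is the property that makes near-instant retrieval plausible.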

The superfamily concept developed in the mid 1970s (13–15) states that homologous proteins can be assigned to protein superfamilies. Members of a superfamily have diverged from a common ancestral form. The original concept of the ‘protein superfamily’ is not applicable to multidomain proteins and thus was recently extended (11,12). The extended superfamily concept permits classification at two levels: the level of the ‘complete protein’ and the level of the ‘homology domain’.

Complete proteins are classified into ‘homeomorphic protein superfamilies’. Two proteins belong to the same homeomorphic protein superfamily when they are homologous over all of their sequence from the amino to the carboxyl end. ‘Homology domains’ are regions of local similarity contained in otherwise unrelated proteins, e.g. ‘protein kinase homology’ or ‘trypsin homology’. Regions of local similarity repeated within a single protein are also classified at the level of the ‘homology domain’, e.g. ‘ADP,ATP carrier protein repeat homology’.

Classification at the level of the complete protein makes it possible to partition the entire Protein Sequence Database into independent, nonoverlapping groups of entries. Each completely sequenced protein belongs to exactly one homeomorphic protein superfamily. The condition that members of a homeomorphic protein superfamily must be homologous over their entire sequence length implies that all members must contain the same homology domains in the same order. For practical reasons we use ‘sequence similarity’ as the main criterion to discriminate between homologous and non-homologous sequences. To avoid false positive assignments we define stringent conditions as criteria for sequence homology in routine work: (i) 30% sequence identity; (ii) at least 100 residues in length; and (iii) free of composition bias. More distantly related proteins may be clustered into the same homeomorphic protein superfamily after detailed sequence analysis or when homology is supported by non-sequence data, e.g. structural information.
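The three routine criteria can be written as a simple filter. The Python sketch below rests on stated assumptions: the 30% figure is read as a minimum, and composition bias is approximated by the fraction of the most frequent residue in the aligned segment, a stand-in measure chosen for illustration because the text does not define one. All names are hypothetical.

```python
from collections import Counter

MIN_IDENTITY = 0.30  # criterion (i), read here as a lower bound
MIN_LENGTH = 100     # criterion (ii), in residues


def is_composition_biased(segment, max_fraction=0.30):
    """Crude bias check: one residue type dominates the segment.

    The 0.30 default is an arbitrary illustrative value.
    """
    if not segment:
        return False
    most_common_count = Counter(segment).most_common(1)[0][1]
    return most_common_count / len(segment) > max_fraction


def passes_routine_criteria(identity, aligned_segment):
    """Apply criteria (i) to (iii) to one pairwise match."""
    return (identity >= MIN_IDENTITY
            and len(aligned_segment) >= MIN_LENGTH
            and not is_composition_biased(aligned_segment))


balanced = "ACDEFGHIKLMNPQRSTVWY" * 6   # 120 residues, evenly mixed
biased = "Q" * 80 + balanced[:40]       # 120 residues, mostly glutamine

accepted = passes_routine_criteria(0.35, balanced)
too_dissimilar = passes_routine_criteria(0.25, balanced)
too_short = passes_routine_criteria(0.35, balanced[:50])
low_complexity = passes_routine_criteria(0.35, biased)
```

A match rejected by this filter is not necessarily non-homologous; as noted above, more distant relationships may still be accepted after detailed analysis.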

‘Protein families’ are defined as sets of proteins under the even more stringent condition that each member of the family has more than 50% identity to at least one other member of the family. A superfamily is a union over families.
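One way to read this definition operationally is single-linkage clustering: two proteins fall into the same family whenever a chain of pairwise identities above 50% connects them. The Python sketch below uses a union-find structure to do this; it is an interpretation for illustration, not the documented PROT-FAM procedure, and the identifiers and identity values are made up.

```python
def cluster_families(ids, pairwise_identity, threshold=0.5):
    """Group ids into families by single linkage on identity > threshold.

    pairwise_identity maps (id_a, id_b) tuples to fractional identity.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in pairwise_identity.items():
        if identity > threshold:
            parent[find(a)] = find(b)  # merge the two families

    families = {}
    for i in ids:
        families.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in families.values())


ids = ["P1", "P2", "P3", "P4"]
pairwise_identity = {
    ("P1", "P2"): 0.62,
    ("P2", "P3"): 0.55,
    ("P1", "P3"): 0.41,  # below threshold, but linked through P2
    ("P3", "P4"): 0.50,  # not strictly above 50%: P4 stays alone
}
families = cluster_families(ids, pairwise_identity)
```

In this reading P1 and P3 share a family although their direct identity is only 41%, because each is more than 50% identical to P2. A protein with no partner above the threshold forms a family of size one.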

As of September 1996 the PIR-International Protein Sequence Database contains 90 000 entries. Of these, 68 000 (75%) entries have been classified into 26 000 protein families. 64 000 (71%) entries are finally classified. The remaining 4000 sequences (4%) are fragments that cannot be unambiguously classified. Protein families vary in size from one to several hundred members. Seventy-three percent of the finally classified entries (47 000 sequences) are present in the 8000 families with at least two members. The residual 17 000 sequences (27%) represent single-member protein families. Multiple sequence alignments and sequence profiles have been computed using the GCG sequence analysis software (16) programs PILEUP and PROFILEMAKE.

Sixty percent (38 000 of 64 000) of the finally classified entries have been grouped into 4100 superfamilies. Of these, 2800 (68%) are based on a single protein family whereas 1300 (32%) contain sequences from more than one protein family. Multiple alignments for the latter set are also precomputed for rapid inspection.

Homology domains are annotated in the ‘superfamily’ field and as ‘domain’ features in the PIR-International Protein Sequence Database. Superfamily names for homology domains contain the term ‘homology’. The ‘domain’ feature annotation contains exactly the same name and indicates sequence coordinates. Although different authors usually agree that a sequence contains a certain homology domain, the assignment of domain boundaries is rather subjective. To avoid inconsistencies, we set boundaries close to well-conserved regions. MIPS extracts all homology domains annotated as domain features from the Protein Sequence Database into a specific homology domain sequence database called HOMDOM. 17 000 individual domain features are annotated for the 285 distinct homology domains. MIPS screens for as yet unannotated occurrences of homology domains and adds the corresponding domain feature annotation to the database entries.
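Extracting annotated domains into a separate sequence collection amounts to slicing each entry by its feature coordinates. The following Python sketch assumes a hypothetical record layout (dictionaries with `id`, `sequence` and `features`) and 1-based inclusive coordinates; neither the layout nor the entry shown reflects the actual PIR-International format.

```python
def extract_domains(entries):
    """Collect the subsequence of every 'domain' feature in every entry."""
    records = []
    for entry in entries:
        for feature in entry["features"]:
            if feature["type"] != "domain":
                continue  # other feature types are ignored here
            start, end = feature["start"], feature["end"]
            records.append({
                "entry": entry["id"],
                "name": feature["name"],
                "start": start,
                "end": end,
                # Convert 1-based inclusive coordinates to a Python slice.
                "sequence": entry["sequence"][start - 1:end],
            })
    return records


entries = [{
    "id": "TOY001",  # made-up entry for illustration
    "sequence": "MKTAYIAKQRQISFVKSHFSRQ",
    "features": [
        {"type": "domain", "name": "example homology",
         "start": 3, "end": 10},
        {"type": "binding site", "name": "example site",
         "start": 12, "end": 12},
    ],
}]
domain_records = extract_domains(entries)
```

Keeping the entry identifier and coordinates with each extracted sequence allows a hit against the domain collection to be traced back to its source entry.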

The HPT (hashed position tree) is an index data structure for improved performance of sequence comparisons. The index is used to preprocess a data set, which greatly reduces the computational complexity of sequence comparisons. Various applications using the HPT can be formulated. (i) The all-against-all matching of the 12 million bases of the yeast genome on DNA and protein level could be done in less than 48 h on a single DEC workstation. (ii) A WWW interface is available that allows a query sequence to be compared with the HPT-indexed dataset of the 6274 ORFs of the yeast proteome. (iii) The yeast proteins have been compared against the more than 2.2 million translated human ESTs. (iv) A prototype version of the PIR-International Protein Sequence Database indexed by HPT is available to search for amino acid patterns and protein sequences.
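The general idea of preprocessing a data set into a position index can be illustrated with a plain k-mer hash table. The Python sketch below is explicitly not the HPT, whose structure is described in reference 4 and not reproduced here; it only shows how an index built once lets a pattern search inspect a few candidate positions instead of scanning every sequence. All names and sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(sequences, k=3):
    """Map every k-mer to the (sequence id, 0-based position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_pattern(index, sequences, pattern, k=3):
    """Find exact occurrences of pattern using the index as a prefilter."""
    if len(pattern) < k:
        raise ValueError("pattern must be at least k residues long")
    hits = []
    # Only positions sharing the pattern's first k-mer are verified.
    for seq_id, pos in index.get(pattern[:k], []):
        if sequences[seq_id].startswith(pattern, pos):
            hits.append((seq_id, pos))
    return hits


sequences = {"orf1": "MKTAYIAKQR", "orf2": "AYIAKAYIAK"}
index = build_kmer_index(sequences)
matches = find_pattern(index, sequences, "AYIAK")
```

The index costs memory proportional to the total sequence length, which is the usual trade-off for moving work from query time to a one-off preprocessing step.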

The complete genome of Saccharomyces cerevisiae

By April 24, 1996, the sequence of the yeast genome was completed as the result of a world-wide collaboration among European, Swiss, UK, American, Canadian and Japanese laboratories. Sixteen chromosomes, not including rDNA repeats, of 12 million bases of DNA code for 6274 proteins. MIPS has served as the informatics centre for the European effort and assembled more than 6 million bases of data submissions into contiguous chromosomal sequences. Data have been annotated and organized in a database of the yeast genome accessible through the WWW, including information from other, specialized yeast databases. In addition to the information available in the PIR-International Protein Sequence Database the yeast database contains detailed information on specific properties like codon adaptation bias, disruptants and motifs. Links to the corresponding sequence databases, PROT-FAM, the FASTA database and YPD (17) are implemented.

Intuitive visual access to large volumes of data is indispensable for systematic analysis of large scale genomic data, e.g. to inspect the 12 million bases of the complete yeast genome. Correlations between independent findings become apparent only if the data are displayed in a coherent and well structured way. The WWW-based genome browser makes it possible to visualize various aspects of the genome interactively, e.g. to display the set of all sequence similarities within the whole genome as computed using HPT, or all proteins that belong to a selected functional category. The user may specify a particular view of the genome as a declarative query. The result of the request is an image that can be inspected at variable resolutions. Any detected genetic element can be used as an entry point to the yeast information system provided by MIPS.

A version of the genome browser and the yeast sequence database will be available on CD-ROM. This enables every scientist to access this tool independent of network resources. The standard browser technology can be used for local data processing, as the program was written in the novel internet programming language JAVA and all documents are stored in HTML. The CD-ROM will contain the complete yeast genome and its annotation. The appropriate system-independent software to navigate and query interactively will be provided.

MIPS WWW services

The MIPS server attempts to integrate data and services to ease access to our resources. Database services vary widely in: (i) the type of data retrieved or explored; (ii) their temporal behavior; (iii) their principal type of operation as stateless or state dependent; and (iv) the platform on which they reside. A layered software architecture was established to hide the heterogeneity of services from the user. This architecture was implemented using client/server communication to distribute and schedule tasks over the local network of workstations in a way that is entirely transparent to the user (18).

The WWW server focuses on data and services uniquely supplied by MIPS. These include a WWW interface to the multi-database/multifield query system ATLAS, allowing format-independent access to and retrieval from 74 indexed databases totaling more than 2 million entries. The MIPS WWW site gives access to the PROT-FAM project with nearly 10 000 multiple sequence alignments at the level of the protein family (8000 alignments), protein superfamily (1200 alignments), or homology domain (285 alignments). It is possible to align a query sequence against a sequence profile derived from the multiple alignment. The yeast genome browser and several applications based on HPT are also available through the WWW.

How to contact MIPS: Münchner Informationszentrum für Proteinsequenzen, Max-Planck-Institut für Biochemie, D-82152 Martinsried bei München, Germany; Tel +49 89 8578 2656; Fax +49 89 8578 2655; Email mewes@mips.embnet.org

Acknowledgments

MIPS is supported by the Max-Planck-Gesellschaft, the Forschungszentrum f. Umwelt und Gesundheit (GSF) and the European Commission BRIDGE Grants BIOT-CT-0167 and 0172.

References

1. Nucleic Acids Res., 1996, vol. 24 (pg. 17-20)

2. et al., Nucleic Acids Res., 1997, vol. 25 (pg. 24-27)

3. Science, 1985, vol. 227 (pg. 1435-1441)

4. Proceedings of the Third South American Workshop on String Processing, 1996, Ottawa, Carleton University Press (pg. 101-114)

5. et al., Science, 1996, in press

6. Proceedings Fourth International Conference on Intelligent Systems for Molecular Biology, 1996, Menlo Park, California, AAAI Press (pg. 98-108)

7. Methods Enzymol., 1996, vol. 266 (pg. 3-27)

8. Protein Seq. Data Anal., 1993, vol. 5 (pg. 357-399)

9. Nucleic Acids Res., 1996, vol. 24 (pg. 1-5)

10. Nucleic Acids Res., 1996, vol. 24 (pg. 21-25)

11. Methods Enzymol., 1996, vol. 266 (pg. 59-71)

12. Methods in Protein Structure Analysis, 1995, New York, Plenum Press, pg. 473

13. Naturwissenschaften, 1975, vol. 62, pg. 154

14. Fed. Proc., 1976, vol. 33, pg. 2314

15. J. Mol. Evol., 1975, vol. 7, pg. 1

16. Program Manual for the Wisconsin Package, Version 8, 1994 9, Genetics Computer Group, 575 Science Drive, Madison, Wisconsin, USA 53711

17. Nucleic Acids Res., 1996, vol. 24 (pg. 46-49)

18. Proteins to Cell Metabolism, 1995, Braunschweig, Germany, GBF Monographs, pg. 18

© 1997 Oxford University Press