The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999

Abstract

SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include: cross-references to additional databases; a variety of new documentation files and improvements to TrEMBL, a computer annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT. The URLs for SWISS-PROT on the WWW are: http://www.expasy.ch/sprot and http://www.ebi.ac.uk/sprot

Introduction

SWISS-PROT (1) is an annotated protein sequence database, which was created at the Department of Medical Biochemistry of the University of Geneva and has been a collaborative effort of the Department and the European Molecular Biology Laboratory (EMBL), since 1987. SWISS-PROT is now an equal partnership between the EMBL and the newly created Swiss Institute of Bioinformatics (SIB). The EMBL activities are carried out by its Hinxton Outstation, the European Bioinformatics Institute (EBI) (2).

The SWISS-PROT protein sequence data bank consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardisation purposes the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. A sample SWISS-PROT entry is shown in Figure 1.
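As a rough illustration of the line-type structure described above, the following Python sketch groups the lines of one flat-file entry by their two-letter line code. The sample entry, its identifier and its accession number are invented for illustration, and the function name `group_by_line_code` is hypothetical; only the general layout (a line code, a few blanks, then the content, with `//` closing the entry) is assumed from the flat-file format.

```python
from collections import defaultdict

def group_by_line_code(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue                    # ignore blank lines
        if line.startswith("//"):       # entry terminator
            break
        code, content = line[:2], line[5:]
        if code.strip():                # coded line such as ID, AC, DE, KW
            groups[code].append(content)
        else:                           # sequence data lines start with blanks
            groups["sequence"].append(content.replace(" ", ""))
    return dict(groups)

# Invented toy entry, not a real SWISS-PROT record.
SAMPLE = """\
ID   EXAMPLE_HUMAN  STANDARD;  PRT;  12 AA.
AC   X00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
KW   Hypothetical protein.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

entry = group_by_line_code(SAMPLE)
print(entry["DE"])
print("".join(entry["sequence"]))
```

Keeping every line type in its own list makes it straightforward to hand the comment, feature or cross-reference lines to more specific parsers.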

The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct criteria: (i) annotation, (ii) minimal redundancy and (iii) integration with other databases.

Annotation

In SWISS-PROT two classes of data can be distinguished: the core data and the annotation. For each sequence entry the core data consists of the sequence data; the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein), while the annotation consists of the description of the following items:

We try to include as much annotation information as possible in SWISS-PROT. To obtain this information we use, in addition to the publications reporting new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins.

We believe that our systematic recourse both to publications other than those reporting the core data and to subject referees represents a unique and beneficial feature of SWISS-PROT. In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Most comments are classified by ‘topics’; this approach permits the easy retrieval of specific categories of data from the database.
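To show how topic-classified comments lend themselves to retrieval, here is a minimal Python sketch that collects CC lines by topic. It assumes the flat-file convention in which a comment block opens with `-!- TOPIC:` and continues on indented CC lines; the comment text and the function name `comments_by_topic` are made up for illustration.

```python
import re

def comments_by_topic(cc_lines):
    """Map each comment topic to its text, joining continuation lines."""
    topics = {}
    current = None
    for line in cc_lines:
        content = line[5:]                          # drop the 'CC   ' prefix
        match = re.match(r"-!- ([A-Z ]+): (.*)", content)
        if match:                                   # a new topic block starts
            current = match.group(1)
            topics[current] = match.group(2)
        elif current is not None:                   # continuation of the block
            topics[current] += " " + content.strip()
    return topics

# Invented comment lines, not taken from a real entry.
CC_LINES = [
    "CC   -!- FUNCTION: HYPOTHETICAL EXAMPLE; TRANSPORTS A TOY LIGAND",
    "CC       ACROSS THE MEMBRANE.",
    "CC   -!- SUBCELLULAR LOCATION: MEMBRANE-BOUND.",
]

topics = comments_by_topic(CC_LINES)
print(topics["FUNCTION"])
```

A query such as "all entries with a SUBCELLULAR LOCATION comment" then reduces to a dictionary lookup per entry.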

Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In SWISS-PROT we try as much as possible to merge all these data so as to minimise the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding SWISS-PROT entry.

Figure 1. A sample entry from SWISS-PROT.

Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialised data collections. Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT. For example the sample sequence shown in Figure 1 contains, among others, DR (Data bank Reference) lines that point to EMBL, PDB, OMIM and PROSITE. In this particular example it is therefore possible to retrieve the nucleic acid sequence(s) that code for that protein (EMBL), the description of genetic disease(s) associated with that protein (OMIM), the 3D structure (PDB) or the pattern specific for that family of proteins (PROSITE).
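The cross-reference mechanism can be sketched in a few lines of Python. The example assumes that a DR line holds semicolon-separated fields, with the database name first and a closing full stop; all identifiers below are placeholders rather than real accession numbers, and the function names are hypothetical.

```python
from collections import defaultdict

def parse_dr_line(line):
    """Split one DR line into (database name, list of remaining fields)."""
    fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
    return fields[0], fields[1:]

def cross_references(dr_lines):
    """Collect the pointers of an entry, grouped by target database."""
    refs = defaultdict(list)
    for line in dr_lines:
        database, identifiers = parse_dr_line(line)
        refs[database].append(identifiers)
    return dict(refs)

# Placeholder DR lines; the identifiers are invented.
DR_LINES = [
    "DR   EMBL; X00000; CAA00000.1; -.",
    "DR   OMIM; 000000; -.",
    "DR   PROSITE; PS00000; EXAMPLE_PATTERN; 1.",
]

refs = cross_references(DR_LINES)
print(sorted(refs))
print(refs["EMBL"][0][0])
```

With the pointers grouped by database, following a link is a matter of passing the primary identifier to the retrieval interface of the target database.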

Recent Developments

Model organisms

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:

The organisms currently selected are: Arabidopsis thaliana (mouse-ear cress), Bacillus subtilis, Caenorhabditis elegans (worm), Candida albicans, Dictyostelium discoideum (slime mold), Drosophila melanogaster (fruit fly), Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Homo sapiens (human), Methanococcus jannaschii, Mus musculus (mouse), Mycobacterium tuberculosis, Mycoplasma genitalium, Saccharomyces cerevisiae (budding yeast), Salmonella typhimurium, Schizosaccharomyces pombe (fission yeast) and Sulfolobus solfataricus.

Table 1 lists, for each of the above model organisms, the name of the specialised database to which cross-references are available, the name of the SWISS-PROT index file and the number of sequences in SWISS-PROT.

Collectively these organisms represent about 40% of the total number of sequence entries in SWISS-PROT. We are currently attempting to finish the integration into SWISS-PROT of all the putative proteins from E.coli, B.subtilis, M.jannaschii and yeast.

New model organisms will soon be added to the list; these will include at least one additional archaebacterial species, a cyanobacterium (probably Synechocystis sp. PCC 6803) and a plant (probably maize).

Documentation files

SWISS-PROT is distributed with a large number of documentation files. Some of these files have been available for a long time (the user manual, release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. Table 2 lists all the documents that are currently available.

Table 1. Model organisms in SWISS-PROT

New cross-references

We have recently added cross-references that link SWISS-PROT to the Pfam Protein families' database of alignments and HMMs (3).

Currently, SWISS-PROT is linked to 29 different databases and has consolidated its role as the major focal point of biomolecular databases interconnectivity. In release 36, there is an average of 3.5 cross-references for each sequence entry.

The ‘explicit’ links stored in the ‘DR’ lines of the flat file version of SWISS-PROT are supplemented by an additional category of links that we term ‘implicit’. Implicit links are only available through the ExPASy www version of SWISS-PROT (see the practical information section) and are automatically generated by the server software. They further enhance the interoperability offered by SWISS-PROT by allowing users to navigate through additional and complementary information resources. There are two broad categories of implicit links as outlined below.

Table 2. List of documents available in SWISS-PROT

While implicit links are quite useful, one must remember that:

TrEMBL—a computer annotated supplement to SWISS-PROT

Introduction. Due to the increased data flow from genome projects to the sequence databases we face a number of challenges to our way of database annotation. Maintaining the high quality of sequence and annotation in SWISS-PROT requires careful sequence analysis and detailed annotation of every entry. This is the rate-limiting step in the production of SWISS-PROT. On one hand we do not wish to relax the high editorial standards of SWISS-PROT and it is clear that there is a limit to how much we can accelerate the annotation procedures. On the other hand, it is also vital that we make new sequences available as quickly as possible. To address this concern, we introduced in 1996 TrEMBL (Translation of EMBL nucleotide sequence database). TrEMBL consists of computer-annotated entries derived from the translation of all coding sequences (CDS) in the EMBL database, except for CDS already included in SWISS-PROT.

Current status. In August 1998, TrEMBL release 7 was produced. Release 7 was based on the translation of all 327 000 CDS in the EMBL Nucleotide Sequence Database release 55. Around 109 000 of these CDS were already present as sequence reports in SWISS-PROT and were thus excluded from TrEMBL. The remaining 218 000 sequence entries have been automatically merged whenever possible to reduce redundancy in TrEMBL. This step led to 193 860 TrEMBL entries.

We have split TrEMBL into two main sections; SP-TrEMBL and REM-TrEMBL: SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (165 420 in release 7) which should be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned to these entries. SP-TrEMBL is partially redundant against SWISS-PROT, since ∼40 000 of these entries are only additional sequence reports of proteins already in SWISS-PROT. For TrEMBL to act as a computer-annotated supplement to SWISS-PROT, new procedures have been introduced to remove redundancy and to automatically add highly reliable annotation.
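The release 7 figures quoted above can be cross-checked with a little arithmetic. The short Python sketch below uses only numbers stated in the text (the first two are given there as rounded values); the variable names are ours.

```python
cds_total = 327_000          # CDS in EMBL release 55 (rounded in the text)
cds_in_swissprot = 109_000   # CDS already in SWISS-PROT (rounded in the text)
trembl_entries = 193_860     # TrEMBL entries after automatic merging
sp_trembl = 165_420          # entries in the SP-TrEMBL section

remaining = cds_total - cds_in_swissprot     # CDS translated into TrEMBL
merged_away = remaining - trembl_entries     # entries removed by merging
rem_trembl = trembl_entries - sp_trembl      # entries outside SP-TrEMBL

print(remaining)     # 218000
print(merged_away)   # 24140
print(rem_trembl)    # 28440
```

The last value agrees with the size of about 28 440 entries given for REM-TrEMBL, and the merging step removes roughly 11% of the remaining 218 000 entries.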

The first step is the reduction of redundancy. All full-length proteins in SP-TrEMBL with the same sequence are merged into one entry. All fragment proteins with the same sequence from the same organism are merged, provided they do not belong to a highly variable category of proteins like MHC proteins or viral proteins. For all SWISS-PROT entries, the CRC32 checksums of all the different annotated sequence reports are calculated and compared with the checksums of all SP-TrEMBL entries. Identified matches are removed from SP-TrEMBL and integrated into the corresponding SWISS-PROT entries. Merging sub-fragments with full-length sequences and conflicting sequence reports about the same sequence further reduces the redundancy. Although these merging operations are automated, all merged entries are finally checked by biologists to avoid the merging of sequences from two different but highly similar genes into one entry. We use LASSAP (8) to identify sub-fragments to be merged with full-length sequences and to identify conflicting sequence reports about the same sequence. This new set of matches is removed from SP-TrEMBL and integrated into the corresponding SWISS-PROT or SP-TrEMBL entries.
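The checksum comparison can be illustrated with a minimal Python sketch. It uses `zlib.crc32` from the standard library as a stand-in, which may not be the same CRC32 variant as the one used in production; the extra exact-sequence comparison that guards against checksum collisions is our own addition, and all accession numbers and sequences are invented.

```python
import zlib
from collections import defaultdict

def crc32_of(sequence):
    """CRC32 checksum of a protein sequence given as a plain string."""
    return zlib.crc32(sequence.encode("ascii"))

def split_against_swissprot(sp_reports, trembl_entries):
    """Separate SP-TrEMBL entries that duplicate a SWISS-PROT sequence report.

    sp_reports: dict mapping accession -> list of annotated sequence reports.
    trembl_entries: list of (accession, sequence) pairs.
    Returns (matched, kept): matched maps a TrEMBL accession to the
    SWISS-PROT accession it duplicates; kept lists the unmatched entries.
    """
    index = defaultdict(list)
    for accession, sequences in sp_reports.items():
        for sequence in sequences:
            index[crc32_of(sequence)].append((accession, sequence))

    matched, kept = {}, []
    for t_accession, t_sequence in trembl_entries:
        candidates = index.get(crc32_of(t_sequence), [])
        # Confirm the checksum hit by comparing the sequences themselves.
        hit = next((acc for acc, seq in candidates if seq == t_sequence), None)
        if hit is None:
            kept.append((t_accession, t_sequence))
        else:
            matched[t_accession] = hit
    return matched, kept

# Toy data with invented accession numbers.
sp_reports = {"P00001": ["MKTAYIAK", "MKTAYIAR"]}
trembl_entries = [("O00001", "MKTAYIAR"), ("O00002", "MSSSLLLK")]
matched, kept = split_against_swissprot(sp_reports, trembl_entries)
print(matched, kept)
```

Indexing the SWISS-PROT checksums once keeps the comparison close to linear in the number of entries, which matters when the whole of SP-TrEMBL is rescanned at every release.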

The second post-processing step is the information enhancing process. All SP-TrEMBL entries are scanned for PROSITE patterns (9). If a matching pattern is found, a three-step procedure is used to reduce the number of false positive hits. Firstly, the taxonomic classification of the SP-TrEMBL entry must be within the known taxonomic range of the PROSITE pattern. For instance, a match of an a priori prokaryotic pattern against a human protein is regarded as false positive and filtered out. Secondly, the significance of the PROSITE pattern match is checked. This is done by a second check of the SP-TrEMBL sequence with a set of secondary patterns derived from the PROSITE pattern. These secondary patterns are computed with the eMotif algorithm (10). The PROSITE database contains a list of all SWISS-PROT proteins that are true members of the relevant protein family. For each pattern, the true positive sequences are aligned and fed into eMotif, which computes a nearly optimal set of regular expressions, based on statistical rather than biological evidence. We used a stringency of 10^-9, so that each eMotif pattern is expected to produce, at random, one false positive hit in 10^9 matches. Thirdly, in cases where a protein family is characterised by more than one PROSITE signature, all signatures must be found in the entry. For instance, bacterial rhodopsins have a signature for a conserved region in helix C and another signature for the retinal binding lysine. If an SP-TrEMBL entry matches only the helix-C-pattern, but not the retinal-binding pattern, it will not be regarded as a bacterial rhodopsin.
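The three confirmation steps can be sketched in Python as follows. The sketch translates a subset of the PROSITE pattern syntax into regular expressions and then applies the taxonomic, secondary-pattern and all-signatures checks. The family definition, sequences and secondary patterns are toy data, the function names are hypothetical, and the rule that at least one secondary pattern must match is an assumption: the text does not say exactly how the eMotif patterns are combined.

```python
import re

# One PROSITE element: optional '<', a residue class, optional repeat, optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (subset of the syntax) into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":                       # any residue
            core = "."
        elif core.startswith("{"):            # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:                   # x(3) -> {3}, x(2,4) -> {2,4}
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)

def confirm_family(sequence, lineage, family):
    """Apply the three confirmation steps to one candidate sequence."""
    # Step 1: the organism must lie within the known taxonomic range.
    if not set(lineage) & family["taxa"]:
        return False
    # Step 2 (assumption): at least one secondary pattern must also match.
    if not any(re.search(p, sequence) for p in family["secondary"]):
        return False
    # Step 3: every signature of the family must be found.
    return all(re.search(prosite_to_regex(p), sequence)
               for p in family["signatures"])

# Hypothetical family with two signatures; not a real PROSITE entry.
TOY_FAMILY = {
    "taxa": {"Archaea"},
    "signatures": ["C-x(2)-C.", "[KR]-x-{P}-H>."],
    "secondary": ["CA[AG]C"],
}

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))
print(confirm_family("MACAACGGKAGH", ["Archaea", "Euryarchaeota"], TOY_FAMILY))
```

A sequence that matches only the first signature, or that comes from an organism outside the taxonomic range, is rejected even though a plain pattern scan would report a hit.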

The raw PROSITE hits and all results of the confirmation steps are stored in a hidden section of the SP-TrEMBL entry, but only those hits that satisfy all confirmation conditions are made publicly visible in a DR PROSITE line. Approximately 35% of all SP-TrEMBL entries can be characterised by a PROSITE signature but only around 30% of all SP-TrEMBL entries are true positive matches. The characterisation based only on PROSITE patterns would lead to 10–20% of false positive assignments. The confirmation steps reduce the level of characterisation by nearly a third to 25%. At this stage, we achieve a level of less than 0.07% of false positive assignments.

Whenever an SP-TrEMBL entry is recognised by our procedures as a true member of a certain protein family, annotation about the potential function, active sites, cofactors, binding sites, domains and subcellular location is added to the entry. The main source of the annotation is compiled by extracting the annotation that is common to all SWISS-PROT entries of the relevant protein family. For every protein family, a ‘virtual SWISS-PROT entry’ is created computationally, which is based on the specific annotation valid for all SWISS-PROT members of this family. If we are sure that a new SP-TrEMBL protein belongs to a certain family, we can immediately transfer the annotation of the virtual entry for this family. The annotation is flagged as annotation based on comparative analysis (‘BY SIMILARITY’).
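The idea of a virtual entry, namely keeping only the annotation shared by every family member, amounts to a set intersection. The Python sketch below models each member entry as a dictionary from annotation field to a set of values; this data model, the field names and the toy values are assumptions made for illustration and do not reflect the production system.

```python
def virtual_entry(members):
    """Return the annotation common to all member entries of a family.

    members: non-empty list of dicts mapping annotation field -> set of values.
    """
    shared_fields = set.intersection(*(set(m) for m in members))
    virtual = {}
    for field in shared_fields:
        common = set.intersection(*(m[field] for m in members))
        if common:                     # keep only values present in every member
            virtual[field] = common
    return virtual

def transfer(virtual):
    """Flag transferred annotation as inferred by comparative analysis."""
    return {field: {value + " (BY SIMILARITY)" for value in values}
            for field, values in virtual.items()}

# Toy family members with invented annotation values.
members = [
    {"KW": {"Toy keyword A", "Toy keyword B"}, "COFACTOR": {"TOY METAL ION"}},
    {"KW": {"Toy keyword A"}, "COFACTOR": {"TOY METAL ION"}, "SUBUNIT": {"DIMER"}},
]

virtual = virtual_entry(members)
print(virtual)
print(transfer(virtual))
```

Because only values present in every member survive, annotation that is specific to one organism or one paralogue is never propagated to new entries.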

The ‘virtual SWISS-PROT entries’ have a far-reaching effect on SP-TrEMBL. For example, the virtual entry for Rubisco affects more than 2000 SP-TrEMBL entries. Therefore we developed a system to decompose these virtual entries into rules, which are stored in a relational database. This rule-based system enables us to express the membership criteria for each protein family in a formal language. Furthermore, subfamilies have been introduced to meet the SWISS-PROT standard more closely. For instance, the ribosomal protein L1 family is found in all known species, but the annotation added to SP-TrEMBL entries of this family obviously depends on the taxonomic kingdom. The description reads ‘50S RIBOSOMAL PROTEIN L1’ for prokaryotes, archaebacteria, chloroplasts and cyanelles, and ‘60S RIBOSOMAL PROTEIN L10A’ for non-chloroplast encoded proteins of eukaryotes.
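A rule table of this kind might look as follows in Python. The sketch stores rules as plain rows, as they could be held in a relational table, and evaluates them in order; the column names, the `None` wildcard convention and the function name `describe` are hypothetical, while the two description lines and their taxonomic conditions are taken from the ribosomal protein L1 example in the text.

```python
# Each row is one rule; None means "no constraint on this column".
RULES = [
    {"family": "L1", "kingdom": None,
     "organelle": {"chloroplast", "cyanelle"},
     "description": "50S RIBOSOMAL PROTEIN L1"},
    {"family": "L1", "kingdom": {"prokaryote", "archaebacteria"},
     "organelle": None,
     "description": "50S RIBOSOMAL PROTEIN L1"},
    {"family": "L1", "kingdom": {"eukaryote"},
     "organelle": None,
     "description": "60S RIBOSOMAL PROTEIN L10A"},
]

def describe(family, kingdom, organelle=None):
    """Return the description line of the first rule that matches, if any."""
    for rule in RULES:
        if rule["family"] != family:
            continue
        if rule["kingdom"] is not None and kingdom not in rule["kingdom"]:
            continue
        if rule["organelle"] is not None and organelle not in rule["organelle"]:
            continue
        return rule["description"]
    return None

print(describe("L1", "prokaryote"))
print(describe("L1", "eukaryote"))
print(describe("L1", "eukaryote", organelle="chloroplast"))
```

Placing the organelle rule first ensures that chloroplast-encoded proteins of eukaryotes receive the 50S description before the general eukaryotic rule is considered.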

We also use the ENZYME database (11), using the EC number as a reference point, to generate standardised description lines for enzyme entries and to allow information such as catalytic activity, cofactors and relevant keywords to be taken from ENZYME and to be added automatically to SP-TrEMBL entries. Furthermore we use specialised databases like FlyBase (12) and MGD (13) to transfer information such as the correct gene nomenclature and cross-references to these databases into SP-TrEMBL entries. The automatic analysis and annotation of TrEMBL entries is redone and updated at every TrEMBL release.

REM-TrEMBL (REMaining TrEMBL) contains the entries (about 28 440 in release 7) that we do not want to include in SWISS-PROT. This section is organised into five subsections:

Practical Information

The use of SWISS-PROT is free for academic users. However, in September 1998 we introduced an annual subscription fee for commercial users of the database. The SIB and the EMBL/EBI mandated a new company, Geneva Bioinformatics (GeneBio) (see http://www.genebio.com), to act as their representative for the purpose of concluding the necessary license agreements and levying the fees. The funds raised will be used at SIB and the EBI to bring SWISS-PROT up to date, to keep it up to date, and to further enhance its quality. Further information on this new system is available from the www addresses: http://www.expasy.ch/announce/ and http://www.ebi.ac.uk/news.html

Content of the current SWISS-PROT release

Currently (November 1998), SWISS-PROT contains ∼76 000 sequence entries, comprising 27.2 million amino acids abstracted from ∼60 000 references. The data file (sequences and annotations) requires 155 Mb of disk storage space. The documentation and index files require ∼55 Mb of disk space.

Interactive access to SWISS-PROT and TrEMBL

The most efficient and user-friendly way to browse interactively in SWISS-PROT or TrEMBL is to use the World-Wide Web (www) molecular biology server ExPASy (14) as well as the one developed by the EBI. The ExPASy Web server was made available to the public in September 1993. In October 1998 a cumulative total of 34 million connections was attained. It may be accessed through its URL, which is: http://www.expasy.ch/

The EBI server is accessible under: http://www.ebi.ac.uk/

On both the ExPASy and the EBI Web servers, you can use the Sequence Retrieval System (SRS) (15) software package to query and retrieve sequence entries. The EBI and SIB also offer a range of search services to run Smith-Waterman, FASTA and BLAST sequence similarity searches against SWISS-PROT and TrEMBL.

How to obtain the full SWISS-PROT and/or TrEMBL releases

SWISS-PROT + TrEMBL is distributed on CD-ROM by the EMBL Outstation—the European Bioinformatics Institute (EBI) (2). The CD-ROMs contain SWISS-PROT + TrEMBL, the EMBL Nucleotide Sequence Database as well as other data collections and some database query and retrieval software for MS-DOS and Apple Macintosh computers. For all enquiries regarding the subscription and distribution of SWISS-PROT + TrEMBL one should contact: The EMBL Outstation—The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. Tel: (+44 1223) 494 444; Fax: (+44 1223) 494 468; Email: datalib@ebi.ac.uk.

If you have access to a computer system linked to the Internet you can obtain SWISS-PROT using anonymous FTP (File Transfer Protocol), from the following file servers: ftp.expasy.ch and ftp.ebi.ac.uk

How to submit data or updates/corrections to SWISS-PROT

To submit new sequence data to SWISS-PROT and for all enquiries regarding the submission of SWISS-PROT one should contact: SWISS-PROT, The EMBL Outstation—The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. Tel: (+44 1223) 494 462; Fax: (+44 1223) 494 468; Email: datasubs@ebi.ac.uk (for submission); datalib@ebi.ac.uk (for enquiries).

To submit updates and/or corrections to SWISS-PROT you can either use the Email address: swiss-prot@expasy.ch or the www address: http://www.expasy.ch/sprot/sp_update_form.html

Release frequency, weekly updates and non-redundant data sets

The current distribution frequency is four releases per year. Weekly updates are also available; these updates are available by anonymous FTP. For SWISS-PROT, three files are updated every week:

For TrEMBL, a file containing all the new entries since the last full release (trembl_new.dat) is updated every week.

These files are available on the EBI and ExPASy servers, whose Internet addresses are listed above.

Every week we also produce a complete non-redundant protein sequence collection by providing three compressed files (these are in the directory ‘/databases/sp_tr_nrdb’ on the ExPASy FTP server and in /pub/databases/sp_tr_nrdb on the EBI server): sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z.

This set of non-redundant files is especially important for two types of users:

Swiss-Shop

Swiss-Shop is an automated sequence alerting system which allows users to obtain, by Email, new sequence entries relevant to their field(s) of interest. Keyword-based and sequence/pattern-based requests are possible. Every time a weekly SWISS-PROT release is performed, all new database entries matching the user-specified search keywords or patterns and the entries showing sequence similarities to the user-specified sequence will be sent automatically to the user by Email. Swiss-Shop requests can be submitted to: http://www.expasy.ch/swisshop/

References

1. Nucleic Acids Res., 1998, vol. 26 (pp. 38-42)

2. Nucleic Acids Res., 1998, vol. 26 (pp. 8-15)

3. Nucleic Acids Res., 1998, vol. 26 (pp. 320-322)

4. Nucleic Acids Res., 1998, vol. 26 (pp. 323-326)

5. Bioinformatics, 1998, vol. 14 (pp. 164-187)

6. ISMB, 1998, vol. 6 (pp. 212-221)

7. Electrophoresis, 1997, vol. 18 (pp. 2774-2780)

8. Comput. Applic. Biosci., 1997, vol. 13 (pp. 137-143)

9. Nucleic Acids Res., 1997, vol. 25 (pp. 217-221)

10. Proc. Natl Acad. Sci. USA, 1998, vol. 95 (pp. 5865-5871)

11. Nucleic Acids Res., 1996, vol. 24 (pp. 221-222)

12. Flybase Consortium, Nucleic Acids Res., 1998, vol. 26 (pp. 85-88)

13. The Mouse Genome Informatics Group, Nucleic Acids Res., 1998, vol. 26 (pp. 130-137)

14. Trends Biochem. Sci., 1994, vol. 19 (pp. 258-260)

15. Comput. Applic. Biosci., 1993, vol. 9 (pp. 49-57)

© 1999 Oxford University Press