Integr8 and Genome Reviews: integrated views of complete genomes and proteomes

Abstract

Integr8 is a new web portal for exploring the biology of organisms with completely deciphered genomes. For over 190 species, Integr8 provides access to general information, recent publications, and a detailed statistical overview of the genome and proteome of the organism. The preparation of this analysis is supported through Genome Reviews, a new database of bacterial and archaeal DNA sequences in which annotation has been upgraded (compared to the original submission) through the integration of data from many sources, including the EMBL Nucleotide Sequence Database, the UniProt Knowledgebase, InterPro, CluSTr, GOA and HOGENOM. Integr8 also allows users to customize their own interactive analyses, and to download both customized and prepared datasets for their own use. Integr8 is available at http://www.ebi.ac.uk/integr8.

INTRODUCTION

Since the advent of whole genome sequencing in the mid-1990s, the sequences of over 190 cellular organisms have been completely determined, annotated and deposited in the public repositories. The rate of deposition of such sequences is still increasing, with over 90 such genomes sequenced and made available since March 2003. The availability of these data has enabled the development of new ways to interpret information about individual genes and proteins in their biological context, and has underpinned the development of new experimental and theoretical fields such as transcriptomics, proteomics and systems biology. However, these new technologies have generated enormous quantities of data, meaning that the information needed to draw scientific conclusions is increasingly likely to be spread over many different primary resources, which do not necessarily maintain common identifiers for the data items they describe, or even agree on the definition of common terms (1). Coherently integrating such data, and offering access to it, has thus emerged as one of the most important challenges in bioinformatics. We have addressed these problems by releasing a new database, Genome Reviews, in which updated annotation is added to genomic sequence data; and a new web interface, Integr8, offering interactive access to data integrated from Genome Reviews and other resources, centred around organisms with completely sequenced genomes.

GENOME REVIEWS

Motivation

The International Nucleotide Sequence Database, a collaboration among the EMBL (2), GenBank (3) and DDBJ (4) nucleotide sequence databases, is the usual primary public repository for DNA and RNA sequence and annotation, including completed genome sequences. As repositories, these databases allow submitters to retain ownership of their own data; in consequence, the annotation of different entries is often not standardized in format or update frequency. Moreover, because annotation for predicted genes is often inferred by similarity to other sequences, information can become out of date simply through the submission of additional entries to the databases. Additionally, theoretical annotation inferred from a sequence (which is commonly present in database submissions) is usually not well integrated with data derived from laboratory experiments. These issues can only be addressed through active curation of database entries; but information introduced into well-annotated resources such as the UniProt Knowledgebase (5) cannot be incorporated into the archive genome sequence record except at the instigation of the submitter. Improved genome annotation has been made available by RefSeq (6). However, these data use their own identifiers and do not necessarily contain cross-references to databases such as EMBL or UniProt.

Therefore, we have launched Genome Reviews, to make genome sequence available with standardized, up-to-date annotation while maintaining cross-references to the primary submission (and to entries in other databases with cross-references to it). To ensure compatibility with existing tools, Genome Reviews is distributed in an extended version of the flat file format used by EMBL. The initial scope of Genome Reviews is prokaryotic genomes. Release 7 (made on August 4, 2004) contains files for 187 chromosomes and 105 plasmids, representing the complete genomes of 170 species.

Propagation of data to Genome Reviews

EMBL entries describe nucleotide sequences, features (annotated regions of sequence) and feature qualifiers (individual annotations attached to a feature). Some annotation is also attached to the database entry itself, as opposed to the sequence (e.g. the database accession number). The ‘CDS’ (CoDing Sequence) feature is used to identify sub-sequences within the overall DNA sequence that encode a protein sequence (proteins are the most widely annotated biological entity); the ‘/db_xref’ qualifier indicates cross-references to entries in other databases. The ‘/protein_id’ qualifier uniquely identifies each CDS annotated in the entry.
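The relationship between features and their qualifiers can be made concrete with a small reader for feature-table lines. The following is a minimal sketch, not the BioJava-based parser used to build Genome Reviews: it assumes one qualifier per line and single-line locations, and the sample entry (gene name, database names and identifiers) is invented purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    key: str                      # feature type, e.g. "CDS"
    location: str                 # location string, kept unparsed
    qualifiers: dict = field(default_factory=dict)  # name -> list of values


def parse_feature_table(text):
    """Collect features and their qualifiers from 'FT' lines.

    Simplifying assumptions: every qualifier fits on one line and every
    location fits on the feature line (no continuation lines).
    """
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if not body:
            continue
        if body.startswith("/"):
            # Qualifier line: attach it to the most recent feature.
            if not features:
                continue
            name, _, value = body[1:].partition("=")
            features[-1].qualifiers.setdefault(name, []).append(value.strip('"'))
        else:
            # Feature line: key followed by its location.
            key, _, location = body.partition(" ")
            features.append(Feature(key, location.strip()))
    return features


# Invented sample data; identifiers are placeholders, not real entries.
SAMPLE = """\
FT   CDS             100..400
FT                   /gene="exA"
FT                   /db_xref="ExampleDB:X00001"
FT                   /db_xref="OtherDB:Y12345"
FT                   /protein_id="AAA00001.1"
FT   tRNA            complement(500..575)
FT                   /product="tRNA-Ala"
"""

if __name__ == "__main__":
    for feature in parse_feature_table(SAMPLE):
        print(feature.key, feature.location, feature.qualifiers)
```

Keeping qualifier values in lists reflects the fact that a qualifier such as ‘/db_xref’ can occur several times on the same feature.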

There are therefore three ways in which an entry in another database can be identified as referring to the same biological entity as a given EMBL (CDS) feature: if the EMBL feature cross-references that entry, if that entry cross-references the EMBL feature or if an entry in a third database cross-references both other entries. By tracking identifiers between databases, additional annotation belonging to a feature can be identified. The UniProt Knowledgebase (5), a well-annotated resource in which redundant submissions are merged, is a particularly useful database hub for retrieving annotation and cross-links to further resources.
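The three routes described above can be sketched as a lookup over a table of cross-references. This is an illustrative sketch only: the data structure (a dictionary from an entry identifier to the set of identifiers it cross-references) and all identifiers are assumptions made for the example, not the schema used by Genome Reviews.

```python
def linked_entries(feature_id, xrefs):
    """Find entries linked to an EMBL feature by each of the three routes.

    xrefs maps an entry identifier to the set of identifiers that entry
    cross-references. Returns three sets: (direct, reverse, via_third).
    """
    # Route 1: the feature itself cross-references the entry.
    direct = set(xrefs.get(feature_id, ()))

    # Route 2: the entry cross-references the feature.
    reverse = {
        entry for entry, targets in xrefs.items()
        if feature_id in targets and entry != feature_id
    }

    # Route 3: an entry in a third database cross-references both the
    # feature and the other entry.
    via_third = set()
    for entry in reverse:
        via_third |= set(xrefs[entry]) - {feature_id}

    return direct, reverse, via_third


# Hypothetical identifiers, for illustration only.
XREFS = {
    "EMBL:AAA00001.1": {"DB_A:a1"},
    "DB_B:b1": {"EMBL:AAA00001.1"},
    "DB_HUB:h1": {"EMBL:AAA00001.1", "DB_C:c1"},
}

if __name__ == "__main__":
    print(linked_entries("EMBL:AAA00001.1", XREFS))
```

In this toy table, `DB_HUB:h1` plays the role of a hub: it is itself found by the second route, and it exposes `DB_C:c1` by the third.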

To produce Genome Reviews, a particular preferred source is nominated for each type of annotation; and annotation of that type is imported from that source into Genome Reviews either as a supplement to or as a replacement for the annotation in the original submission. Where more than one resource may provide annotation of a certain type (e.g. gene names), redundant data are case-standardized and merged.
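A minimal sketch of this import policy follows. The source names, the ‘replace’/‘supplement’ switch and the case-standardization rule (names compared case-insensitively, with the first spelling encountered retained) are assumptions chosen for illustration; the text does not specify the exact rules applied in Genome Reviews.

```python
def import_annotation(original, imported, mode):
    """Combine original annotation with values from the preferred source.

    mode is "replace" (imported values supersede the submission) or
    "supplement" (imported values are added to the submission).
    """
    if mode not in ("replace", "supplement"):
        raise ValueError("mode must be 'replace' or 'supplement'")
    if not imported:
        return list(original)          # nothing to import: keep the submission
    if mode == "replace":
        return list(imported)
    return list(original) + [v for v in imported if v not in original]


def merge_names(names_by_source, source_order):
    """Merge redundant names from several sources, ignoring case differences.

    Sources are visited in source_order; the first spelling seen is kept
    (an assumed standardization rule).
    """
    merged, seen = [], set()
    for source in source_order:
        for name in names_by_source.get(source, []):
            key = name.casefold()
            if key not in seen:
                seen.add(key)
                merged.append(name)
    return merged


if __name__ == "__main__":
    # Hypothetical sources and gene names.
    names = {"SourceA": ["exA", "exB"], "SourceB": ["EXA", "exC"]}
    print(merge_names(names, ["SourceA", "SourceB"]))
```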

The annotation attached to other types of features (e.g. non-coding RNAs) has also been standardized, and redundant or rarely used features and feature qualifiers removed. In addition to the insertion/deletion of feature qualifiers associated with existing features in the original submission, new features have been added (for example, regions of DNA encoding the mature peptides produced after proteolytic cleavage of the primary translations) by mapping features annotated on protein sequences onto corresponding regions of DNA. CDSs identified by UniProt curators as false (i.e. unlikely to encode a real protein) have been removed.
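Mapping a feature annotated on a protein (such as a mature peptide) back onto the DNA is, in the simplest case, a matter of coordinate arithmetic. The sketch below assumes a CDS with a single contiguous location and no frameshifts, with 1-based inclusive coordinates throughout; it is an illustration of the principle rather than the procedure actually implemented in Genome Reviews.

```python
def protein_region_to_dna(cds_start, cds_end, strand, aa_start, aa_end):
    """Map residues aa_start..aa_end of a protein onto genomic coordinates.

    cds_start/cds_end: 1-based inclusive genomic bounds of the CDS
    (cds_start <= cds_end regardless of strand).
    strand: +1 for the forward strand, -1 for the complement strand.
    Returns (dna_start, dna_end) with dna_start <= dna_end.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    n_codons = (cds_end - cds_start + 1) // 3
    if not 1 <= aa_start <= aa_end <= n_codons:
        raise ValueError("protein region lies outside the CDS")

    if strand == 1:
        # Codon i occupies cds_start + 3*(i-1) .. cds_start + 3*i - 1.
        return cds_start + 3 * (aa_start - 1), cds_start + 3 * aa_end - 1
    # On the complement strand the first codon sits at the high-coordinate end.
    return cds_end - 3 * aa_end + 1, cds_end - 3 * (aa_start - 1)


if __name__ == "__main__":
    # A 300-nucleotide CDS (99 residues plus a stop codon); residues 21..99
    # stand in for a mature peptide left after cleavage of a signal peptide.
    print(protein_region_to_dna(101, 400, 1, 21, 99))    # (161, 397)
    print(protein_region_to_dna(101, 400, -1, 21, 99))   # (104, 340)
```

In both orientations the mapped region spans 237 nucleotides, i.e. three per residue for the 79 residues 21 to 99.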

Genome Reviews has been implemented using the Java programming language in conjunction with a relational database management system. An extension of the open source BioJava (7) EMBL parser has been used to prepare the files for distribution.

Content and format changes

With the propagation of data from multiple sources into Genome Reviews, the EMBL flat file format has been extended with evidence tags, attached to feature qualifiers, that make clear the origin of each piece of data included in the file. Each evidence tag consists of the name of the database and, where appropriate, the identifier of the entry within that database from which the data have been sourced. Additionally, new types of features and feature qualifiers have been introduced to describe imported data not previously present in EMBL entries. In spite of this, the number of different types of annotation has been reduced owing to the standardization of representation and the removal of redundant annotation. Some statistics indicating how Genome Reviews has been enhanced compared to the original submissions are given in Table 1. For example, the number of cross-references to other databases has been increased 4-fold and the number of cross-referenced databases 3-fold.

Table 1. Incorporation of new data and data types in Genome Reviews (compared with parent entries in the EMBL nucleotide sequence database).

| | Original EMBL entries | Genome Reviews entries |
| --------------------------------------------- | ---------------------- | ---------------------- |
| Number of feature types | 30 | 11 |
| Number of qualifier types | 42 | 28 |
| Number of feature qualifiers | 4 649 864 | 6 783 847 |
| Number of external databases cross-referenced | 6 | 18 |
| Number of ‘mat_peptide’ features | 0 | 3825 |
| Number of ‘/db_xref’ qualifiers | 631 881 | 2 527 269 |
| Number of ‘/locus_tag’ qualifiers | 367 771 | 384 899 |
| Number of evidence tags | 0 | 5 474 235 |
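The fold increases quoted for Table 1 can be checked directly from the counts. The short calculation below uses only figures taken from the table; the dictionary labels are shorthand introduced for the example.

```python
# Counts copied from Table 1: (original EMBL entries, Genome Reviews entries).
TABLE_1 = {
    "/db_xref qualifiers": (631_881, 2_527_269),
    "external databases cross-referenced": (6, 18),
    "feature qualifiers": (4_649_864, 6_783_847),
}


def fold_change(before, after):
    """Ratio of the Genome Reviews count to the original EMBL count."""
    return after / before


if __name__ == "__main__":
    for label, (before, after) in TABLE_1.items():
        print(f"{label}: {fold_change(before, after):.2f}-fold")
```

The first two ratios correspond to the 4-fold and 3-fold increases cited in the text.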

The effect of these changes on part of an individual EMBL entry can be seen in Figure 1, which illustrates the introduction of new feature and feature qualifier types, new data (added using existing EMBL features and feature qualifiers) and the use of evidence tags.

Figure 1. Incorporation of new data into Genome Reviews. The figure shows a portion of Genome Reviews entry AL009126_GR from release 7.0. Data in boldface have been added to the corresponding portion of the original submission to the EMBL/GenBank/DDBJ nucleotide sequence databases.

INTEGR8

Aims and data sources

The Integr8 portal offers an overview of information about organisms with completely sequenced genomes; statistical analyses of their genomes and proteomes, individually and in comparison with each other, using various data from multiple resources; interactive interfaces allowing users to configure their own analyses; and access to the underlying data for download. Integr8 is built on three main data sources:

  1. Genome Reviews.
  2. Non-redundant sets of UniProt entries representing each complete proteome. For prokaryotic organisms, these are constructed according to the HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) specifications (8) used in the annotation of the UniProt Knowledgebase. Sets are also available for eukaryotic organisms, prepared by filtering the UniProt Knowledgebase using information from EMBL and model organism databases such as FlyBase (9). In some species (like human) where the level of multiple submissions is very high, additional entries are filtered out according to their sequence similarity.
  3. IPI (the International Protein Index) (10) provides comprehensive protein sets for certain higher metazoan species, by combining data from the UniProt Knowledgebase, Ensembl (11) and RefSeq (6) in a non-redundant fashion.

For each proteome set, additional information has been integrated from other resources such as HAMAP (8), InterPro (12), CluSTr (13), the Gene Ontology Annotation database (GOA) (14) and others. A full list of resources available through Integr8 is provided in Table 2.

Table 2. Resources integrated in Integr8.

A simple search form provides access to summary information relevant to each species. This information includes a description, a list of recent publications, a list of the components of its genome and information about the composition of these components (such as their length, average GC content and the length and codon usage in the CDSs they contain), represented textually or graphically as is appropriate. A typical page displaying such an analysis is shown in Figure 2.

Figure 2. Genome Statistics for the fission yeast Schizosaccharomyces pombe, as represented in the Integr8 browser.
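Two of the composition statistics mentioned above, average GC content and codon usage, are simple to compute from sequence. The following sketch shows one plausible way of doing so; the handling of ambiguous bases (ignored for GC content) and of trailing partial codons (dropped) are assumptions, and the sequences are made up for the example.

```python
from collections import Counter


def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def codon_usage(cds):
    """Count each codon in a coding sequence, read in frame from position 1.

    Any trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(gc_content("ATGC"))              # 0.5
    print(codon_usage("ATGAAAAAATAA"))
```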

The search facility can also be used to search for proteins belonging to the non-redundant proteome sets, and the genes that encode them.

Proteome Analysis

Integr8 has incorporated the Proteome Analysis Database (15), to provide information about the composition of complete proteomes. Individual proteins are classified according to InterPro (12), GO (16) and CluSTr (13), and an overview of the composition of each proteome constructed on these criteria is available. For example, for each proteome, users can identify the most common protein families and domains, proteins without close relatives, or clusters of related proteins unclassified by InterPro. GO classifications for each species are summarized using a reduced set of high-level terms (GO Slim), presenting an overview of proteome function even in species where more specific annotations might not be available. A major advance in the past year has been the doubling in the coverage of CluSTr, a database that categorizes proteins into a hierarchy of clusters based on overall sequence similarity. Individual hierarchies of clusters have now been prepared for each of 109 proteomes, enabling the relationship of all paralogous proteins to be analysed in these species. Additional structural data are also available based on information derived from the Protein Data Bank (17) and HSSP (18).
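The idea behind a GO Slim summary can be illustrated by rolling each specific annotation up to the nearest high-level term among its ancestors and counting proteins per high-level term. The sketch below uses a toy ontology with invented term names rather than real GO identifiers, and stopping at the first slim term met on each path is just one simple convention; it is not a description of how Integr8 computes its summaries.

```python
from collections import Counter


def slim_terms(term, parents, slim):
    """Return the slim terms reached by walking up from term.

    parents maps a term to its parent terms; the walk along each path
    stops at the first slim term it meets (the term itself included).
    """
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
            continue
        stack.extend(parents.get(current, ()))
    return found


def summarize_proteome(annotations, parents, slim):
    """Count the proteins assigned to each slim term (once per protein)."""
    counts = Counter()
    for terms in annotations.values():
        hits = set()
        for term in terms:
            hits |= slim_terms(term, parents, slim)
        counts.update(hits)
    return counts


# Toy ontology and annotations; all names are invented for illustration.
PARENTS = {
    "translation": ["biosynthesis"],
    "biosynthesis": ["metabolism"],
    "glycolysis": ["metabolism"],
    "metabolism": ["process"],
    "transport": ["process"],
}
SLIM = {"metabolism", "transport"}
ANNOTATIONS = {
    "protein1": ["translation"],
    "protein2": ["glycolysis", "transport"],
    "protein3": ["translation", "glycolysis"],
}

if __name__ == "__main__":
    print(summarize_proteome(ANNOTATIONS, PARENTS, SLIM))
```

Counting each protein at most once per slim term keeps the summary interpretable as "number of proteins with this broad function", even when a protein carries several specific annotations under the same high-level term.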

Integr8 also offers comparative analysis, whereby the composition of multiple proteomes can be compared. A total of 160 comparative analyses (each featuring between two and four related species) have been pre-compiled; additional comparisons can be specified interactively.

Users can also configure their own comparative analysis in two ways. First, it is possible to configure a multi-species analysis based on InterPro classifications (19). Second, the data from Integr8 have been loaded into an additional search tool, BioMart, a development of the EnsMart data warehousing system (20). BioMart provides the ability to search complete proteomes and genomes using combinatorial criteria, and to customize matching data for download. BioMart also supports interactive querying between the Integr8 data and other resources, such as ArrayExpress (21), Ensembl (11) and the European Macromolecular Structure Database (22).

Downloads

The following data are available for download from Integr8: (i) Genome Reviews files; (ii) UniProt complete proteome sets; (iii) IPI datasets; (iv) files of InterPro matches for each proteome set; and (v) ‘chromosome tables’, summary files mapping proteins represented in UniProt to their genomic locations. As noted above, users can additionally customize their own data for download through BioMart.

FUTURE DEVELOPMENTS

In Genome Reviews, we address the problem of annotations describing DNA sequence features being out of date or incorrect, but not the problem that the DNA sequence features themselves may be incorrect or absent. However, the use of techniques such as statistical (23) or proteomic (24) analysis has indicated that a substantial number of gene predictions may not encode real genes [in the case of one genome, the proportion of false CDSs has been estimated at 50% (25)], and that other genes have not been described. We are therefore developing methods to map protein sequences not annotated in the original EMBL genome entries onto their corresponding genomes. Such sequences may represent corrected versions of originally annotated protein sequences, or novel protein sequences that were subsequently determined experimentally or predicted by alternative methods. In most cases, it is possible to identify (by sequence similarity searching) a putative sequence in the genomic DNA that encodes the protein, and also to describe any difference between its translation and the actual protein sequence, thereby explaining why the annotation was originally not made. In future releases of Genome Reviews, additional CDSs representing unannotated protein sequences obtained from trusted sources (including, but not limited to, the UniProt Knowledgebase) will be added to the Genome Reviews files, enabling the provision of a consistent view of the genome and proteome of each organism. We are also investigating the possibility of generating new predictions for non-coding RNA genes in a standardized fashion for all genomes. For all new features, sequence discrepancies will be annotated and the use of evidence tags will allow the source of new data to be clearly identified.

Genome Reviews files are currently available for all prokaryota, which account for over 80% of the annotated genomes currently in the public databases. It is planned to extend Genome Reviews to lower metazoan organisms in the near future. In the case of higher metazoan species, different types of problem are typically encountered, such as incomplete or unfinished sequence or annotation (as opposed to the complete, but out-of-date, information often associated with data from simpler species); and gene structure is typically more complex and less well-determined. The best current interpretation of such genomes can be found in dedicated resources such as Ensembl (11). Genome Reviews will complement Ensembl, and will not extend to higher metazoan species.

AVAILABILITY

Updated versions of Genome Reviews and Integr8 are released on a bi-weekly schedule, in synchrony with releases of the UniProt Knowledgebase (5). Genome Reviews is available from its own website (http://www.ebi.ac.uk/GenomeReviews) or through the Integr8 site (http://www.ebi.ac.uk/integr8). Files can also be downloaded via the respective FTP sites (ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews, ftp://ftp.ebi.ac.uk/pub/databases/integr8). BioMart is available at http://www.ebi.ac.uk/biomart.

ACKNOWLEDGEMENTS

This work has been funded by the award of grant number QLRI-CT-2001000015 from the European Union under the RTD program ‘Quality of Life and Management of Living Resources’.

REFERENCES