InParanoid 7: new algorithms and tools for eukaryotic orthology analysis (original) (raw)
Journal Article
,
Search for other works by this author on:
,
Search for other works by this author on:
,
Search for other works by this author on:
,
Search for other works by this author on:
,
Search for other works by this author on:
,
Search for other works by this author on:
,
Search for other works by this author on:
Search for other works by this author on:
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Present address: Tina Köstler, Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, Dr Bohr Gasse 9, A-1030 Wien, Austria.
Received:
15 September 2009
Accepted:
08 October 2009
Published:
05 November 2009
Cite
Gabriel Östlund, Thomas Schmitt, Kristoffer Forslund, Tina Köstler, David N. Messina, Sanjit Roopra, Oliver Frings, Erik L. L. Sonnhammer, InParanoid 7: new algorithms and tools for eukaryotic orthology analysis, Nucleic Acids Research, Volume 38, Issue suppl_1, 1 January 2010, Pages D196–D203, https://doi.org/10.1093/nar/gkp931
Close
Navbar Search Filter Mobile Enter search term Search
ABSTRACT
The InParanoid project gathers proteomes of completely sequenced eukaryotic species plus Escherichia coli and calculates pairwise ortholog relationships among them. The new release 7.0 of the database has grown by an order of magnitude over the previous version and now includes 100 species and their collective 1.3 million proteins organized into 42.7 million pairwise ortholog groups. The InParanoid algorithm itself has been revised and is now both more specific and sensitive. Based on results from our recent benchmarking of low-complexity filters in homology assignment, a two-pass BLAST approach was developed that makes use of high-precision compositional score matrix adjustment, but avoids the alignment truncation that sometimes follows. We have also updated the InParanoid web site (http://InParanoid.sbc.su.se). Several features have been added, the response times have been improved and the site now sports a new, clearer look. As the number of ortholog databases has grown, it has become difficult to compare among these resources due to a lack of standardized source data and incompatible representations of ortholog relationships. To facilitate data exchange and comparisons among ortholog databases, we have developed and are making available two XML schemas: SeqXML for the input sequences and OrthoXML for the output ortholog clusters.
INTRODUCTION
Identifying orthologs is a critical goal in genomics, because orthologs, which are defined as genes in different species which derive from a common ancestor, are likely to perform the same function (1). We call genes within a species that have duplicated after the speciation event inparalogs, and they are by definition orthologous to one or more orthologs in another species since they descended from the same gene in the last common ancestor (2). In contrast, outparalogs have duplicated before the speciation event and are therefore not orthologs. Most ortholog-finding techniques are successful in cases where there is one copy of a gene in each species. By distinguishing between in- and out-paralogs, the InParanoid algorithm can identify one-to-many and many-to-many ortholog relationships.
There are now a large number of methods for predicting ortholog sets, reflecting the wide variety of applications over which these methods have specialized [reviewed in (3–9)]. Although these methods each have nuances in their approaches, one can broadly classify them into two groups: those which focus on building ortholog groups by clustering pairwise gene relationships and those which are based on tree reconstruction. The tree methods typically reconcile gene and species trees in order to assign duplication and speciation nodes, as well as detect gene losses. Both approaches have advantages and disadvantages. The pairwise methods are more applicable on a global scale, while the tree methods more directly try to reconstruct the evolutionary scenario. In three recent orthology database comparisons that seek to assess objectively the accuracy of functional annotation against a common gene set, the previous version of InParanoid ranked at the top (4,6,9). Altenhoff and Dessimoz (9) also did a phylogenetic test, in which InParanoid performed near the top. Tree-based methods generally performed worse than pairwise clustering methods, also sometimes in the phylogenetic test.
We here present InParanoid 7, comprising 99 eukaryotic species and Escherichia coli as a prokaryotic outgroup. We describe updates to the InParanoid algorithm and compare the results to the prior implementation. The new features of the web site such as the ortholog group view with sequence tree and domain architectures are delineated. We analyze the interspecies relationships in terms of orthology content, and a comparison of the source data sets relative to the previous version of InParanoid is provided. Finally, we introduce two new data formats, SeqXML and OrthoXML, designed to overcome the challenges of aggregating gene sets and benchmarking ortholog databases.
DATA AND IMPLEMENTATION
The proteomes were obtained from various sources. If possible we tried to download the data from Ensembl (10); this is mainly due to their regular updates of the data sets. In total, 23 genomes were retrieved from Ensembl, 17 from JGI (http://www.jgi.doe.gov/), 10 from FGI (http://www.broadinstitute.org/annotation/fungi/fgi/), 7 from Flybase (11), 7 from NCBI (12), 6 from WormBase (13), 6 from Sanger (http://www.sanger.ac.uk/), 4 from Génolevures Consortium (14), 3 from TIGR (http://www.tigr.org/db.shtml), 3 from VectorBase (15), 2 from PlasmoDB (16) and 2 from CryptoDB (17). Moreover, single proteomes were downloaded from GiardiaDB (18), Panther (19), Rice Genome Annotation Project (20), Dictybase (21), CGD (22), University of Tokyo (Cyanidioschyzon merolae Genome Project: http://merolae.biol.s.u-tokyo.ac.jp/), SilkDB (23), SGD (24), SGTC (25) and TAIR (26). Inparanoid 7 comprises 99 eukaryotic proteomes as well as one prokaryotic proteome. This set spans the range of sequenced eukaryotic species and includes 19 vertebrates, 35 invertebrates, 7 plants, 21 fungi and 17 protists. As we have traditionally done with InParanoid, the bacterium E. coli K12 is included as a token representative of the prokaryotes. A complete list of all species included in Inparanoid 7, as well as links to the respective data sources, can be found at the InParanoid web site (http://InParanoid.sbc.su.se/cgi-bin/summary.cgi). We aimed to include as many species as possible. However, in order to ensure a high level of completeness and quality, as in previous versions we have only considered genomes with at least 6X coverage and <1% of unknown amino acids (i.e. ‘X’ characters appearing in the protein sequence). Moreover, to avoid high levels of redundancy in the database, new genomes are only incorporated if at least 10% of the proteins are <90% identical from already included species. In practice, however, this rule did not cause rejection of any proteomes. To prevent different transcripts of the same gene from being assigned to different ortholog groups, only the longest protein for each gene was used.
Updates to algorithm
Overlap criteria
The InParanoid algorithm relies on BLAST as the underlying homology detection tool. As BLAST is a local alignment algorithm, high-scoring matches between parts of proteins, such as conserved domains, may receive high scores even though they do not reflect a common origin for the proteins as a whole. To avoid drawing conclusions from fragment matches of this type, BLAST homology inference is only accepted if the region aligned by BLAST corresponds to a large enough fraction of the lengths of the proteins. These overlap criteria have been made more stringent in version 4.0 of the InParanoid algorithm. For a match to be accepted as nonfragment, the following must be fulfilled. For both the query and the match sequence, the distance between the first and the last aligned residue must equal or exceed 50% of the length of the sequence. Furthermore, for both the query and the match sequence, the sum of the lengths of the aligned regions on that sequence must equal or exceed 25% of the length of the sequence. Note that when there are multiple high-scoring segment pairs (HSPs), InParanoid requires that they maintain the same relative order on both sequences, and that they do not overlap by >5%.
Low-complexity filters
A general issue in homology detection is the presence of false matches resulting from unrelated proteins sharing repetitive regions or regions with very biased amino acid composition. Based on an analysis of the effect that different filters have on precision and sensitivity (27), we adopted the following approach. Compositional adjustment (28,29) is applied, as is the SEG low-complexity filter (30), set so as to mask only during seeding but not during extension (soft masking). This more stringent low-complexity filtering permitted us to lower the score threshold from 50 to 40 bits. This results in high-quality homology inferences, increasing both InParanoid's; precision and sensitivity. However, as compositional adjustment often leads to shorter alignments (27), matches accepted in the first pass are realigned using BLAST with SEG and compositional adjustment switched off, before the overlap criteria mentioned previously are applied.
Evaluation of modifications to the algorithm
To ascertain that the above modifications do not produce dramatically different results, we evaluated the sets of clusters inferred by the current and previous algorithm between selected species pairs. As the underlying sequence base has been changed significantly in some cases, this evaluation was done by rebuilding parts of InParanoid release 6 using algorithm version 4.0, and then comparing the resulting cluster sets with the original InParanoid 6 (built with algorithm version 3.2). The comparison was limited to all combinations of Homo sapiens with all species included in InParanoid 6. The full results of this analysis are included in Supplementary Data. The average number of clusters across these species comparisons hardly changed between the algorithm revisions. On average, the cluster count was 2.3% smaller with the new algorithm. One-fourth of this reduction comes from a loss of 19% of the clusters with Apis mellifera, which is understandable as the version of this genome in InParanoid 6 was of low quality and subsequently was retracted by Ensembl. A large fraction of clusters are completely identical, from 72% for H. sapiens versus Oryza sativa, to 99% for H. sapiens versus Pan troglodytes, with the fraction increasing for more closely related species. Our interpretation of this is that the new, more stringent version of the algorithm infers fewer erroneous clusters of a type more often seen between distantly related species. While the difference between cluster sets can sometimes be substantial, we are confident that the stricter criteria produce orthology inferences that are biologically more sound.
Data processing with XML
With InParanoid 7, we have introduced a new data schema based on standardized XML files. This makes it easier to efficiently process data and, more importantly, to validate the content. We have replaced plain text files with XML files in as many places as possible throughout the InParanoid workflow (Figure 1). These changes, which the following sections describe in detail, dramatically increase the flexibility and robustness of the analysis pipeline and data exchange with the third parties.
Figure 1.
A diagram showing the use of XML in the InParanoid workflow. The InParanoid convert program starts with simple FASTA files that each have a different header line format. With the help of the species.xml file, it parses and converts them to SeqXML files, which can be easily processed and validated as input to the InParanoid algorithm. On the web site, the user can choose between different data formats; currently supported are SQL, TXT, HTML and OrthoXML.
Input
It is still common to provide sequence files in the FASTA format. Although it is a relatively easy format, being human readable and having only one header line, this simplicity causes data integrity problems due to the lack of standardization. There is no generally accepted way in defining the content of the header line. Furthermore, there can be invalid characters in the sequence and multiple entries of the same gene or protein in one file. A parser or a person has to safeguard against these issues; otherwise downstream analyses can produce erroneous results, often silently. By converting to a markup language like XML, it becomes a lot easier to avoid those issues.
In the conversion process, we also exploit XML to automatically process FASTA files. This is done by creating a file, species.xml, shown in Figure 1, containing regular expression patterns which are used to parse each type of FASTA header into appropriate data fields. In addition, the file also contains species metadata (taxon ID, database repository URL etc.) which makes it possible to track the sources and versions of each data set.
XML allows validation against a schema where one can exactly define how the content can be represented. For this purpose, we developed a new XML data schema called SeqXML. The SeqXML schema (XSD) defines the skeletal structure of the sequence files and allows one to set constraints for each type of data it contains: for example, one can limit a DNA sequence to consist only of {A,G,C,T,N}. If one then tries to import a DNA sequence containing a ‘Z’, this error will be detected automatically by any XML validator.
As with FASTA, a SeqXML file not only includes the gene or protein ID, a description and the sequence itself but also provides the option to add other data such as alternative identifiers or notes. It is our hope that SeqXML will be adopted by other sequence repositories and eventually replace FASTA for distribution of proteome data sets. The Reference Genome Annotation Project (31) has declared an intention to use SeqXML for standardized proteome sequences.
Output
InParanoid supports four different output formats: as an SQL table, plain text, HTML and a new, more general XML format called OrthoXML. The OrthoXML schema is defined broadly and supports orthology data not only from InParanoid but also from other sources as well. It is primarily aimed at holding nonhierarchical ortholog groups from pairwise clustering methods, but can in principle also hold hierarchical tree structures. As with SeqXML, the schema gives the ability to create a well-defined file with orthology data. The standardization of genome projects (Reference Genome Annotation Project) will create a set of genome datafiles available to all orthology methods. We hope that different orthology inference methods will use OrthoXML for their output, as this will make it substantially easier to parse their results and compare them. See http://www.OrthoXML.org for more information on OrthoXML and SeqXML.
Web interface
The InParanoid web site http://InParanoid.sbc.su.se has received a face-lift, resulting in a much brighter and clearer look. This new style is now uniform over all subpages. Without changing the basic functionality, we were able to significantly decrease the response times for all types of database requests. Both new and familiar users will find an intuitive and easy to use interface. As in the previous version, it is possible to browse all ortholog groups for every species pair and to search for the orthologs of a particular protein using identifiers, protein sequence or free text. In addition to visual and performance improvements, some minor features have been added. For instance, it is now possible to download the results of an identifier query as XML, and the free text search allows quoting of search strings and gives overall more accurate results. In addition to the primary identifiers taken from each proteome's; source, alternative identifiers from major databases like UniProtKB or GenBank are shown for each protein if available, and these identifiers are searchable.
Another new feature is the display of neighbor-joining bootstrap trees and domain annotations for each InParanoid cluster on the details page (Figure 2). To generate these trees, the sequences of a cluster are aligned using Kalign (32). The neighbor-joining tree is calculated with Belvu (33) where 100 bootstrap replicates are generated. Protein domains were predicted with HMMER (http://hmmer.janelia.org) by searching against Pfam 23 with Pfam_ls and Pfam_fs models. The visualization of the tree together with the Pfam domain architecture is written in Java and is shown as a Java applet or as an image if the browser does not support Java. The illustration of the domains follows the Pfam graphics guidelines (http://pfam.sbc.su.se/help, ‘Guide to Graphics’).
Figure 2.
The new InParanoid web interface. The screenshot in the upper left corner shows the InParanoid clusters between O. sativa and E. coli. For every cluster, i.e. ortholog group, the members are listed with the identifiers of the proteome source and a description. The InParanoid score is shown for every cluster member and bootstrap values are given for the seed orthologs. The bootstrap value indicates the fraction of intracluster bootstrap runs that placed the seed ortholog as the best match. Clicking on the cluster number leads to the details page of the cluster (right), again listing the members and also presenting their domain annotations and a neighbor-joining bootstrap tree of them. In the tree, branches leading to sequences of the same species have the same color, and upon clicking a domain, one is redirected to its Pfam page. In addition, the details page provides a range of possibilities to further investigate the cluster. A multiple sequence alignment can be viewed in Kalignvu (37) or downloaded in various formats such as FASTA, Stockholm, MSF or SELEX. The protein tree can be can be downloaded as picture or in NH format, and it is possible to edit the tree interactively in the ATV tree viewer (38).
INPARANOID CONTENT
As in the previous release, we generated an orthology-based phylogenetic tree by UPGMA clustering of pairwise species distances derived from shared ortholog content. The distances were calculated as 1 minus the fraction of orthologous proteins, averaged over both directions (34). This ‘orthophylogram’ is now too large to be shown as a figure but can be accessed online at http://InParanoid.sbc.su.se/download/current/orthophylogram.gif.
The difference between this tree and sequence alignment-based trees is that it reflects the entire proteome's; content and the level of sequence similarity is not explicitly taken into account. Because of this, but also because of incompleteness in the proteomes themselves, it may differ from classical phylogenetic trees. For most species, it corresponds to the accepted phylogeny, but a number of noteworthy differences were observed. For instance, the guinea pig (Cavia porcellus), which is a new species in release 7, clusters with dog rather than with other rodents. The egg-laying venomous mammal platypus (Ornithorhynchus anatinus) is strangely placed at the root of all other vertebrates outside of birds, frog and fish.
Intriguingly, the macaque monkey (Macaca mulatta) is placed far outside of the other primates, even outside cow and horse. This was not the case in release 6 and appears to be an artifact of the proteome sequence. As seen in Table 1, drastic changes have been made to the proteomes of human and chimpanzee between release 6 and 7 (>25% of the sequences have been modified), but macaque is essentially unchanged. Comparing the average identity of the best BLAST HSP between H. sapiens, P. troglodytes, M. mulatta, Bos taurus and Canis familiaris in both the previous and current versions showed no major changes (see Supplementary Data).
Table 1.
Consistency for proteomes found in both InParanoid 6 and 7
Species | Identical | Sequences v7/ | Identical | Average |
---|---|---|---|---|
sequences | Sequences v6 | IDs | identity | |
Apis mellifera | 0.06 | 0.68 | – | 0.63 |
Takifugu rubripes | 0.09 | 0.84 | – | 0.83 |
Tetraodon nigroviridis | 0.09 | 0.70 | – | 0.80 |
Danio rerio | 0.29 | 1.69 | – | 0.94 |
Anopheles gambiae | 0.33 | 0.94 | – | 0.82 |
Caenorhabditis remanei | 0.34 | 1.23 | – | 0.90 |
Drosophila pseudoobscura | 0.34 | 1.62 | – | 0.95 |
Bos taurus | 0.38 | 0.94 | 0.74 | 0.92 |
Cryotococcus neoformans | 0.48 | 1.01 | – | 0.96 |
Caenorhabditis briggsae | 0.48 | 1.13 | 0.48 | 0.94 |
Mus musculus | 0.49 | 1.00 | – | 0.92 |
Oryza sativa | 0.63 | 0.75 | – | |
Entamoeba histolytica | 0.66 | 0.87 | 0.84 | |
Pan troglodytes | 0.73 | 0.95 | 0.86 | |
Homo sapiens | 0.75 | 0.94 | 0.75 | |
Debaryomyces hansenii | 0.82 | 0.99 | – | |
Drosophila melanogaster | 0.87 | 1.02 | 0.83 | |
Caenorhibditis elegans | 0.90 | 1.00 | – | |
Yarrowia lipolytica | 0.90 | 0.99 | – | |
Canis familiaris | 0.92 | 1.00 | 1.00 | |
Arabidopsis thaliana | 0.93 | 0.98 | – | |
Monodelphis domestica | 0.94 | 0.99 | 0.99 | |
Escherichia coli K12 | 0.96 | 0.98 | – | |
Kluyveromyces lactis | 0.97 | 0.95 | – | |
Gasterosteus aculeatus | 0.97 | 1.00 | 1.00 | |
Candida glabrata | 0.97 | 1.00 | – | |
Dictyostelium discoideum | 0.97 | 0.99 | 0.89 | |
Schizosaccharomyces pombe | 0.97 | 1.00 | – | |
Gallus gallus | 0.98 | 1.00 | 0.99 | |
Xenopus tropicalis | 0.98 | 0.98 | 1.00 | |
Ciona intestinalis | 0.99 | 0.99 | 1.00 | |
Saccharomyces cerevisiae | 1.00 | 1.01 | 1.00 | |
Rattus norvegicus | 1.00 | 0.97 | 1.00 | |
Aedes aegypti | 1.00 | 1.00 | 1.00 |
Species | Identical | Sequences v7/ | Identical | Average |
---|---|---|---|---|
sequences | Sequences v6 | IDs | identity | |
Apis mellifera | 0.06 | 0.68 | – | 0.63 |
Takifugu rubripes | 0.09 | 0.84 | – | 0.83 |
Tetraodon nigroviridis | 0.09 | 0.70 | – | 0.80 |
Danio rerio | 0.29 | 1.69 | – | 0.94 |
Anopheles gambiae | 0.33 | 0.94 | – | 0.82 |
Caenorhabditis remanei | 0.34 | 1.23 | – | 0.90 |
Drosophila pseudoobscura | 0.34 | 1.62 | – | 0.95 |
Bos taurus | 0.38 | 0.94 | 0.74 | 0.92 |
Cryotococcus neoformans | 0.48 | 1.01 | – | 0.96 |
Caenorhabditis briggsae | 0.48 | 1.13 | 0.48 | 0.94 |
Mus musculus | 0.49 | 1.00 | – | 0.92 |
Oryza sativa | 0.63 | 0.75 | – | |
Entamoeba histolytica | 0.66 | 0.87 | 0.84 | |
Pan troglodytes | 0.73 | 0.95 | 0.86 | |
Homo sapiens | 0.75 | 0.94 | 0.75 | |
Debaryomyces hansenii | 0.82 | 0.99 | – | |
Drosophila melanogaster | 0.87 | 1.02 | 0.83 | |
Caenorhibditis elegans | 0.90 | 1.00 | – | |
Yarrowia lipolytica | 0.90 | 0.99 | – | |
Canis familiaris | 0.92 | 1.00 | 1.00 | |
Arabidopsis thaliana | 0.93 | 0.98 | – | |
Monodelphis domestica | 0.94 | 0.99 | 0.99 | |
Escherichia coli K12 | 0.96 | 0.98 | – | |
Kluyveromyces lactis | 0.97 | 0.95 | – | |
Gasterosteus aculeatus | 0.97 | 1.00 | 1.00 | |
Candida glabrata | 0.97 | 1.00 | – | |
Dictyostelium discoideum | 0.97 | 0.99 | 0.89 | |
Schizosaccharomyces pombe | 0.97 | 1.00 | – | |
Gallus gallus | 0.98 | 1.00 | 0.99 | |
Xenopus tropicalis | 0.98 | 0.98 | 1.00 | |
Ciona intestinalis | 0.99 | 0.99 | 1.00 | |
Saccharomyces cerevisiae | 1.00 | 1.01 | 1.00 | |
Rattus norvegicus | 1.00 | 0.97 | 1.00 | |
Aedes aegypti | 1.00 | 1.00 | 1.00 |
The ‘Identical sequences’ and ‘Identical IDs’ columns show the sequence checksums and gene identifiers common to both versions as a fraction of the version with the lowest number of sequences. Most species have a high fraction of identical sequences; for those <50% the average identity using BLAST (see text) between release 6 and 7 is shown. Of those, only A. mellifera has a low average identity. Thus, although in some species a large fraction of the proteins has been modified, the modifications are generally minor. ‘—’, not applicable due to different identifier systems in the two versions
Table 1.
Consistency for proteomes found in both InParanoid 6 and 7
Species | Identical | Sequences v7/ | Identical | Average |
---|---|---|---|---|
sequences | Sequences v6 | IDs | identity | |
Apis mellifera | 0.06 | 0.68 | – | 0.63 |
Takifugu rubripes | 0.09 | 0.84 | – | 0.83 |
Tetraodon nigroviridis | 0.09 | 0.70 | – | 0.80 |
Danio rerio | 0.29 | 1.69 | – | 0.94 |
Anopheles gambiae | 0.33 | 0.94 | – | 0.82 |
Caenorhabditis remanei | 0.34 | 1.23 | – | 0.90 |
Drosophila pseudoobscura | 0.34 | 1.62 | – | 0.95 |
Bos taurus | 0.38 | 0.94 | 0.74 | 0.92 |
Cryotococcus neoformans | 0.48 | 1.01 | – | 0.96 |
Caenorhabditis briggsae | 0.48 | 1.13 | 0.48 | 0.94 |
Mus musculus | 0.49 | 1.00 | – | 0.92 |
Oryza sativa | 0.63 | 0.75 | – | |
Entamoeba histolytica | 0.66 | 0.87 | 0.84 | |
Pan troglodytes | 0.73 | 0.95 | 0.86 | |
Homo sapiens | 0.75 | 0.94 | 0.75 | |
Debaryomyces hansenii | 0.82 | 0.99 | – | |
Drosophila melanogaster | 0.87 | 1.02 | 0.83 | |
Caenorhibditis elegans | 0.90 | 1.00 | – | |
Yarrowia lipolytica | 0.90 | 0.99 | – | |
Canis familiaris | 0.92 | 1.00 | 1.00 | |
Arabidopsis thaliana | 0.93 | 0.98 | – | |
Monodelphis domestica | 0.94 | 0.99 | 0.99 | |
Escherichia coli K12 | 0.96 | 0.98 | – | |
Kluyveromyces lactis | 0.97 | 0.95 | – | |
Gasterosteus aculeatus | 0.97 | 1.00 | 1.00 | |
Candida glabrata | 0.97 | 1.00 | – | |
Dictyostelium discoideum | 0.97 | 0.99 | 0.89 | |
Schizosaccharomyces pombe | 0.97 | 1.00 | – | |
Gallus gallus | 0.98 | 1.00 | 0.99 | |
Xenopus tropicalis | 0.98 | 0.98 | 1.00 | |
Ciona intestinalis | 0.99 | 0.99 | 1.00 | |
Saccharomyces cerevisiae | 1.00 | 1.01 | 1.00 | |
Rattus norvegicus | 1.00 | 0.97 | 1.00 | |
Aedes aegypti | 1.00 | 1.00 | 1.00 |
Species | Identical | Sequences v7/ | Identical | Average |
---|---|---|---|---|
sequences | Sequences v6 | IDs | identity | |
Apis mellifera | 0.06 | 0.68 | – | 0.63 |
Takifugu rubripes | 0.09 | 0.84 | – | 0.83 |
Tetraodon nigroviridis | 0.09 | 0.70 | – | 0.80 |
Danio rerio | 0.29 | 1.69 | – | 0.94 |
Anopheles gambiae | 0.33 | 0.94 | – | 0.82 |
Caenorhabditis remanei | 0.34 | 1.23 | – | 0.90 |
Drosophila pseudoobscura | 0.34 | 1.62 | – | 0.95 |
Bos taurus | 0.38 | 0.94 | 0.74 | 0.92 |
Cryotococcus neoformans | 0.48 | 1.01 | – | 0.96 |
Caenorhabditis briggsae | 0.48 | 1.13 | 0.48 | 0.94 |
Mus musculus | 0.49 | 1.00 | – | 0.92 |
Oryza sativa | 0.63 | 0.75 | – | |
Entamoeba histolytica | 0.66 | 0.87 | 0.84 | |
Pan troglodytes | 0.73 | 0.95 | 0.86 | |
Homo sapiens | 0.75 | 0.94 | 0.75 | |
Debaryomyces hansenii | 0.82 | 0.99 | – | |
Drosophila melanogaster | 0.87 | 1.02 | 0.83 | |
Caenorhibditis elegans | 0.90 | 1.00 | – | |
Yarrowia lipolytica | 0.90 | 0.99 | – | |
Canis familiaris | 0.92 | 1.00 | 1.00 | |
Arabidopsis thaliana | 0.93 | 0.98 | – | |
Monodelphis domestica | 0.94 | 0.99 | 0.99 | |
Escherichia coli K12 | 0.96 | 0.98 | – | |
Kluyveromyces lactis | 0.97 | 0.95 | – | |
Gasterosteus aculeatus | 0.97 | 1.00 | 1.00 | |
Candida glabrata | 0.97 | 1.00 | – | |
Dictyostelium discoideum | 0.97 | 0.99 | 0.89 | |
Schizosaccharomyces pombe | 0.97 | 1.00 | – | |
Gallus gallus | 0.98 | 1.00 | 0.99 | |
Xenopus tropicalis | 0.98 | 0.98 | 1.00 | |
Ciona intestinalis | 0.99 | 0.99 | 1.00 | |
Saccharomyces cerevisiae | 1.00 | 1.01 | 1.00 | |
Rattus norvegicus | 1.00 | 0.97 | 1.00 | |
Aedes aegypti | 1.00 | 1.00 | 1.00 |
The ‘Identical sequences’ and ‘Identical IDs’ columns show the sequence checksums and gene identifiers common to both versions as a fraction of the version with the lowest number of sequences. Most species have a high fraction of identical sequences; for those <50% the average identity using BLAST (see text) between release 6 and 7 is shown. Of those, only A. mellifera has a low average identity. Thus, although in some species a large fraction of the proteins has been modified, the modifications are generally minor. ‘—’, not applicable due to different identifier systems in the two versions
However, looking at one-way fractions of shared orthologs reveals the problem. The distance ‘to _H. sapiens_’ was higher for M. mulatta than for all other species in the group. Also, the distance to chimpanzee and to orangutan was highest or second highest for M. mulatta. This indicates that macaque contains a large number of proteins that did not find orthologs in closely related species. It is possible that these are fragments or short splice variants, preventing them from being detected as orthologs. Even if the same splice variant exists in human, it would not be used by InParanoid if a longer variant exists, and the orthology may be lost due to small overlap. It thus seems that the macaque gene annotations should be updated to be more in line with other primates.
One of the orthophylogram anomalies found with InParanoid 6 was that Danio rerio was not grouped with other fishes. This is, however, the case in release 7, although as an outlier of the other fishes, not far from its placement in the previous release. Opossum, which was grouped within placental mammals, is still found in this clade, although in a different place. The orthophylogram is thus a useful tool for identifying inconsistencies in the proteome data and will hopefully spur genome annotators to improve gene predictions.
The average number of inparalogs per cluster ranged from 1.00 (between Cryptosporidium hominis and parvum) to 5.31 (Trichomonas vaginalis when compared with Giardia lamblia, both protozoans). This is in concordance with the early divergence of T. vaginalis and G. lamblia (35) as well as with C. hominis and C. parvum being closely related (36). The overall mean number of inparalogs per species was 1.46, and the median was 1.27. The distribution of cluster sizes is shown in Figure 3.
Figure 3.
Histogram of the average number of inparalogs/cluster per species for all species–species comparisons in InParanoid 7. Vertebrates and fungi generally have a lower number of inparalogs per clusters—always <3, whereas invertebrates, protists and plants can have as many as five inparalogs/cluster on average.
Proteome consistency
The input sequences used by InParanoid often changes with new releases. This can be due to a change in our sources for the data and/or changes in the genome annotations themselves. As this could result in different orthology assignment between versions, we examined whether each proteome differed with its corresponding proteome used in the previous version. For each species found in both versions, we compared sequences using checksums and identifiers. We computed a checksum for each sequence and counted the fraction of matching checksums between versions. Similarly, we counted the number of identifiers common to both versions. A large change in the number of proteins between versions (due to extensive genome reannotation, for example) could prevent a large fraction of sequences in one version from being matched in the other. We therefore calculated the fractions by dividing the matches with the number of sequences which is lowest between the two versions.
Most proteomes showed a large fraction of shared identical sequences while a minority was drastically changed (Table 1). The source for some species was changed between releases 6 and 7 of InParanoid, while in other cases all identifiers were changed by the source. A comparison of identifiers was therefore not possible in most cases, but where identifiers were comparable the consistency between the versions was generally high (Table 1).
The changes to the proteomes with a low fraction of shared identical sequences could potentially be large enough to affect the orthology assignment. In order to determine if this was the case, we performed whole-proteome BLAST comparisons of the proteomes with a low consistency between versions. Using the version with the fewest sequences as query and the version with the most sequences as database, we computed the average match identity as the number of identical residues in the best HSP divided by the length of the query. The results varied between 63% for A. mellifera to 96% for Cryptococcus neoformans, with most being above 90% (Table 1). These changes should reflect improvements in proteome quality. For example, the A. mellifera proteome previously used has been deprecated and removed from Ensembl, so the orthology assignment in the new version should be more accurate.
FUTURE PERSPECTIVES
Although the InParanoid algorithm is fully automatic, building the latest InParanoid release involved many time consuming manual steps. Perhaps the most challenging task was to gather the proteomes from difference sources in different formats and making sure that the contents are error free and complete. We hope that by the introduction of standardized proteome repositories and usage of robust XML formats much of this labor will be reduced. Much of the workflow in the InParanoid pipeline and web site is now automated using XML. The pairwise nature of the method means that its time complexity scales O(N2). Compute resources may therefore become a problem in the future, which would require more time-efficient algorithms or an incremental updating scheme.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
The authors thank Patrik Björkholm for assistance with gathering proteomes and Erik Sjölund for assistance with the web site back-end.
FUNDING
Funding for open access charge: Swedish Research Council.
Conflict of interest statement. None declared.
REFERENCES
.
Distinguishing homologous from analogous proteins
.
Syst. Zool.
(
1970
)
19
:
99
–
113
.
, .
Orthology, paralogy and proposed classification for paralog subtypes
.
Trends Genet.
(
2002
)
18
:
619
–
620
.
, , , .
Overview and comparison of ortholog databases
.
Drug Discov. Today Tech.
(
2006
)
3
:
137
–
143
.
, , , .
Benchmarking ortholog identification methods using functional genomics data
.
Genome Biol.
(
2006
)
7
:
R31
.
, .
Orthology and functional conservation in eukaryotes
.
Annu. Rev. Genet.
(
2007
)
41
:
465
–
507
.
, , , .
Assessing performance of orthology detection strategies applied to eukaryotic genomes
.
PLoS ONE
(
2007
)
2
:
e383
.
.
Large-scale assignment of orthology: back to phylogenetics?
Genome Biol.
(
2008
)
9
:
235
.
, , , .
The quest for orthologs: finding the corresponding gene across genomes
.
Trends Genet.
(
2008
)
24
:
539
–
551
.
, .
Phylogenetic and functional assessment of orthologs inference projects and methods
.
PLoS Comp. Biol.
(
2009
)
5
:
e1000262
.
, , , , , , , , , , et al.
Ensembl 2009
.
Nucleic Acids Res.
(
2009
)
37
:
D690
–
D697
.
, , , , , , , , , , et al,
FlyBase Consortium
.
FlyBase: enhancing Drosophila Gene Ontology annotations
.
Nucleic Acids Res.
(
2009
)
37
:
D555
–
D559
.
, , .
NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins
.
Nucleic Acids Res.
(
2007
)
35
:
D61
–
65
.
, , , , , , , , , , et al.
WormBase: new content and better access
.
Nucleic Acids Res.
(
2007
)
35
:
D506
–
D510
.
, , , , , ,
Génolevures Consortium
.
Génolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes
.
Nucleic Acids Res.
(
2009
)
37
:
D550
–
D554
.
, , , , , , , , , , et al.
VectorBase: a home for invertebrate vectors of human pathogens
.
Nucleic Acids Res.
(
2007
)
35
:
D503
–
D505
.
, , , , , , , , , , et al.
PlasmoDB: a functional genomic database for malaria parasites
.
Nucleic Acids Res.
(
2009
)
37
:
D539
–
D543
.
, , , , , , , , , , et al.
CryptoDB: a Cryptosporidium bioinformatics resource update
.
Nucleic Acids Res.
(
2006
)
34
:
D419
–
D422
.
, , , , , , , , , , et al.
GiardiaDB and TrichDB: integrated genomic resources for the eukaryotic protist pathogens Giardia lamblia and Trichomonas vaginalis
.
Nucleic Acids Res.
(
2009
)
37
:
D526
–
D530
.
, , , .
PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways
.
Nucleic Acids Res.
(
2007
)
35
:
D247
–
D252
.
, , , , , , , , , , et al.
The TIGR Rice Genome Annotation Resource: improvements and new features
.
Nucleic Acids Res.
(
2007
)
35
:
D883
–
D887
.
, , , , , , , , , .
DictyBase–a Dictyostelium bioinformatics resource update
.
Nucleic Acids Res.
(
2009
)
37
:
D515
–
D519
.
, , , , , , , .
Sequence resources at the Candida Genome Database (CGD)
.
Nucleic Acids Res.
(
2007
)
35
:
D452
–
D456
.
, , , , , , , , , , et al.
SilkDB: a knowledgebase for silkworm biology and genomics
.
Nucleic Acids Res.
(
2005
)
33
:
D399
–
D402
.
, , , , , , , , , , et al.
Gene Ontology annotations at SGD: new data sources and annotation methods
.
Nucleic Acids Res.
(
2008
)
36
:
D577
–
D581
.
Stanford Genome Technology Center
. C. neoformans Genome Project. Funded by the NIAID/NIH under cooperative agreement AI47087; The Institute for Genomic Research. Funded by the NIAID/NIH under cooperative agreement U01 AI48594. Data release: 23 June 2004.
, , , , , , , , , , et al.
The Arabidopsis Information Resource (TAIR): gene structure and function annotation
.
Nucleic Acids Res.
(
2008
)
36
:
D1009
–
D1014
.
, .
Benchmarking homology detection procedures with low complexity filters
.
Bioinformatics
(
2009
)
25
:
2500
–
2505
.
, , , , , , .
Protein database searches using compositionally adjusted substitution matrices
.
FEBS J.
, (
2005
)
272
:
5101
–
5109
.
, .
The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions
.
Bioinformatics
(
2005
)
21
:
902
–
911
.
, .
Statistics of local complexity in amino acid sequences and sequence databases
.
Comp. Chem.
(
1993
)
17
:
149
–
163
.
The Reference Genome Group of the Gene Ontology Consortium
.
The Gene Ontology's; Reference Genome Project: a unified framework for functional annotation across species
.
PLoS Comput. Biol.
(
2009
)
5
:
e1000431
.
, , .
Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features
.
Nucleic Acids Res.
(
2009
)
37
:
858
–
865
.
, .
Scoredist: a simple and robust protein sequence distance estimator
.
BMC Bioinformatics
(
2005
)
6
:
108
.
, , , .
InParanoid 6: eukaryotic ortholog clusters with inparalogs
.
Nucleic Acids Res
(
2008
)
36
:
D263
–
D266
.
, , , .
Molecular biology of the amitochondriate parasites, Giardia intestinalis, Entamoeba histolytica and Trichomonas vaginalis
.
Int. J., Parasitol.
(
2003
)
33
:
235
–
255
.
, , , , , , , , , , et al.
The genome of Cryptosporidium hominis
.
Nature
(
2008
)
431
:
1107
–
1112
.
, .
Kalign, Kalignvu and Mumsa: web servers for multiple sequence alignment
.
Nucleic Acids Res.
(
2006
)
34
:
W596
–
W599
.
, .
ATV: display and manipulation of annotated phylogenetic trees
.
Bioinformatics
(
2001
)
17
:
383
–
384
.
Author notes
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Present address: Tina Köstler, Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, Dr Bohr Gasse 9, A-1030 Wien, Austria.
© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Supplementary data
I agree to the terms and conditions. You must accept the terms and conditions.
Submit a comment
Name
Affiliations
Comment title
Comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.
Citations
Views
Altmetric
Metrics
Total Views 2,801
2,186 Pageviews
615 PDF Downloads
Since 4/1/2017
Month: | Total Views: |
---|---|
April 2017 | 10 |
May 2017 | 10 |
June 2017 | 7 |
July 2017 | 11 |
August 2017 | 12 |
September 2017 | 11 |
October 2017 | 14 |
November 2017 | 6 |
December 2017 | 39 |
January 2018 | 37 |
February 2018 | 19 |
March 2018 | 20 |
April 2018 | 44 |
May 2018 | 51 |
June 2018 | 54 |
July 2018 | 41 |
August 2018 | 45 |
September 2018 | 37 |
October 2018 | 24 |
November 2018 | 41 |
December 2018 | 25 |
January 2019 | 22 |
February 2019 | 18 |
March 2019 | 43 |
April 2019 | 53 |
May 2019 | 40 |
June 2019 | 40 |
July 2019 | 31 |
August 2019 | 63 |
September 2019 | 49 |
October 2019 | 32 |
November 2019 | 31 |
December 2019 | 40 |
January 2020 | 40 |
February 2020 | 42 |
March 2020 | 35 |
April 2020 | 20 |
May 2020 | 22 |
June 2020 | 20 |
July 2020 | 29 |
August 2020 | 37 |
September 2020 | 48 |
October 2020 | 24 |
November 2020 | 19 |
December 2020 | 32 |
January 2021 | 17 |
February 2021 | 36 |
March 2021 | 40 |
April 2021 | 9 |
May 2021 | 19 |
June 2021 | 26 |
July 2021 | 33 |
August 2021 | 18 |
September 2021 | 25 |
October 2021 | 34 |
November 2021 | 35 |
December 2021 | 29 |
January 2022 | 18 |
February 2022 | 27 |
March 2022 | 31 |
April 2022 | 25 |
May 2022 | 35 |
June 2022 | 23 |
July 2022 | 30 |
August 2022 | 28 |
September 2022 | 42 |
October 2022 | 19 |
November 2022 | 30 |
December 2022 | 21 |
January 2023 | 21 |
February 2023 | 18 |
March 2023 | 32 |
April 2023 | 34 |
May 2023 | 22 |
June 2023 | 22 |
July 2023 | 12 |
August 2023 | 22 |
September 2023 | 23 |
October 2023 | 34 |
November 2023 | 37 |
December 2023 | 46 |
January 2024 | 47 |
February 2024 | 38 |
March 2024 | 57 |
April 2024 | 37 |
May 2024 | 45 |
June 2024 | 19 |
July 2024 | 36 |
August 2024 | 97 |
September 2024 | 43 |
October 2024 | 21 |
Citations
488 Web of Science
×
Email alerts
Citing articles via
More from Oxford Academic