Inparanoid: a comprehensive database of eukaryotic orthologs (original) (raw)

Journal Article

Search for other works by this author on:

Published:

01 January 2005

Cite

Kevin P. O'Brien, Maido Remm, Erik L. L. Sonnhammer, Inparanoid: a comprehensive database of eukaryotic orthologs, Nucleic Acids Research, Volume 33, Issue suppl_1, 1 January 2005, Pages D476–D480, https://doi.org/10.1093/nar/gki107
Close

Navbar Search Filter Mobile Enter search term Search

Abstract

The Inparanoid eukaryotic ortholog database ( http://inparanoid.cgb.ki.se/ ) is a collection of pairwise ortholog groups between 17 whole genomes; Anopheles gambiae , Caenorhabditis briggsae , Caenorhabditis elegans , Drosophila melanogaster , Danio rerio , Takifugu rubripes , Gallus gallus , Homo sapiens , Mus musculus , Pan troglodytes , Rattus norvegicus , Oryza sativa , Plasmodium falciparum , Arabidopsis thaliana , Escherichia coli , Saccharomyces cerevisiae and Schizosaccharomyces pombe . Complete proteomes for these genomes were derived from Ensembl and UniProt and compared pairwise using Blast, followed by a clustering step using the Inparanoid program. An Inparanoid cluster is seeded by a reciprocally best-matching ortholog pair, around which inparalogs (should they exist) are gathered independently, while outparalogs are excluded. The ortholog clusters can be searched on the website using Ensembl gene/protein or UniProt identifiers, annotation text or by Blast alignment against our protein datasets. The entire dataset can be downloaded, as can the Inparanoid program itself.

Recieved August 15, 2004; Revised and Accepted October 18, 2004

INTRODUCTION

The Inparanoid program was developed specifically to identify clusters of true orthologs while avoiding inclusion of closely related but non-orthologous proteins ( 1 ). Homologs which originate following gene duplications are called paralogs, a term in biology often mistakenly thought to apply to homologs within a genome. Paralogy can exist between genes in different species, since gene duplication events occur both before and after speciation. Thus, the term ‘inparalogs’ indicate paralogs that arose through a gene duplication event after speciation, while ‘outparalogs’ arise following a gene duplication preceding speciation ( 1 – 3 ). Outparalogs can never be orthologs, while inparalogs can form a group of genes that together are orthologous to a gene in another species. It is therefore important to distinguish between the two. Clustering inparalogs together allows proper identification of both one-to-one and many-to-many orthology cases ( Figure 1 ). More in-depth information on this subject, the Inparanoid program and its applications has been published previously ( 1 , 2 ).

Figure 1.

A hypothetical gene tree and the resulting Inparanoid clusters are shown to illustrate inparalog (and thus co-ortholog) and outparalog assignments. ( a ) Protein A in an ancestral species ‘A’ undergoes a gene duplication. A speciation event occurs which gives rise to the two lineages leading to species ‘B’ and ‘C’. In the C genome the genes C2 and C3 are inparalogs since their gene duplication occurred after speciation; they are co-orthologous to the B2 gene (one common ancestral protein upon speciation). B1 is an outparalog of the C2 and C3 genes, as are B1 of B2 (duplication and divergence prior to speciation). ( b ) B2 and C2 are the original seed-ortholog pair (all inparalogs are clustered around this pair), thus both receiving an inparalog score of 1.0. Other inparalogs (in this case C3 ) are scored according to their relative similarity to the seed-inparalog (here C2 ). Inparalog score of C3 = (Blast[ C2 : C3 ]−Blast[ C2 : B2 ])/(Blast[ C2 : C2 ]−Blast[ C2 : B2 ]) where Blast[X:Y] is the averaged blast score between X and Y in bits. In this case C2 is relatively more similar to B2 than C3 is, and thus C3 receives a lower inparalog score (0.7). C1 and B1 are orthologous to each other but are outparalogs of the other cluster and thus form a cluster of their own.

Here, we present a new release and a new online database for Inparanoid. The original release was based entirely on Swiss-Prot-TrEMBL ( 4 ) (now called UniProt) due to the quality and quantity of information-curation for these entries ( 1 ). In this release, however, we use Ensembl translation datasets as the main backbone of datasets. This is due to the rapid release and curation of whole genomes/proteomes through the Ensembl pipeline ( 5 ), and the better redundancy control. The current database contains all 16 completely sequenced eukaryotic genomes and Escherichia coli . For users preferring UniProt, we also provide a UniProt-only section that contains six eukaryotic genomes and E.coli .

Algorithmically, only minor changes have been made. As in previous releases we run Blast to compare all species against all, and feed the result to the Inparanoid program, which generates clusters of orthologs. The new website contains the entire old browse and search mechanisms, i.e. by gene/protein identifiers and Blast searching. New features include better layout and download capability of each ortholog group as FASTA sequences or as a multiple alignment for further analysis.

DESIGN AND IMPLEMENTATION

Ensembl-based datasets

The translated peptide sequences of all predicted transcripts, including splice variants, were obtained from the Ensembl resource ( 5 ). Peptide data were obtained for Anopheles gambiae (15 802 transcripts), Caenorhabditis briggsae (14 713 transcripts), Caenorhabditis elegans (22 215 transcripts), Drosophila melanogaster (18 289 transcripts), Danio rerio (30 783 transcripts), Takifugu rubripes (33 003 transcripts), Gallus gallus (28 416 transcripts), Homo sapiens (34 091 transcripts), Mus musculus (32 281 transcripts), Pan troglodytes (38 822 transcripts) and Rattus norvegicus (28 545 transcripts). Owing to the competitive nature of Inparanoid clustering, long and short transcripts from the same gene can end up in different clusters if they exist in more than one species. Thus, only the longest transcript from each gene was used. The number of genes/proteins can be seen in Table 1 . Gene names were obtained from informative Ensembl entry descriptions. Entries lacking adequate descriptions were named using either a corresponding UniProt entry description, an external database name or an Ensembl family name, in that order of preference. In addition, peptide data were obtained from the relevant resources for Oryza sativa ( 6 ) and Plasmodium falciparum ( 7 ). The Arabidopsis thaliana , E.coli , Saccharomyces cerevisiae and Schizosaccharomyces pombe sequences that were used in this section were obtained from UniProt ( 8 ) as described below.

Table 1.

Total number of orthologs in Inparanoid

The number of orthologs in an organism ( y -axis) when clustered with another genome ( x -axis) is shown. Thus, 5623 G.gallus genes have orthologs in C.elegans , which are orthologous to a total of 5096 C.elegans genes. ‘Protein dataset’ indicates the total translation/protein set obtained. ‘Proteins analyzed’ indicates the number of proteins (for Ensembl entries indicates number of genes) used in Inparanoid clustering.

Table 1.

Total number of orthologs in Inparanoid

‘UniProt-only’ datasets

The complete protein sequence database from UniProt was obtained from ftp.ebi.ac.uk ( 8 ). Relevant organisms were extracted according to the taxonomy ID number and converted into multiple-FASTA format using the SWISS module of the BIOPERL package ( 9 ). In a step aimed at reducing the redundancy observed between TrEMBL and Swiss-Prot, all proteins with 100% matches to other proteins over the full length were removed, with preference shown to Swiss-Prot entries. TrEMBL proteins with 99–100% matches were removed only if the protein name matched the Swiss-Prot entry or if they were fragment sequences. It should be mentioned that this is a reduced dataset and protein duplicates exist, but lowering the match cutoff would result in inparalog deletion. The dataset size for each organism used in this study following this processing step was; A.thaliana (34 170), E.coli (8901), C.elegans (20 627), H.sapiens (36 379), S.cerevisiae (6706), S.pombe (5187), M.musculus (34 499) and D.melanogaster (18 932).

Inparanoid clustering

Whole genome NCBI Blast ( 10 ) comparisons using these datasets were performed between each pair of species. The Blast output (organism A → organism B, organism B → organism A, organism A → organism A and organism B → organism B) was used as the input for the Inparanoid program as described previously ( 1 ). The SQL table and HTML output from these 136 Inparanoid all-against-all pairwise analyses (157 including ‘UniProt-only’ analyses) make up the dataset of the present web database.

INPARANOID DATABASE CONTENT

A summary of the data in Inparanoid is shown in Table 1 . This table shows the organism of interest on the left and the organism with which it is being clustered across the top. One should take note that this is not a symmetrical table, since more duplications could have occurred since speciation in organism ‘A’ when compared to organism ‘B’ and thus the number of ‘A’ genes that have orthologs in organism ‘B’ can differ substantially to the number of ‘B’ genes that have orthologs in organism ‘A’.

The sizes of clusters were found to be smaller in the Ensembl datasets compared to previous Inparanoid versions, i.e. those based on UniProt-only (and where no duplicates were removed). In addition, there is also more redundancy in the ‘UniProt-only’ datasets than in Ensembl. For example, in the previous version of Inparanoid (version 2.5; no UniProt duplicate removal), there were on average 2.75 human inparalogs per ortholog group when comparing to D.melanogaster . In the present versions this number is reduced to 1.64, and in the Uniprot-only dataset to 2.44. Previously, using UniProt, there was no robust way to confirm if two entries belonged to the same gene, i.e. if they were isoforms or allele variants. The use of Ensembl entries solves this problem, since Ensembl is a DNA sequence-based database and thus all protein and transcript information has a corresponding gene. The lowest average cluster size in Inparanoid was seen in comparisons between the H.sapiens and P.troglodytes datasets (2.04 chimpanzee and human genes/cluster) and the C.briggsae and C.elegans datasets (2.10 worm genes/cluster). A clear trend seen was that the larger the evolutionary distance, the larger the average cluster size (data not shown).

The datasets which were generated in-house are also supplemented by annotation data from both UniProt and Ensembl. Gene/protein identifiers, other external identifiers and full names/descriptions of each gene are used in the text search tool, as well as information pertaining to the source of a protein's annotation, e.g. in cases where an Ensembl translation derives its name from similarity to a UniProt entry.

WEB INTERFACE

The Inparanoid online database http://inparanoid.cgb.ki.se is organized into two main areas: Ensembl and UniProt. Since the focus of this web tool is to provide an ortholog resource for newly released genomes of interest, our focus has shifted toward Ensembl. The ‘UniProt-only’ section is maintained for those who wish to continue using Inparanoid with exclusively Uniprot proteins, since not all UniProt protein entries are annotated/linked to entries in Ensembl (especially those derived from TrEMBL). It is completely independent of the main Ensembl-based dataset and depending on usage-loads, it may be removed altogether in the future. Its tools are very similar to the main section of Inparanoid, differing mainly in format. Thus, it will not be discussed further here.

The database can be accessed in several ways. The first section, ‘Human vs All’ allows the user to select an organism to display all Inparanoid clusters between it and human. The dataset being displayed is quite large and therefore it may take a few moments to load. The second section ‘All species vs All’ is similar except that one can freely choose which two organisms to pick. As on date, there is no difference in the datasets accessed by these two tools, but it is planned to include many more organisms in the ‘Human vs All’ tool to allow a greater flexibility in scaling up the Inparanoid database to include scores of organisms as they become available.

The next three tools take a different approach as they first identify a gene before proceeding to see whether it has orthologs in other organisms. ‘Gene Search’ requires an identifier from Ensembl, Flybase, Uniprot, a locus identifier for the rice genome or an accession number for plasmodium cDNA libraries to select a gene in an organism of choosing. The Inparanoid score cutoff can be raised to exclude borderline inparalog cases (outparalogs are automatically excluded from clusters by the Inparanoid program). A sample output can be seen in Figure 2 . Each table represents a separate Inparanoid cluster, which are genes thought to derive from a single ancestral gene at the speciation. ‘Text search’ is a more flexible search which first outputs a list of genes whose annotation matches the query text string. Clicking on the ‘Search for Clusters’ icon then queries the Inparanoid database for ortholog clusters in all organisms ( Figure 3 ). The figure also demonstrates that one can obtain a multiple FASTA file and a multiple alignment generated using Kalign, a rapid multiple-alignment generator developed in-house (T. Lassmann and E.L.L. Sonnhammer, manuscript in preperation) This last feature is useful to verify the correctness of the cluster that Inparanoid has generated. The last tool ‘Blast Search’ allows one to enter a sequence to Blast against the protein datasets used in the creation of this database. The output provides a list of best-hits; as before, clicking on the ‘Search for Clusters’ icon queries the database for clusters.

Figure 2.

An Inparanoid cluster is a representation of genes thought to share a single ancestral gene upon speciation. In this example output only human–mosquito and human–worm clusters are shown. In the human–worm cluster, the two human genes are inparalogs, i.e. resulted from a gene duplication after the speciation from worm occurred. They are thus both co-orthologous to the worm gene. The Inparanoid score is a measure of how similar an inparalog is to the inparalog that is the main ortholog. If they are identical the score is 1.0, but as the similarity drops towards the similarity of the main orthologs, the score goes to 0.0 (see Figure 1 ). For example, ENSP00000343386 is less similar to the C.briggsae gene than is ENSP00000322439, but both these human genes are still co-orthologous to it. The bootstrapping score is a measure of how reliably that gene is the main ortholog. Gene and protein identifiers are hyperlinks to the relevant databases for each species.

Figure 3.

Using a gene name as a search generates a list of possible gene hits. Clicking on the ‘Search for Clusters’ icon queries the database as to whether a gene occurs in an Inparanoid-cluster in all organisms. Clicking on the cluster name generates a multiple FASTA file for this cluster and performs a multiple alignment using the Kalign program. These can be used to check the validity of the cluster in question and can be saved to disk.

DATA AVAILABILITY

The Inparanoid program and all other required resources are available for download and can be run locally. In addition to the data which is available for search/browse using the web interface, FASTA files containing all proteins, protein description files, SQL tables and HTML output from each pairwise Inparanoid analysis are available for download.

FURTHER DEVELOPMENTS

We plan to update Inparanoid on a quarterly basis. As mentioned above, the most pertinent update is to increase the number of organisms analyzed. As the rate at which new genomes become sequenced increases, it remains difficult to maintain the ‘All vs All’ approach for all organisms as this would soon render each Inparanoid update an impossibly large task. Thus, the more manageable ‘Human vs All’ section is expected to be the main update target, with the possibility of including additional sections, e.g. ‘Mouse vs All’, depending on demand. A visualization tool that can show ortholog clusters together with closely related outparalogs may be useful for examining the broader evolutionary histories of genes and their orthologs and is under consideration.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact journals.permissions@oupjournals.org .

This study was funded by grants from the Swedish research council, Karolinska Institutet and Pfizer Corporation.

REFERENCES

Remm,M., Storm,C.E. and Sonnhammer,E.L. (

2001

) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.

J. Mol. Biol.

314

1041

–1052.

O'Brien,K.P., Westerlund,I. and Sonnhammer,E.L. (

2004

) OrthoDisease: a database of human disease orthologs.

Hum. Mutat.

112

–119.

Sonnhammer,E.L. and Koonin,E.V. (

2002

) Orthology, paralogy and proposed classification for paralog subtypes.

Trends Genet.

619

–620.

Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al . (

2003

) The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003.

Nucleic Acids Res.

365

–370.

Birney,E., Andrews,D., Bevan,P., Caccamo,M., Cameron,G., Chen,Y., Clarke,L., Coates,G., Cox,T., Cuff,J. et al . (

2004

) Ensembl 2004.

Nucleic Acids Res.

D468

–D470.

Yuan,Q., Ouyang,S., Liu,J., Suh,B., Cheung,F., Sultana,R., Lee,D., Quackenbush,J. and Buell,C.R. (

2003

) The TIGR rice genome annotation resource: annotating the rice genome and creating resources for plant biologists.

Nucleic Acids Res.

229

–233.

Gardner,M.J., Hall,N., Fung,E., White,O., Berriman,M., Hyman,R.W., Carlton,J.M., Pain,A., Nelson,K.E., Bowman,S. et al . (

2002

) Genome sequence of the human malaria parasite Plasmodium falciparum .

Nature

419

498

–511.

Apweiler,R., Bairoch,A., Wu,C.H., Barker,W.C., Boeckmann,B., Ferro,S., Gasteiger,E., Huang,H., Lopez,R., Magrane,M. et al . (

2004

) UniProt: the Universal Protein knowledgebase.

Nucleic Acids Res.

(Database issue),

D115

–D119.

Stajich,J.E., Block,D., Boulez,K., Brenner,S.E., Chervitz,S.A., Dagdigian,C., Fuellen,G., Gilbert,J.G., Korf,I., Lapp,H. et al . (

2002

) The Bioperl toolkit: Perl modules for the life sciences.

Genome Res.

1611

–1618.

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (

1997

) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Nucleic Acids Res.

3389

–3402.

Author notes

Center for Genomics and Bioinformatics, Karolinska Institutet, S-171 77 Stockholm, Sweden and 1Estonian Biocentre and Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Estonia

I agree to the terms and conditions. You must accept the terms and conditions.

Submit a comment

Name

Affiliations

Comment title

Comment

You have entered an invalid code

Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.

Citations

Views

Altmetric

Metrics

Total Views 3,881

2,997 Pageviews

884 PDF Downloads

Since 11/1/2016

Month:	Total Views:
November 2016	1
December 2016	4
January 2017	17
February 2017	20
March 2017	18
April 2017	13
May 2017	19
June 2017	23
July 2017	17
August 2017	16
September 2017	10
October 2017	12
November 2017	35
December 2017	54
January 2018	48
February 2018	59
March 2018	47
April 2018	39
May 2018	51
June 2018	62
July 2018	62
August 2018	65
September 2018	47
October 2018	53
November 2018	32
December 2018	33
January 2019	31
February 2019	38
March 2019	51
April 2019	40
May 2019	48
June 2019	44
July 2019	44
August 2019	52
September 2019	52
October 2019	43
November 2019	49
December 2019	46
January 2020	28
February 2020	35
March 2020	50
April 2020	22
May 2020	24
June 2020	38
July 2020	39
August 2020	51
September 2020	39
October 2020	54
November 2020	38
December 2020	46
January 2021	24
February 2021	64
March 2021	65
April 2021	40
May 2021	31
June 2021	50
July 2021	24
August 2021	27
September 2021	31
October 2021	38
November 2021	39
December 2021	47
January 2022	40
February 2022	29
March 2022	52
April 2022	23
May 2022	34
June 2022	57
July 2022	40
August 2022	42
September 2022	39
October 2022	45
November 2022	30
December 2022	38
January 2023	40
February 2023	49
March 2023	99
April 2023	47
May 2023	54
June 2023	28
July 2023	65
August 2023	58
September 2023	31
October 2023	41
November 2023	33
December 2023	52
January 2024	47
February 2024	60
March 2024	68
April 2024	57
May 2024	45
June 2024	38
July 2024	34
August 2024	65
September 2024	40
October 2024	22

Citations

563 Web of Science

Inparanoid: a comprehensive database of eukaryotic orthologs (original) (raw)

Cite

Abstract

INTRODUCTION