MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score


Leszek P Pryszcz et al. Nucleic Acids Res. 2011 Mar.

Abstract

Reliable prediction of orthology is central to comparative genomics. Approaches based on phylogenetic analyses closely resemble the original definition of orthology and paralogy and are known to be highly accurate. However, the large computational cost associated with these analyses is a limiting factor that often prevents their use at genomic scales. Recently, several projects have addressed the reconstruction of large collections of high-quality phylogenetic trees from which orthology and paralogy relationships can be inferred. This provides us with the opportunity to infer the evolutionary relationships of genes from multiple, independent phylogenetic trees. Using such a strategy, we combine phylogenetic information derived from different databases to predict orthology and paralogy relationships for 4.1 million proteins in 829 fully sequenced genomes. We show that the number of independent sources from which a prediction is made, as well as the level of consistency across predictions, can be used as reliable confidence scores. A webserver has been developed to easily access these data (http://orthology.phylomedb.org), which provides users with a global repository of phylogeny-based orthology and paralogy predictions.


Figures

Figure 1.


Parameter evaluation. The accuracy of predictions was investigated by applying various cut-offs for the likelihood filter (A), the orthology consistency score (B) and the evidence level, EL (C). The harmonic mean of precision and recall (F1.0, precision and recall equally weighted, see ‘Materials and Methods’ section) was calculated based on a subset of the TreeFam-A reference set for human–mouse, human–zebrafish and human–fruit fly [100 orthogroups as in (24)] and on the YGOB reference set for S. cerevisiae–C. glabrata and S. cerevisiae–A. gossypii. For the sets evaluated on the TreeFam-A benchmark, we did not use trees coming from this database.
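
As a rough illustration of the score used in this caption, the sketch below computes the standard Fβ measure from precision and recall. It assumes the textbook definition, Fβ = (1 + β²)·P·R / (β²·P + R), which reduces to the harmonic mean for β = 1; the paper's ‘Materials and Methods’ section is not reproduced here, so the function name `f_beta` and its exact form are illustrative assumptions rather than the authors' code.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score (assumed definition, not taken from the paper).

    beta = 1.0 weights precision and recall equally (harmonic mean);
    beta = 0.5 gives more weight to precision than to recall.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the effect of beta.
    print(round(f_beta(1.0, 0.5, beta=1.0), 4))
    print(round(f_beta(1.0, 0.5, beta=0.5), 4))
```

With β = 0.5, a method that is precise but incomplete scores higher than one with the same two values swapped, which is the sense in which F0.5 favours precision.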

Figure 2.


MetaPhOrs statistics. The orthology assignments for 829 complete genomes were mapped onto the tree of life (NCBI taxonomy tree). Bar charts around the tree represent the fraction of each genome for which orthologs have been identified (green) and with no orthologs identified (grey). The total length of each bar (grey + green fractions) is proportional to the logarithm of the number of genes in the genome. A higher resolution, interactive figure, showing the coverage of each independent dataset (PhylomeDB, Ensembl, EggNOG, Fungal Orthogroups, COG and TreeFam), is available online (MetaPhOrs Overview at http://orthology.phylomedb.org/?q=stats). The figure was constructed using iTOL. Detailed statistics of MetaPhOrs and all of its subsequent databases are provided in Supplementary Table S2.

Figure 3.


Accuracy of the MetaPhOrs approach using different datasets. Recall and precision scores of our pipeline applied to individual datasets (blue diamonds), combined datasets (red squares) and the full MetaPhOrs approach (black double triangles) were calculated based on the TreeFam-A reference set (see ‘Materials and Methods’ section). Note that the accuracy results do not correspond to the predictions provided by a given repository (e.g. OrthoMCL), but to our phylogeny-based approach applied to trees derived from the data contained in that repository (e.g. the species-overlap algorithm applied to trees derived from OrthoMCL families). To avoid circularity in our benchmark, trees coming from TreeFam-A were not considered in any dataset. For the combined methods, predictions from two or more sources were summed together: orthology was assigned if confirmed by at least one repository, and paralogy was assumed only if there were more paralogy signals than orthology signals. For the full MetaPhOrs approach, we used several evidence level (EL) thresholds; for instance, for EL = 2 only predictions confirmed by any combination of 2 independent sources (phylomes or databases) are taken into account. A consistency threshold (CSo) of 0.5 is applied. Plotted curves represent combinations of recall and precision providing the same Fβ score as the best performing method. Two scenarios are considered: recall and precision are equally weighted (blue thin line, F1.0 = 0.817), or precision is two times more important than recall (grey thick line, F0.5 = 0.837). The ranking of the best methods can be defined based on the relative distance of each method to the curve representing the F score of the best scoring method. MetaPhOrs with an EL cut-off of 2 (MO el = 2; F1.0 = 0.817) and MetaPhOrs with an EL cut-off of 3 (MO el = 3; F1.0 = 0.797) are the best performing approaches in the first scenario. In the second scenario, MetaPhOrs with an EL cut-off of 3 (MO el = 3; F0.5 = 0.837), MetaPhOrs with an EL cut-off of 4 (MO el = 4; F0.5 = 0.824) and MetaPhOrs with an EL cut-off of 2 (MO el = 2; F0.5 = 0.807) perform best.
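
The filtering described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the MetaPhOrs implementation: it assumes that the evidence level (EL) is the number of distinct sources with a tree covering a gene pair, and that the orthology consistency score (CSo) is the fraction of those trees that support orthology. The function names, the input layout and the use of `>=` for both thresholds are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# One call per tree: (source name, "orthology" or "paralogy").
Call = Tuple[str, str]


def summarize_pair(calls: Iterable[Call]) -> Tuple[int, float]:
    """Return (evidence level, orthology consistency score) for one gene pair.

    Assumption: EL counts distinct sources; CSo is the share of
    tree-based calls that support orthology.
    """
    calls = list(calls)
    sources = {source for source, _ in calls}
    counts = Counter(call for _, call in calls)
    total = counts["orthology"] + counts["paralogy"]
    cs_o = counts["orthology"] / total if total else 0.0
    return len(sources), cs_o


def passes_filter(calls: Iterable[Call], min_el: int = 2, min_cs: float = 0.5) -> bool:
    """True if the pair would be kept as an orthology prediction."""
    el, cs_o = summarize_pair(calls)
    return el >= min_el and cs_o >= min_cs


if __name__ == "__main__":
    # Hypothetical calls for a single gene pair.
    example = [
        ("PhylomeDB", "orthology"),
        ("Ensembl", "orthology"),
        ("EggNOG", "paralogy"),
    ]
    print(summarize_pair(example))
    print(passes_filter(example, min_el=2), passes_filter(example, min_el=4))
```

Raising `min_el` discards pairs supported by few sources, which is consistent with the trade-off in the caption: stricter EL cut-offs rank better when precision is weighted more heavily than recall.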

