MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score

Leszek P Pryszcz et al. Nucleic Acids Res. 2011 Mar.

Abstract

Reliable prediction of orthology is central to comparative genomics. Approaches based on phylogenetic analyses closely resemble the original definition of orthology and paralogy and are known to be highly accurate. However, the large computational cost associated with these analyses is a limiting factor that often prevents their use at genomic scales. Recently, several projects have addressed the reconstruction of large collections of high-quality phylogenetic trees from which orthology and paralogy relationships can be inferred. This provides us with the opportunity to infer the evolutionary relationships of genes from multiple, independent phylogenetic trees. Using such a strategy, we combine phylogenetic information derived from different databases to predict orthology and paralogy relationships for 4.1 million proteins in 829 fully sequenced genomes. We show that the number of independent sources from which a prediction is made, as well as the level of consistency across predictions, can be used as reliable confidence scores. A webserver has been developed to easily access these data (http://orthology.phylomedb.org), providing users with a global repository of phylogeny-based orthology and paralogy predictions.

Figures

Figure 1.

Parameter evaluation. The accuracy of predictions was investigated by applying various cut-offs for the likelihood filter (A), orthology consistency score (B) and evidence level, EL (C). The harmonic mean of precision and recall (F1.0, with precision and recall equally weighted; see ‘Materials and Methods’ section) was calculated based on a subset of the TreeFam-A reference set for human–mouse, human–zebrafish and human–fruit fly [100 orthogroups as in (24)] and the YGOB reference set for S. cerevisiae–C. glabrata and S. cerevisiae–A. gossypii. For the sets evaluated on the TreeFam-A benchmark, we did not use trees coming from this database.
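The F score used in this evaluation can be illustrated with a short Python sketch. It assumes the standard Fβ definition, Fβ = (1 + β²)·P·R / (β²·P + R), which the captions imply but do not spell out; the function name `f_beta` is hypothetical and not part of MetaPhOrs.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta).

    Assumption: this is the textbook formula, not code from MetaPhOrs.
    beta = 1.0 weights precision and recall equally (F1.0);
    beta = 0.5 makes precision twice as important as recall (F0.5).
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # Both precision and recall are zero: define the score as zero.
        return 0.0
    return (1.0 + b2) * precision * recall / denominator
```

With β = 0.5, a method with high precision and moderate recall scores higher than one with the two values swapped, which is the sense in which precision is weighted more heavily.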

Figure 2.

MetaPhOrs statistics. The orthology assignments for 829 complete genomes were mapped onto the tree of life (NCBI taxonomy tree). Bar charts around the tree represent the fraction of each genome for which orthologs have been identified (green) and the fraction with no orthologs identified (grey). The total length of each bar (grey + green fractions) is proportional to the logarithm of the number of genes in the genome. A higher-resolution, interactive figure showing the coverage of each independent dataset (PhylomeDB, Ensembl, EggNOG, Fungal Orthogroups, COG and TreeFam) is available online (MetaPhOrs Overview at: http://orthology.phylomedb.org/?q=stats). The figure was constructed using iTOL. Detailed statistics of MetaPhOrs and all of its constituent databases are provided in Supplementary Table S2.

Figure 3.

Accuracy of the MetaPhOrs approach using different datasets. Recall and precision scores of our pipeline applied to individual datasets (blue rhombi), combined datasets (red squares) and the full MetaPhOrs approach (black double triangles) were calculated based on the TreeFam-A reference set (see ‘Materials and Methods’ section). Note that the accuracy results do not correspond to the predictions provided by a given repository (e.g. OrthoMCL), but to our phylogeny-based approach applied to trees derived from the data contained in that repository (e.g. the species-overlap algorithm applied to trees derived from OrthoMCL families). In order to avoid circularity in our benchmark, trees coming from TreeFam-A were not considered in any dataset. For the combined methods, predictions from two or more sources were summed together: orthology was assigned if confirmed by at least one repository, and paralogy was assumed only if there were more paralogy signals than orthology signals. For the full MetaPhOrs approach, we used several evidence level (EL) thresholds; for instance, for EL = 2 only predictions confirmed by any combination of two independent sources (phylomes or databases) are taken into account. A consistency threshold (CSo) of 0.5 is applied. Plotted curves represent combinations of recall and precision providing the same Fβ score as the best performing method. Two scenarios are considered: recall and precision are equally weighted (blue thin line, F1.0 = 0.817), or precision is two times more important than recall (grey thick line, F0.5 = 0.837). The methods can be ranked by their relative distance to the curve representing the F score of the best scoring method. MetaPhOrs with an EL cut-off of 2 (MO el = 2; F1.0 = 0.817) and MetaPhOrs with an EL cut-off of 3 (MO el = 3; F1.0 = 0.797) are the best performing approaches in the first scenario. In the second scenario, MetaPhOrs with an EL cut-off of 3 (MO el = 3; F0.5 = 0.837), MetaPhOrs with an EL cut-off of 4 (MO el = 4; F0.5 = 0.824) and MetaPhOrs with an EL cut-off of 2 (MO el = 2; F0.5 = 0.807) perform the best.
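The EL and CSo filters described in this caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MetaPhOrs implementation: it assumes that CSo is the fraction of trees supporting orthology for a gene pair, that EL counts the distinct sources contributing trees for that pair, and that a pair passing the EL filter is called orthologous when CSo reaches the threshold and paralogous otherwise. The names `TreeCall`, `consistency_score`, `evidence_level` and `classify_pair` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TreeCall:
    """One tree's verdict on a gene pair (hypothetical record type)."""
    source: str          # repository or phylome the tree came from
    is_orthology: bool   # True: tree supports orthology; False: paralogy


def consistency_score(calls: Sequence[TreeCall]) -> float:
    """Assumed CSo: fraction of trees that support orthology."""
    if not calls:
        raise ValueError("at least one tree call is required")
    return sum(call.is_orthology for call in calls) / len(calls)


def evidence_level(calls: Sequence[TreeCall]) -> int:
    """Assumed EL: number of distinct sources contributing trees."""
    return len({call.source for call in calls})


def classify_pair(calls: Sequence[TreeCall],
                  min_el: int = 2,
                  min_cs: float = 0.5) -> Optional[str]:
    """Return 'orthologs', 'paralogs', or None if evidence is too thin.

    Assumption: the comparison with min_cs is inclusive (>=); the
    source does not state how ties at the threshold are handled.
    """
    if evidence_level(calls) < min_el:
        return None  # not enough independent sources
    if consistency_score(calls) >= min_cs:
        return "orthologs"
    return "paralogs"
```

For example, a pair supported as orthologous by trees from two of three sources has CSo = 2/3 and EL = 3, so it would pass both the EL = 2 and EL = 3 cut-offs at a CSo threshold of 0.5, but it would be discarded at EL = 4.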

Cited by

Cross-species meta-analysis of transcriptome changes during the morula-to-blastocyst transition: metabolic and physiological changes take center stage.
Schall PZ, Latham KE. Am J Physiol Cell Physiol. 2021 Dec 1;321(6):C913-C931. doi: 10.1152/ajpcell.00318.2021. Epub 2021 Oct 20. PMID: 34669511. Free PMC article.
A Prioritized and Validated Resource of Mitochondrial Proteins in Plasmodium Identifies Unique Biology.
van Esveld SL, Meerstein-Kessel L, Boshoven C, Baaij JF, Barylyuk K, Coolen JPM, van Strien J, Duim RAJ, Dutilh BE, Garza DR, Letterie M, Proellochs NI, de Ridder MN, Venkatasubramanian PB, de Vries LE, Waller RF, Kooij TWA, Huynen MA. mSphere. 2021 Oct 27;6(5):e0061421. doi: 10.1128/mSphere.00614-21. Epub 2021 Sep 8. PMID: 34494883. Free PMC article.
Role of epigenetics in unicellular to multicellular transition in Dictyostelium.
Wang SY, Pollina EA, Wang IH, Pino LK, Bushnell HL, Takashima K, Fritsche C, Sabin G, Garcia BA, Greer PL, Greer EL. Genome Biol. 2021 May 4;22(1):134. doi: 10.1186/s13059-021-02360-9. PMID: 33947439. Free PMC article.
A Workflow for Selection of Single Nucleotide Polymorphic Markers for Studying of Genetics of Ischemic Stroke Outcomes.
Khvorykh G, Khrunin A, Filippenkov I, Stavchansky V, Dergunova L, Limborska S. Genes (Basel). 2021 Feb 25;12(3):328. doi: 10.3390/genes12030328. PMID: 33668793. Free PMC article.
Structure and function of the vacuolar Ccc1/VIT1 family of iron transporters and its regulation in fungi.
Sorribes-Dauden R, Peris D, Martínez-Pastor MT, Puig S. Comput Struct Biotechnol J. 2020 Nov 23;18:3712-3722. doi: 10.1016/j.csbj.2020.10.044. eCollection 2020. PMID: 33304466. Free PMC article. Review.

References

1. Fitch WM. Distinguishing homologous from analogous proteins. Syst. Zool. 1970;19:99–113.
2. Gabaldón T, Dessimoz C, Huxley-Jones J, Vilella AJ, Sonnhammer EL, Lewis S. Joining forces in the quest for orthologs. Genome Biol. 2009;10:403.
3. Kuzniar A, van Ham RC, Pongor S, Leunissen JA. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008;24:539–551.
4. Gabaldón T. Large-scale assignment of orthology: back to phylogenetics? Genome Biol. 2008;9:235.
5. Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldón T. The human phylome. Genome Biol. 2007;8:R109.
