Benchmarking ortholog identification methods using functional genomics data

Tim Hulsen et al. Genome Biol. 2006.

Abstract

Background: The transfer of functional annotations from model organism proteins to human proteins is one of the main applications of comparative genomics. Various methods are used to analyze cross-species orthologous relationships according to an operational definition of orthology. Often the definition of orthology is incorrectly interpreted as a prediction of proteins that are functionally equivalent across species, while in fact it only defines the existence of a common ancestor for a gene in different species. However, it has been demonstrated that orthologs often reveal significant functional similarity. Therefore, the quality of the orthology prediction is an important factor in the transfer of functional annotations (and other related information). To identify protein pairs with the highest possible functional similarity, it is important to qualify ortholog identification methods.

Results: To measure the similarity in function of proteins from different species we used functional genomics data, such as expression data and protein interaction data. We tested several of the most popular ortholog identification methods. In general, we observed a sensitivity/selectivity trade-off: the functional similarity scores per orthologous pair of sequences become higher when the number of proteins included in the ortholog groups decreases.

Conclusion: By combining the sensitivity and the selectivity into an overall score, we show that the InParanoid program is the best ortholog identification method in terms of identifying functionally equivalent proteins.

Figures

Figure 1

Correlation in expression profiles. Correlation in expression patterns between the (a) human-mouse (Hs-Mm) and (b) human-worm (Hs-Ce) orthologous pairs from the benchmarked methods versus the average proteome size. Vertical error bars show the standard deviation from the average correlation coefficient. The trendline shown is a linear regression trendline. The methods having a fourth letter 'B' behind the method name, shown as squares in the graph, are group orthology methods in which only the best scoring pairs are taken into account.
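The per-method summary plotted in this figure (an average correlation coefficient with its standard deviation) could be computed roughly as follows. This is only a sketch: it assumes Pearson correlation, which the caption does not specify, and the function names, gene identifiers, and expression values are hypothetical placeholders rather than data from the study.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles (assumed metric)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


def summarize_method(pairs, expr_a, expr_b):
    """Mean and standard deviation of the correlation over one method's ortholog pairs.

    pairs  : list of (gene_in_species_a, gene_in_species_b) predicted by a method
    expr_a : dict gene -> expression values across matched tissues (species A)
    expr_b : dict gene -> expression values across the same tissues (species B)
    Pairs lacking expression data in either species are skipped.
    """
    coeffs = [
        pearson(expr_a[a], expr_b[b])
        for a, b in pairs
        if a in expr_a and b in expr_b
    ]
    spread = stdev(coeffs) if len(coeffs) > 1 else 0.0
    return mean(coeffs), spread


# Hypothetical toy data: two ortholog pairs measured in four matched tissues.
human = {"H1": [1, 2, 3, 4], "H2": [4, 3, 2, 1]}
mouse = {"M1": [2, 4, 6, 8], "M2": [1, 2, 3, 4]}
avg, sd = summarize_method([("H1", "M1"), ("H2", "M2")], human, mouse)
```

Running the same summary for each method's pair list gives one point (with its error bar) per method, which can then be plotted against the average proteome size.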

Figure 2

Equal InterPro accession number. Conservation of InterPro accession number between the (a) human-mouse (Hs-Mm) and (b) human-worm (Hs-Ce) orthologous pairs from the benchmarked methods versus the average proteome size.

Figure 3

Conservation of co-expression. Conservation of co-expression from human-human gene pairs to orthologous (a) mouse-mouse and (b) worm-worm gene pairs from the benchmarked methods versus the average proteome size. Ce, Caenorhabditis elegans; Hs, Homo sapiens; Mm, Mus musculus.

Figure 4

Conservation of gene order. Conservation of gene order from human-human gene pairs to orthologous (a) mouse-mouse and (b) worm-worm gene pairs from the benchmarked methods versus the average proteome size. Ce, Caenorhabditis elegans; Hs, Homo sapiens.

Figure 5

Conservation of protein-protein interaction. Conservation of protein-protein interaction from human-human protein pairs to orthologous (a) mouse-mouse and (b) worm-worm protein pairs from the benchmarked methods versus the average proteome size. Ce, Caenorhabditis elegans; Hs, Homo sapiens.

Figure 6

Overall scoring graph. Overall scoring graph, created by adding up all normalized benchmarking scores per ortholog identification method. X-axis, the ortholog identification methods, sorted by average proteome size or number of protein pairs; Y-axis, the sum of all five benchmarking scores per ortholog identification method. Red, correlation of expression profiles; green, equal InterPro accession numbers; blue, conservation of co-expression; orange, conservation of gene order; purple, conservation of protein-protein interaction. (a) Human-mouse (Hs-Mm). (b) Human-worm (Hs-Ce).
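The overall score described in this caption is a sum of normalized benchmark scores per method. A minimal sketch of that aggregation is given below. The caption does not state how the scores were normalized, so dividing each benchmark by its best-scoring method is an assumption made here for illustration; the method and benchmark names are hypothetical placeholders.

```python
def overall_scores(raw):
    """Sum normalized benchmark scores per method.

    raw : dict benchmark_name -> dict method_name -> raw score
    Normalization (an assumption): each benchmark is scaled so that its
    best-scoring method gets 1.0, which puts all benchmarks on a common scale.
    Returns dict method_name -> summed normalized score.
    """
    totals = {}
    for by_method in raw.values():
        top = max(by_method.values())
        for method, score in by_method.items():
            normalized = score / top if top > 0 else 0.0
            totals[method] = totals.get(method, 0.0) + normalized
    return totals


# Hypothetical raw scores for two benchmarks and two methods.
raw = {
    "expression_correlation": {"method_A": 0.5, "method_B": 0.25},
    "ppi_conservation": {"method_A": 10, "method_B": 40},
}
totals = overall_scores(raw)
best = max(totals, key=totals.get)
```

With all five benchmarks supplied, each method's total corresponds to the height of its stacked bar, and the method with the largest total ranks first.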

Cited by

A new chromosome-level genome assembly and annotation of Cryptosporidium meleagridis.
Penumarthi LR, Baptista RP, Beaudry MS, Glenn TC, Kissinger JC. bioRxiv [Preprint]. 2024 Feb 17:2024.02.16.580748. doi: 10.1101/2024.02.16.580748. PMID: 38405792. Free PMC article.
Hybrid Deep Learning Based on a Heterogeneous Network Profile for Functional Annotations of Plasmodium falciparum Genes.
Suratanee A, Plaimas K. Int J Mol Sci. 2021 Sep 16;22(18):10019. doi: 10.3390/ijms221810019. PMID: 34576183. Free PMC article.
Genome-Wide Analysis of Four Pathotypes of Wheat Rust Pathogen (Puccinia graminis) Reveals Structural Variations and Diversifying Selection.
Kiran K, Rawal HC, Dubey H, Jaswal R, Bhardwaj SC, Deshmukh R, Sharma TR. J Fungi (Basel). 2021 Aug 27;7(9):701. doi: 10.3390/jof7090701. PMID: 34575739. Free PMC article.
KinOrtho: a method for mapping human kinase orthologs across the tree of life and illuminating understudied kinases.
Huang LC, Taujale R, Gravel N, Venkat A, Yeung W, Byrne DP, Eyers PA, Kannan N. BMC Bioinformatics. 2021 Sep 18;22(1):446. doi: 10.1186/s12859-021-04358-3. PMID: 34537014. Free PMC article.
Domestication Shapes the Community Structure and Functional Metagenomic Content of the Yak Fecal Microbiota.
Fu H, Zhang L, Fan C, Liu C, Li W, Li J, Zhao X, Jia S, Zhang Y. Front Microbiol. 2021 Mar 31;12:594075. doi: 10.3389/fmicb.2021.594075. eCollection 2021. PMID: 33897627. Free PMC article.
