Benchmarking ortholog identification methods using functional genomics data

Tim Hulsen et al. Genome Biol. 2006.

Abstract

Background: The transfer of functional annotations from model organism proteins to human proteins is one of the main applications of comparative genomics. Various methods are used to analyze cross-species orthologous relationships according to an operational definition of orthology. Often the definition of orthology is incorrectly interpreted as a prediction of proteins that are functionally equivalent across species, while in fact it only defines the existence of a common ancestor for a gene in different species. However, it has been demonstrated that orthologs often reveal significant functional similarity. Therefore, the quality of the orthology prediction is an important factor in the transfer of functional annotations (and other related information). To identify protein pairs with the highest possible functional similarity, it is important to qualify ortholog identification methods.

Results: To measure the similarity in function of proteins from different species we used functional genomics data, such as expression data and protein interaction data. We tested several of the most popular ortholog identification methods. In general, we observed a sensitivity/selectivity trade-off: the functional similarity scores per orthologous pair of sequences become higher when the number of proteins included in the ortholog groups decreases.

Conclusion: By combining the sensitivity and the selectivity into an overall score, we show that the InParanoid program is the best ortholog identification method in terms of identifying functionally equivalent proteins.

Figures

Figure 1

Correlation in expression profiles. Correlation in expression patterns between the (a) human-mouse (Hs-Mm) and (b) human-worm (Hs-Ce) orthologous pairs from the benchmarked methods versus the average proteome size. Vertical error bars show the standard deviation around the average correlation coefficient. The trendline is a linear regression fit. Methods whose names carry a fourth letter 'B' after the method name, shown as squares in the graph, are group orthology methods for which only the best-scoring pairs are taken into account.
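The per-method summary plotted in Figure 1 (an average correlation with a standard deviation) can be sketched as follows. This is only an illustration under stated assumptions: the caption says "correlation coefficient" without naming the type, so Pearson correlation is assumed here, and the function names, the dictionary layout of the expression data, and the use of the population standard deviation are all hypothetical choices, not taken from the paper.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length profiles; None if undefined."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def summarize_expression_correlation(pairs, expr_a, expr_b):
    """Mean and standard deviation of the correlation over ortholog pairs.

    pairs  : iterable of (gene_in_species_a, gene_in_species_b)
    expr_a : dict gene -> list of expression values (assumed layout)
    expr_b : dict gene -> list of expression values over matched tissues
    Pairs lacking expression data or with a constant profile are skipped.
    """
    values = []
    for a, b in pairs:
        if a in expr_a and b in expr_b:
            r = pearson(expr_a[a], expr_b[b])
            if r is not None:
                values.append(r)
    if not values:
        return None
    return mean(values), pstdev(values)
```

The sketch presumes that the two profiles are measured over matched conditions (for example, the same tissues in both species); without such matching the correlation is not meaningful.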

Figure 2

Equal InterPro accession number. Conservation of InterPro accession number between the (a) human-mouse (Hs-Mm) and (b) human-worm (Hs-Ce) orthologous pairs from the benchmarked methods versus the average proteome size.
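A minimal sketch of how such a conservation score could be computed is given here. The caption does not define when two proteins count as having an "equal" accession number, so this example assumes that a pair is conserved when both proteins carry exactly the same non-empty set of InterPro accessions; the function name and input layout are hypothetical.

```python
def interpro_conservation(pairs, annotations_a, annotations_b):
    """Fraction of ortholog pairs with identical InterPro accession sets.

    pairs         : iterable of (protein_in_species_a, protein_in_species_b)
    annotations_a : dict protein -> iterable of InterPro accessions
    annotations_b : dict protein -> iterable of InterPro accessions
    Pairs in which either protein has no annotation are not counted.
    Returns None when no pair can be compared.
    """
    comparable = 0
    equal = 0
    for a, b in pairs:
        set_a = set(annotations_a.get(a, ()))
        set_b = set(annotations_b.get(b, ()))
        if not set_a or not set_b:
            continue  # assumption: unannotated proteins are uninformative
        comparable += 1
        if set_a == set_b:
            equal += 1
    return equal / comparable if comparable else None
```

A stricter or looser criterion (for example, at least one shared accession) would change the absolute scores, so the criterion should be fixed before methods are compared.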

Figure 3

Conservation of co-expression. Conservation of co-expression from human-human gene pairs to orthologous (a) mouse-mouse and (b) worm-worm gene pairs from the benchmarked methods versus the average proteome size. Ce, Caenorhabditis elegans; Hs, Homo sapiens; Mm, Mus musculus.

Figure 4

Conservation of gene order. Conservation of gene order from human-human gene pairs to orthologous (a) mouse-mouse and (b) worm-worm gene pairs from the benchmarked methods versus the average proteome size. Ce, Caenorhabditis elegans; Hs, Homo sapiens; Mm, Mus musculus.

Figure 5

Conservation of protein-protein interaction. Conservation of protein-protein interaction from human-human protein pairs to orthologous (a) mouse-mouse and (b) worm-worm protein pairs from the benchmarked methods versus the average proteome size. Ce, Caenorhabditis elegans; Hs, Homo sapiens; Mm, Mus musculus.
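The pairwise benchmarks in Figures 3 to 5 (co-expression, gene order, protein-protein interaction) share one shape: a relation observed between two human genes is checked between their orthologs in the other species. The sketch below illustrates that shape only. The captions do not give the exact scoring rule, so the counting scheme (a human pair counts as conserved if any combination of its orthologs is related in the target species), the function name, and the data layout are assumptions.

```python
def relation_conservation(human_pairs, orthologs, target_pairs):
    """Fraction of mappable human pairs whose relation holds in the target.

    human_pairs  : iterable of (gene1, gene2) related in human
    orthologs    : dict human gene -> iterable of target-species orthologs
    target_pairs : iterable of (gene1, gene2) related in the target species
    Human pairs where either gene lacks an ortholog are skipped.
    Returns None when no human pair can be mapped.
    """
    # Unordered pairs, so (a, b) and (b, a) are treated as the same relation.
    target = {frozenset(pair) for pair in target_pairs}
    mapped = 0
    conserved = 0
    for g1, g2 in human_pairs:
        candidates = {
            frozenset((a, b))
            for a in orthologs.get(g1, ())
            for b in orthologs.get(g2, ())
            if a != b
        }
        if not candidates:
            continue
        mapped += 1
        if candidates & target:
            conserved += 1
    return conserved / mapped if mapped else None
```

Under this counting scheme, larger ortholog groups generate more candidate pairs per human pair, which is one reason to report scores alongside the number of proteins or pairs each method includes.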

Figure 6

Overall scoring graph. The graph is created by summing all normalized benchmarking scores per ortholog identification method. X-axis: the ortholog identification methods, sorted by average proteome size or number of protein pairs. Y-axis: the sum of all five benchmarking scores per ortholog identification method. Red, correlation of expression profiles; green, equal InterPro accession numbers; blue, conservation of co-expression; orange, conservation of gene order; purple, conservation of protein-protein interaction. (a) Human-mouse (Hs-Mm). (b) Human-worm (Hs-Ce).
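The summation described in this caption can be sketched as follows. The caption states only that the scores are normalized before being added, so min-max normalization across methods is an assumption made for illustration, as are the function name and the nested-dictionary input; the actual normalization used for the figure may differ.

```python
def overall_scores(raw_scores):
    """Sum of per-benchmark normalized scores for each method.

    raw_scores : dict method -> dict benchmark -> raw score
    Each benchmark is rescaled to [0, 1] across methods (min-max, an
    assumed scheme); a benchmark on which all methods tie contributes 0.
    Every method is expected to have a score for every benchmark.
    """
    methods = list(raw_scores)
    benchmarks = sorted({b for scores in raw_scores.values() for b in scores})
    totals = {m: 0.0 for m in methods}
    for bench in benchmarks:
        values = [raw_scores[m][bench] for m in methods]
        low, high = min(values), max(values)
        span = high - low
        for m in methods:
            if span > 0:
                totals[m] += (raw_scores[m][bench] - low) / span
    return totals
```

With five benchmarks, the maximum attainable total under this scheme is 5, reached only by a method that scores best on every benchmark.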
