Predicting genes for orphan metabolic activities using phylogenetic profiles

Lifeng Chen et al. Genome Biol. 2006.

Abstract

Homology-based methods fail to assign genes to many metabolic activities present in sequenced organisms. To suggest genes for these orphan activities we developed a novel method that efficiently combines local structure of a metabolic network with phylogenetic profiles. We validated our method using known metabolic genes in Saccharomyces cerevisiae and Escherichia coli. We show that our method should be easily transferable to other organisms, and that it is robust to errors in incomplete metabolic networks.

Figures

Figure 1

The average phylogenetic correlation between a target gene and all other network genes at a given metabolic network distance. The error bars represent the standard deviation of the average correlation over all possible network gaps. The dashed line shows the background correlation, estimated as the average phylogenetic correlation between metabolic and non-metabolic genes. The average phylogenetic correlation between two genes decreases monotonically with their separation in the network.
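
As a rough illustration of the quantity plotted here, the sketch below computes a Pearson correlation between two phylogenetic profiles. It assumes each profile is a binary presence/absence vector across genomes; the gene names and the six-genome profiles are hypothetical, and the paper's actual profile construction may differ.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: correlation is undefined, treated as 0 here
    return cov / sqrt(vx * vy)

# Hypothetical profiles: 1 = homolog present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0],  # same pattern as geneA
    "geneC": [0, 0, 1, 1, 0, 1],  # exact complement of geneA
}

r_ab = pearson(profiles["geneA"], profiles["geneB"])
r_ac = pearson(profiles["geneA"], profiles["geneC"])
print(r_ab, r_ac)
```

Averaging such pairwise values over all gene pairs at network distance 1, 2, 3 and so on would give one point per distance, which is the kind of curve the caption describes.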

Figure 2

'Fit' test of a candidate gene in a network gap. We use a self-consistent test in which a known gene E4 is removed from the network, leaving a gap in its place. We then (1) place candidate genes in the gap one by one; (2) determine the function value for every candidate gene (Equations 1 to 3); and (3) rank all candidate genes by their function values. The figure shows an example in which the correct gene E4 was ranked sixth.
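
The three steps of the self-consistent test can be sketched as follows. Equations 1 to 3 are not reproduced on this page, so `fit_score` below is only a stand-in (the mean profile correlation with the gap's neighbors), not the authors' function; all profiles and names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def fit_score(candidate, neighbor_profiles):
    """Stand-in score (an assumption, NOT Equations 1 to 3):
    mean correlation between the candidate and the gap's neighbors."""
    return sum(pearson(candidate, p) for p in neighbor_profiles) / len(neighbor_profiles)

def rank_of_removed_gene(removed, candidates, neighbor_profiles):
    """Place every gene in the gap in turn, score it, and return the
    rank (1 = best) of the gene that was removed to create the gap."""
    pool = dict(candidates)
    pool["__removed__"] = removed
    ordered = sorted(pool, key=lambda g: fit_score(pool[g], neighbor_profiles),
                     reverse=True)
    return ordered.index("__removed__") + 1

# Hypothetical data: two neighbors of the gap and three competing candidates.
neighbors = [[1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0]]
removed_gene = [1, 1, 0, 0, 1, 0]
candidates = {
    "c1": [0, 0, 1, 1, 0, 1],
    "c2": [1, 0, 1, 0, 1, 0],
    "c3": [1, 1, 0, 0, 0, 0],
}
rank = rank_of_removed_gene(removed_gene, candidates, neighbors)
print(rank)
```

Repeating this for every known gene in the network yields a distribution of ranks, which is what a cumulative curve of correctly predicted genes would summarize.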

Figure 3

Enzyme predictions based on phylogenetic profiles. (a) The cumulative fraction of correctly predicted genes as a function of rank among all non-metabolic genes. All 6,093 non-metabolic yeast genes plus a known correct gene were ranked using Equation 2. The cumulative distribution is shown for ranks from 1 to 100; the inset shows the same distribution for all ranks. (b) The effect of connection specificity adjustment. Only highly ranked genes (1 to 50) are shown. (c) Comparison of the performance with all non-metabolic genes as candidates to that with only hypothetical genes as candidates for an orphan activity. (d) Predictions for the E. coli metabolic network. The cost function with the parameters optimized for the yeast network showed comparable performance to the cost function with the parameters specifically optimized for the E. coli network.

Figure 4

Importance of metabolic neighborhood for the predictive power of the algorithm. (a) Informative and non-informative gaps. About one-third of the gaps did not allow any discrimination between the correct and average genes (represented by bin 0 in the figure), that is, the function value of the correct gene is equal to or smaller than the function value for average genes determined by Equation 2. The red line shows the average rank of correct genes represented in each bin. Genes filling gaps with higher discrimination ratios are ranked higher by the algorithm. (b) The relationship between the rank of a correct enzyme in a gap and the average correlation of first layer genes around the gap. A metabolic gene for a gap with a high average first layer correlation (>0.5) is usually highly ranked by the prediction algorithm (black line) but the fraction of such gaps is small (red bins).

Figure 5

The algorithm performance using an incomplete metabolic network. We show the algorithm performance for yeast networks with a certain fraction of genes randomly deleted. The performance decrease is gradual as up to 50% of the network nodes are deleted. For example, when half of the network is deleted, we can still predict more than 33% of the correct metabolic genes within the top 50 among all candidate genes, compared to 0.8% by random chance.
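
The 0.8% random-chance figure can be checked with a short calculation. This sketch assumes the candidate pool is the one given in the Figure 3 caption (6,093 non-metabolic yeast genes plus the one correct gene); the variable names are illustrative.

```python
# Assumed candidate pool: 6,093 non-metabolic genes plus the correct gene.
n_candidates = 6093 + 1
top_k = 50

# Probability that a randomly ranked correct gene lands in the top 50.
random_hit_rate = top_k / n_candidates

# Fold improvement of the reported 33% lower bound over random ranking.
enrichment = 0.33 / random_hit_rate

print(f"random: {random_hit_rate:.2%}, enrichment: {enrichment:.1f}x")
```

Under this assumption the reported performance with half the network deleted is roughly 40 times better than random ranking.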

Figure 6

Context-based associations versus the metabolic network distance for the yeast metabolic network. (a) mRNA expression distance. The expression distance is calculated as 1-|correlation|, where correlation is the Spearman's rank correlation between the genes' mRNA expression profiles. Close neighbors in the metabolic network have similar expression profiles. (b) Gene fusion events (Rosetta Stone). The fraction of proteins involved in gene fusion events is shown. Adjacent genes in the network are much more likely to form a Rosetta Stone protein. (c) Phylogenetic profiles. Genes close in the network are more likely to have similar phylogenetic profiles, as measured by Pearson's correlation. (d) Chromosomal distance between genes. The mean physical distances (in kilobase pairs (kbp)) between ORFs are shown. Adjacent genes in the network are significantly closer to each other on yeast chromosomes.
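
The expression distance in panel (a), 1-|correlation| with Spearman's rank correlation, can be sketched in plain Python as below. The expression values are hypothetical, and averaging tied ranks is a common convention assumed here rather than something the caption specifies.

```python
from math import sqrt

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def expression_distance(x, y):
    """1 - |Spearman rank correlation| between two expression profiles."""
    rho = pearson(average_ranks(x), average_ranks(y))
    return 1 - abs(rho)

# Hypothetical expression levels of three genes across five conditions.
gene_a = [0.1, 0.5, 0.9, 1.4, 2.0]
gene_b = [10, 20, 35, 50, 80]   # rises with gene_a
gene_c = [80, 50, 35, 20, 10]   # falls as gene_a rises

d_ab = expression_distance(gene_a, gene_b)
d_ac = expression_distance(gene_a, gene_c)
print(d_ab, d_ac)
```

Because of the absolute value, strongly anti-correlated genes are treated as just as close as strongly co-expressed ones.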

Figure 7

Construction of a network from a list of metabolic reactions. The direct connections are established between the dependency pairs: gene pairs sharing metabolites (M) as reactants or products. An orphan activity (metabolic network gap) is marked by a question mark and surrounded by known metabolic genes. The first and second network layers around the gap are colored yellow and blue, respectively. E, enzyme.
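
A minimal sketch of this construction, under simplifying assumptions: each enzyme is listed with the set of metabolites its reaction uses or produces, every shared metabolite creates a connection, and layers are found by breadth-first search. The reaction list, the `"?"` label for the gap, and the function names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_network(reactions):
    """Connect every pair of enzymes that share at least one metabolite."""
    by_metabolite = defaultdict(set)
    for enzyme, metabolites in reactions.items():
        for m in metabolites:
            by_metabolite[m].add(enzyme)
    neighbors = defaultdict(set)
    for enzymes in by_metabolite.values():
        for a, b in combinations(sorted(enzymes), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

def layers_around(neighbors, start, max_depth=2):
    """Group nodes by network distance from `start` (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    result = defaultdict(set)
    for node, d in dist.items():
        if d > 0:
            result[d].add(node)
    return dict(result)

# Hypothetical linear pathway E1 - E2 - ? - E3 - E4, where "?" is the gap.
reactions = {
    "E1": {"M1", "M2"},
    "E2": {"M2", "M3"},
    "?":  {"M3", "M4"},
    "E3": {"M4", "M5"},
    "E4": {"M5", "M6"},
}
network = build_network(reactions)
layers = layers_around(network, "?")
print(layers)
```

This sketch treats every metabolite equally; very common metabolites would link many otherwise unrelated enzymes, so a real network would likely need some way to down-weight or exclude them.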
