PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences

Comparative Study

Saurabh Sinha et al. BMC Bioinformatics. 2004.

Abstract

Background: This paper addresses the problem of discovering transcription factor binding sites in heterogeneous sequence data, which includes regulatory sequences of one or more genes, as well as their orthologs in other species.

Results: We propose an algorithm that integrates two important aspects of a motif's significance - overrepresentation and cross-species conservation - into one probabilistic score. The algorithm allows the input orthologous sequences to be related by any user-specified phylogenetic tree. It is based on the Expectation-Maximization technique, and scales well with the number of species and the length of input sequences. We evaluate the algorithm on synthetic data, and also present results for data sets from yeast, fly, and human.

Conclusions: The results demonstrate that the new approach improves motif discovery by exploiting multiple species information.

Figures

Figure 1

Orthologous promoters and blocks of sequence conservation. Shaded areas represent ungapped aligned blocks. σ₁ is the reference species. (a) Alignment of input sequences and extraction of blocks. (b) Reorganization of input sequences.

Figure 2

Effect of varying the number of species (K) on motif-finding performance. The x-axis is the relative entropy (R) of the planted motif. Each point is an average over 10 experiments with synthetic data. (μ_b = 0.3, μ_m = 0.1.)
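The relative entropy R of a motif, used as the x-axis in Figure 2, can be illustrated with a short sketch. It assumes the common definition (the sum over motif columns of the Kullback-Leibler divergence from a background distribution), a uniform background, and base-2 logarithms; the abstract does not state which background or log base the authors used, and the function name `relative_entropy` is hypothetical.

```python
import math

def relative_entropy(pwm, background=None):
    """Relative entropy of a position weight matrix against a background.

    pwm: list of columns, each a dict mapping base -> probability.
    background: dict mapping base -> probability (assumed uniform if omitted).
    Returns the sum over columns of sum_b p(b) * log2(p(b) / q(b)).
    """
    if background is None:
        # Assumption: uniform background over the four nucleotides.
        background = {base: 0.25 for base in "ACGT"}
    total = 0.0
    for column in pwm:
        for base, p in column.items():
            if p > 0:  # terms with p == 0 contribute nothing
                total += p * math.log2(p / background[base])
    return total

# Six perfectly conserved positions give 2 bits each under a uniform background.
conserved = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 6
print(relative_entropy(conserved))
```

Under these assumptions a fully degenerate column contributes 0 and a perfectly conserved column contributes 2, so lower R corresponds to a weaker, harder-to-find planted motif.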

Figure 3

Effect of varying background and motif mutation rates (μ_b and μ_m respectively) on motif-finding performance. Each point is an average over 10 experiments with synthetic data. (K = 3, R = 12.)

Figure 4

Effect of the alignment step on motif-finding performance. The x-axis shows how many of the orthologous pairs of planted motifs are artificially unpaired in the alignment step. Each solid line represents a separate experiment. The squares plot the average score over eight experiments.

Figure 5

Effect of multiple-species information on motif discovery in the regulons RAP1, MIG1, CAR1, PHO4 and MCM1 in yeast. The y-axis plots the number of matches to the known motif among the top η reported occurrences, where η is the number of known sites (plotted as "KNOWN"). Only matches in S. cerevisiae are considered.
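The y-axis measure in Figure 5 can be sketched as follows. This is only an illustration: it assumes that a reported occurrence counts as a match when it overlaps a known site on the same sequence, which is a stand-in criterion, since the abstract does not say how a match is decided. The function name and the tuple layouts are hypothetical.

```python
def matches_in_top_eta(predictions, known_sites):
    """Count matches among the top-eta predictions, with eta = number of known sites.

    predictions: list of (score, seq_id, start, end) tuples, end exclusive.
    known_sites: list of (seq_id, start, end) tuples, end exclusive.
    Assumption: a prediction matches if it overlaps any known site
    on the same sequence by at least one position.
    """
    eta = len(known_sites)
    # Keep the eta highest-scoring reported occurrences.
    top = sorted(predictions, key=lambda p: p[0], reverse=True)[:eta]

    def overlaps(pred):
        _, seq_id, start, end = pred
        return any(
            seq_id == k_seq and start < k_end and k_start < end
            for k_seq, k_start, k_end in known_sites
        )

    return sum(1 for pred in top if overlaps(pred))

known = [("g1", 10, 20), ("g2", 5, 15)]
preds = [(0.9, "g1", 12, 22), (0.8, "g2", 40, 50), (0.7, "g2", 6, 16)]
print(matches_in_top_eta(preds, known))
```

Capping the list at η means the count can never exceed the number of known sites, which makes the "KNOWN" line an upper bound for every method being compared.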

Figure 6

Comparison of PhyME to 1-species and 2-species MEME, and to PhyloGibbs and EMnEM, for fly enhancers. The parenthetical number next to an enhancer name is the number of strong occurrences of the known weight matrix in the D. melanogaster sequence.

Figure 7

Results on the human SP1 regulon. (a) The known motif. (b) Motif reported by PhyME, using mouse and rat orthologs. (c) The phylogenetic tree used by PhyME.

Figure 8

Results on the human c-Jun regulon. (a) The known motif. (b) Motif reported by PhyME, using mouse and rat orthologs.
