CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling

Qing Zhou et al. Proc Natl Acad Sci U S A. 2004.

Abstract

The regulatory information for a eukaryotic gene is encoded in cis-regulatory modules. The binding sites for a set of interacting transcription factors have the tendency to colocalize to the same modules. Current de novo motif discovery methods do not take advantage of this knowledge. We propose a hierarchical mixture approach to model the cis-regulatory module structure. Based on the model, a new de novo motif-module discovery algorithm, CisModule, is developed for the Bayesian inference of module locations and within-module motif sites. Dynamic programming-like recursions are developed to reduce the computational complexity from exponential to linear in sequence length. By using both simulated and real data sets, we demonstrate that CisModule is not only accurate in predicting modules but also more sensitive in detecting motif patterns and binding sites than standard motif discovery methods are.

Figures

Fig. 1.

Specification of the HMx model. (A) Unaligned motif sites (triangles indexed by 1, 2, ..., 5). (B) The aligned motif sites can be represented by a product multinomial model or, equivalently, by a PWM. Each binding site is regarded as a realization of a sequence of independent random variables $X_1 X_2 \cdots X_w$, where each $X_i$ ($i = 1, \ldots, w$) follows a multinomial distribution over the four letters {A, C, G, T} with probabilities $\theta_i = [\theta_i(A), \theta_i(C), \theta_i(G), \theta_i(T)]$. The whole motif is thus specified by a set of multinomial probabilities $\Theta = [\theta_1, \theta_2, \ldots, \theta_w]$. (C) The cis-regulatory regions of coregulated genes are enriched for modules (the regions in the brackets). Each module is a sequence segment $x_1 x_2 \cdots x_l$ in which several types of motifs (A, B, and C), each with its own product multinomial parameter ($\Theta_k$), can occur. The rates of occurrence of modules and of their motif sites are denoted by $r$ and $q_k$ ($k = 1, \ldots, K$), respectively.
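The product multinomial model in panel B can be illustrated with a minimal sketch. It is not the CisModule implementation: the function names `estimate_pwm` and `site_probability` are hypothetical, and the `pseudocount` parameter is an assumption added here for smoothing, not something stated in the caption. The sketch estimates each $\theta_i$ from column counts of aligned sites and scores a site as the product of per-position probabilities.

```python
from collections import Counter

ALPHABET = "ACGT"


def estimate_pwm(sites, pseudocount=1.0):
    """Estimate Theta = [theta_1, ..., theta_w] from aligned sites.

    `pseudocount` is an illustrative smoothing assumption, not taken
    from the source text.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same width")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(width):
        # Count how often each base occurs in column i of the alignment.
        counts = Counter(site[i] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm


def site_probability(pwm, site):
    """Probability of one site: the product of theta_i(x_i) over positions."""
    if len(site) != len(pwm):
        raise ValueError("site width must match the PWM width")
    prob = 1.0
    for theta_i, base in zip(pwm, site):
        prob *= theta_i[base]
    return prob
```

For example, `site_probability(estimate_pwm(["ACGT", "ACGA", "ACGT"], pseudocount=0.0), "ACGT")` multiplies three columns with probability 1 by a final column in which T has frequency 2/3.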

Fig. 2.

Algorithm for model fitting and motif-module identification. (A) Iterative sampling procedure. In parameter update (Left), we are given the locations of modules and motif sites. Therefore, we align the motif sites of the same type to update the PWM of that motif. In module and motif detection (Right), we use stochastic recursions (see Appendix B and text) to sample the locations of modules and motif sites, conditional on the updated parameter values. (B) The use of sampled module indicators for module identification. For each position $i$ in the sequences, compute $P_m(i)$ = the proportion of times during iterative sampling when position $i$ is within a sampled module. The positions with $P_m(i) > 0.5$ (e.g., the regions [a,b] and [c,d]) are our predicted modules. See Fig. 3A for further discussion.
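The module-identification step in panel B can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the sampler output is assumed to be a list of 0/1 indicator vectors, one per sampling iteration, where 1 means the position fell inside a sampled module. The sketch averages the indicators to obtain $P_m(i)$ and reports maximal runs of positions with $P_m(i) > 0.5$.

```python
def posterior_module_probability(samples):
    """Return P_m(i): the fraction of samples in which position i is in a module.

    `samples` is assumed to be a non-empty list of equal-length 0/1 lists.
    """
    n_samples = len(samples)
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all indicator vectors must have the same length")
    return [sum(s[i] for s in samples) / n_samples for i in range(length)]


def predicted_modules(pm, threshold=0.5):
    """Return inclusive (start, end) index pairs where pm exceeds the threshold."""
    regions = []
    start = None
    for i, p in enumerate(pm):
        if p > threshold:
            if start is None:
                start = i  # open a new region
        elif start is not None:
            regions.append((start, i - 1))  # close the current region
            start = None
    if start is not None:
        regions.append((start, len(pm) - 1))  # region runs to the sequence end
    return regions
```

Positions are 0-based here, so a returned pair such as `(1, 2)` plays the role of a region [a,b] in the caption.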

Fig. 3.

Module prediction in the Drosophila data set. (A) Marginal posterior module probability ($P_m$) plots for example sequences in the three data sets of Drosophila homotypic modules. $P_m$ is the probability of being sampled as within modules, and it is plotted as a function of the position in the sequences (the solid curves). The horizontal broken lines correspond to $P_m = 0.5$, and the sequence bases with $P_m > 0.5$ are our predicted modules. The vertical lines are the motif sites predicted by CisModule. (B) Top site density of $S(x)$ vs. cutoff value $x$. The broken vertical line at $x = 0.5$ corresponds to that of $P_m = 0.5$ in A.
