A steganalysis-based approach to comprehensive identification and characterization of functional regulatory elements

Comparative Study

Guandong Wang et al. Genome Biol. 2006.

Abstract

The comprehensive identification of cis-regulatory elements on a genome scale is a challenging problem. We develop a novel, steganalysis-based approach for genome-wide motif finding, called WordSpy, by viewing regulatory regions as a stegoscript with cis-elements embedded in 'background' sequences. We apply WordSpy to the promoters of cell-cycle-related genes of Saccharomyces cerevisiae and Arabidopsis thaliana, identifying all known cell-cycle motifs with high ranking. WordSpy can discover a complete set of cis-elements and facilitate the systematic study of regulatory networks.

Figures

Figure 1

A hidden Markov model for deciphering stegoscripts. It consists of two submodels: the 'secret message model' for motifs and the 'covertext model' for background words. Each blue box with a dashed outline represents a word node, which is a combination of several position nodes. Node $W_b$ is a single-base node and always belongs to the covertext model. States S, B, and M do not emit any letter.
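
The idea of parsing a sequence into motif words and single background bases can be illustrated with a small dynamic program. This is a minimal sketch under simplifying assumptions, not the model in the figure: motifs are treated as exact words rather than chains of position nodes, the non-emitting states are folded into the word probabilities, and the names `decipher`, `motif_sites`, `motif_probs`, and `base_probs` are hypothetical.

```python
import math


def decipher(seq, motif_probs, base_probs):
    """Return the most probable parse of seq as (start, text, label) segments.

    motif_probs: {word: probability of entering that motif word node}
    base_probs:  {base: probability of entering the single-base background
                  node and emitting that base}
    Both tables are illustrative assumptions, not values from the paper.
    """
    n = len(seq)
    best = [-math.inf] * (n + 1)  # best[i]: log-probability of the best parse of seq[:i]
    back = [None] * (n + 1)       # back[i]: (start, label) of the last segment ending at i
    best[0] = 0.0
    for i in range(n):
        # Option 1: emit one background base from the single-base node.
        score = best[i] + math.log(base_probs[seq[i]])
        if score > best[i + 1]:
            best[i + 1], back[i + 1] = score, (i, "background")
        # Option 2: emit a whole motif word that starts at position i.
        for word, p in motif_probs.items():
            if seq.startswith(word, i):
                j = i + len(word)
                score = best[i] + math.log(p)
                if score > best[j]:
                    best[j], back[j] = score, (i, "motif")
    # Trace back from the end of the sequence to recover the segments.
    segments = []
    i = n
    while i > 0:
        start, label = back[i]
        segments.append((start, seq[start:i], label))
        i = start
    segments.reverse()
    return segments


def motif_sites(seq, motif_probs, base_probs):
    """Keep only the motif segments of the best parse as (start, word) pairs."""
    return [(s, w) for s, w, label in decipher(seq, motif_probs, base_probs)
            if label == "motif"]
```

With a background probability of 0.95 × 0.25 per base and a hypothetical motif word `ACGCGT` entered with probability 0.05, the six-base motif is preferred over six consecutive background emissions, so `motif_sites("TTACGCGTAA", ...)` reports the word at offset 2.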

Figure 2

Components and flow diagram of WordSpy. Starting with $k = 1$ and a grammar $G_0$ with a single word node $W_b$ in the background, the algorithm goes through the following steps, indicated by the red numbers in the figure.

1. Model $G_{k-1}$ is optimized; the optimized model contains over-represented motifs shorter than $k$.
2. The optimized $G_{k-1}$ is used as a base model to detect over-represented exact words of length $k$.
3. Over-represented words are chosen for word clustering.
4. All the words are evaluated. Background words are selected and added to the background model. The rest of the words are clustered by similarity to form degenerate preliminary motifs.
5. The preliminary motifs are added to the motif sub-dictionary, creating a new grammar $G_k$.
6. $G_k$ is optimized.
7. The optimized $G_k$ is applied to decipher the script and locate motifs.
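
The detection of over-represented exact words of length k can be sketched as follows. This is only an illustration under a stated substitution: the caption uses the optimized grammar as the base model, whereas this sketch swaps in a much simpler zero-order background (independent base frequencies) and a binomial Z-score. The function name `overrepresented_words` and the `z_cutoff` parameter are hypothetical.

```python
import math
from collections import Counter


def overrepresented_words(sequences, k, z_cutoff=3.0):
    """Return {word: z} for exact k-mers counted more often than expected.

    Assumption: the expected count comes from a zero-order background built
    from the base frequencies of the input, not from an HMM base model.
    """
    base_counts = Counter("".join(sequences))
    total_bases = sum(base_counts.values())
    freq = {b: c / total_bases for b, c in base_counts.items()}

    # Count every overlapping window of length k in every sequence.
    word_counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    n_windows = sum(word_counts.values())

    hits = {}
    for word, count in word_counts.items():
        p = math.prod(freq[b] for b in word)      # background probability of the word
        mean = n_windows * p                      # expected number of occurrences
        sd = math.sqrt(n_windows * p * (1 - p))   # binomial standard deviation
        if sd > 0 and (count - mean) / sd >= z_cutoff:
            hits[word] = (count - mean) / sd
    return hits
```

Words that pass the cutoff would then be the candidates for the clustering step; the cutoff value itself is a free choice in this sketch.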

Figure 3

Distribution of discovered yeast motifs of length 8. The _x_-axis is the genome _Z_-score ($Z_g$-score) of a motif, which measures the motif's specificity to the cell-cycle genes. Motifs resembling known ones are marked in blue.
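
The caption does not define how the genome Z-score is computed. One common way to score a motif's specificity to a gene set, shown here purely as an assumption, is to compare the number of target promoters containing the motif with the number expected from the genome-wide frequency, using a binomial Z-score. The function name `genome_z_score` and all argument names are hypothetical.

```python
import math


def genome_z_score(target_hits, target_size, genome_hits, genome_size):
    """Binomial Z-score for motif enrichment in a target promoter set.

    Assumed definition (not taken from the paper): p is the fraction of all
    promoters in the genome that contain the motif, and the target set is
    treated as target_size independent draws with success probability p.
    """
    p = genome_hits / genome_size
    expected = target_size * p
    sd = math.sqrt(target_size * p * (1 - p))
    if sd == 0:
        raise ValueError("Z-score is undefined when the motif is in no or all promoters")
    return (target_hits - expected) / sd
```

For example, a motif found in 600 of 6000 promoters genome-wide and in 40 of 100 target promoters is expected in 10 target promoters with a standard deviation of 3, so the score is 10.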

Figure 4

Distribution of all discovered motifs from Arabidopsis cell-cycle-related genes. The _x_-axis is the genome _Z_-score ($Z_g$-score) of a motif, which measures the motif's specificity to the cell-cycle genes. The _y_-axis is the _G_-score of a motif, which measures the coherency of the expression profiles of the genes whose promoters contain the motif.

Figure 5

Selected putative Arabidopsis cell-cycle-related motifs. ID is the rank of a motif in the overall list. The third column gives the number of cell-cycle genes whose promoters contain the motif. The following four columns give the numbers of target genes in the S and M phases of the cell cycle and the corresponding P values. The last column shows the functional group with the best P value from GO analysis.

Figure 6

Distribution of the locations of putative Arabidopsis motifs. The location distribution of the top four putative motifs of length 7 in the promoters of Arabidopsis cell-cycle genes is shown.

Figure 7

Expression patterns of Arabidopsis genes associated with ACTAGCCGTT. The gene-expression profiles are highly coherent except for three outliers: AT3G61640, AT5G13100, and AT5G23480. (a) Heat-map analysis of microarray expression patterns. (b) Profile analysis of microarray expression patterns. Expression profiles are clustered into two groups. The profiles in red and in blue have similar patterns, but the profiles in red have relatively low values.

Figure 8

Distribution of the positions of the motif ACTAGCCGTT in the promoters of Arabidopsis cell-cycle genes.

Figure 9

Results of a comparison of 14 motif-detection programs in a benchmark study [17]. At the nucleotide level, sensitivity (nSn), positive predictive value (nPPV), performance coefficient (nPC), and correlation coefficient (nCC) were measured. With nTP, nFN, nFP, and nTN as the numbers of nucleotide-level true positives, false negatives, false positives, and true negatives, respectively, nSn = nTP/(nTP + nFN); nPPV = nTP/(nTP + nFP); nPC = nTP/(nTP + nFN + nFP); and $\mathrm{nCC} = \dfrac{\mathrm{nTP} \cdot \mathrm{nTN} - \mathrm{nFN} \cdot \mathrm{nFP}}{\sqrt{(\mathrm{nTP} + \mathrm{nFN})(\mathrm{nTN} + \mathrm{nFP})(\mathrm{nTP} + \mathrm{nFP})(\mathrm{nTN} + \mathrm{nFN})}}$. At the site level, sensitivity (sSn), positive predictive value (sPPV), and average site performance (sASP) were measured. With sTP, sFN, and sFP as the numbers of site-level true positives, false negatives, and false positives, respectively, sSn = sTP/(sTP + sFN); sPPV = sTP/(sTP + sFP); and sASP = (sSn + sPPV)/2.
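
The measures in this caption translate directly into code. The sketch below follows the formulas as stated, with the nCC denominator taken as the usual square root of the product of the four marginal sums. The function names and the choice to return 0.0 for a zero denominator are assumptions made for illustration.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def nucleotide_metrics(ntp, nfn, nfp, ntn):
    """Nucleotide-level scores from the four confusion counts."""
    denom = math.sqrt((ntp + nfn) * (ntn + nfp) * (ntp + nfp) * (ntn + nfn))
    return {
        "nSn": _ratio(ntp, ntp + nfn),             # sensitivity
        "nPPV": _ratio(ntp, ntp + nfp),            # positive predictive value
        "nPC": _ratio(ntp, ntp + nfn + nfp),       # performance coefficient
        "nCC": _ratio(ntp * ntn - nfn * nfp, denom),  # correlation coefficient
    }


def site_metrics(stp, sfn, sfp):
    """Site-level scores; sASP is the mean of sensitivity and PPV."""
    ssn = _ratio(stp, stp + sfn)
    sppv = _ratio(stp, stp + sfp)
    return {"sSn": ssn, "sPPV": sppv, "sASP": (ssn + sppv) / 2}
```

As a usage example, `nucleotide_metrics(30, 10, 20, 940)` gives nSn = 0.75, nPPV = 0.6, and nPC = 0.5, and `site_metrics(6, 2, 4)` gives sASP = 0.675.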

Figure 10

True positives and false positives of the 14 motif-detection programs compared. (a) Nucleotide-level true positive (nTP) is the number of nucleotide positions in both known sites and predicted sites; nucleotide-level false positive (nFP) is the number of nucleotide positions not in known sites but in predicted sites. (b) Site-level true positive (sTP) is the number of known sites overlapped by predicted sites; site-level false positive (sFP) is the number of predicted sites not overlapped by known sites.
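
The counts defined in this caption can be computed from site coordinates. The sketch below assumes sites on a single sequence given as half-open `(start, end)` intervals and treats any shared nucleotide as an overlap; the benchmark may apply a stricter overlap criterion, so that threshold is an assumption. The function names are hypothetical.

```python
def nucleotide_counts(known, predicted):
    """Return (nTP, nFP) from lists of half-open (start, end) site intervals."""
    known_pos = {p for start, end in known for p in range(start, end)}
    pred_pos = {p for start, end in predicted for p in range(start, end)}
    ntp = len(known_pos & pred_pos)  # positions in both known and predicted sites
    nfp = len(pred_pos - known_pos)  # positions predicted but not in any known site
    return ntp, nfp


def _overlaps(a, b):
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def site_counts(known, predicted):
    """Return (sTP, sFP) using any-overlap as the matching rule (an assumption)."""
    stp = sum(any(_overlaps(k, p) for p in predicted) for k in known)
    sfp = sum(not any(_overlaps(p, k) for k in known) for p in predicted)
    return stp, sfp
```

With known sites `[(10, 18), (40, 48)]` and predicted sites `[(12, 20), (60, 68)]`, six nucleotides are shared and ten predicted nucleotides fall outside known sites, while one known site is recovered and one predicted site is spurious.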

References

    1. Lemon B, Tjian R. Orchestrated response: a symphony of transcription factors for gene control. Genes Dev. 2000;14:2551–2569.
    2. Segal E, Yelensky R, Koller D. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics. 2003;19 Suppl 1:273–282.
    3. Tamada Y, Kim S, Bannai H, Imoto S, Tashiro K, Kuhara S, Miyano S. Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics. 2003;19 Suppl 2:II227–II236.
    4. Lawrence C, Altschul S, Boguski M, Liu J, Neuwald A, Wootton J. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262:208–214.
    5. Bailey T, Elkan C. Unsupervised learning of multiple motifs in biopolymers using EM. Machine Learning. 1995;21:51–80.
