Motif Enrichment Analysis: a unified framework and an evaluation on ChIP data

Robert C McLeay et al. BMC Bioinformatics. 2010.

Abstract

Background: A major goal of molecular biology is determining the mechanisms that control the transcription of genes. Motif Enrichment Analysis (MEA) seeks to determine which DNA-binding transcription factors control the transcription of a set of genes by detecting enrichment of known binding motifs in the genes' regulatory regions. Typically, the biologist specifies a set of genes believed to be co-regulated and a library of known DNA-binding models for transcription factors, and MEA determines which (if any) of the factors may be direct regulators of the genes. Since the number of factors with known DNA-binding models is rapidly increasing as a result of high-throughput technologies, MEA is becoming increasingly useful. In this paper, we explore ways to make MEA applicable in more settings, and evaluate the efficacy of a number of MEA approaches.

Results: We first define a mathematical framework for Motif Enrichment Analysis that relaxes the requirement that the biologist input a selected set of genes. Instead, the input consists of all regulatory regions, each labeled with the level of a biological signal. We then define and implement a number of motif enrichment analysis methods. Some of these methods require a user-specified signal threshold, some identify an optimum threshold in a data-driven way, and two of our methods are threshold-free. We evaluate these methods, along with two existing methods (Clover and PASTAA), using yeast ChIP-chip data. Our novel threshold-free method based on linear regression performs best in our evaluation, followed by the data-driven PASTAA algorithm. The Clover algorithm performs as well as PASTAA if the user-specified threshold is chosen optimally. Data-driven methods based on three statistical tests (Fisher Exact Test, rank-sum test, and multi-hypergeometric test) perform poorly, even when the threshold is chosen optimally. These methods (and Clover) perform even worse when unrestricted data-driven threshold determination is used.
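To make the fixed-threshold setting concrete, the following is a minimal sketch of a Fisher-exact-style enrichment test. It is an illustration under stated assumptions, not the AME implementation: it assumes each regulatory region carries a signal value where smaller means stronger evidence of binding (as with the fluorescence p-values in the ChIP-chip data) and a yes/no flag for whether the motif occurs in the region. The function names and the boolean motif-hit representation are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n_pos, n_hit, n_total):
    """P(X >= k), where X is the number of motif-containing regions among
    n_pos regions drawn without replacement from n_total regions, of which
    n_hit contain the motif (one-sided Fisher exact test)."""
    upper = min(n_pos, n_hit)
    total = comb(n_total, n_pos)
    favourable = sum(
        comb(n_hit, i) * comb(n_total - n_hit, n_pos - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def fixed_partition_enrichment(signals, has_motif, t_y):
    """Label regions with signal < t_y as 'positive' (assumed convention),
    then test whether motif hits are over-represented among them."""
    positive = [s < t_y for s in signals]
    n_total = len(signals)
    n_pos = sum(positive)
    n_hit = sum(has_motif)
    k = sum(1 for p, m in zip(positive, has_motif) if p and m)
    return hypergeom_upper_tail(k, n_pos, n_hit, n_total)


if __name__ == "__main__":
    # Toy data: three strongly bound regions, all containing the motif.
    signals = [0.0001, 0.0005, 0.0002, 0.5, 0.9, 0.3]
    has_motif = [True, True, True, False, False, False]
    print(fixed_partition_enrichment(signals, has_motif, t_y=0.001))
```

The result depends directly on the choice of `t_y`, which is the sensitivity the threshold-free and data-driven methods try to remove.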

Conclusions: Our novel, threshold-free linear regression method works well on ChIP-chip data. Methods using data-driven threshold determination can perform poorly unless the range of thresholds is limited a priori. The limits implemented in PASTAA, however, appear to be well-chosen. Our novel algorithms, AME (Analysis of Motif Enrichment), are available at http://bioinformatics.org.au/ame/.
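The threshold-free idea can be illustrated with a small regression sketch. The abstract does not specify how the linear regression method scores motifs, so everything here is an assumption made for illustration: each motif is assumed to have a per-region affinity score, the biological signal is regressed on that score by ordinary least squares, and motifs are ranked by mean squared residual error. The function names are hypothetical.

```python
def least_squares_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.
    Returns (slope, intercept, mean squared residual error)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("motif scores are constant; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    mse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / n
    return slope, intercept, mse


def rank_motifs_by_regression(motif_scores, signals):
    """Rank motif names so that the motif whose scores best explain the
    signal (smallest residual error) comes first. No threshold on the
    signal is needed, because every region contributes to the fit."""
    errors = {
        name: least_squares_fit(scores, signals)[2]
        for name, scores in motif_scores.items()
    }
    return sorted(errors, key=errors.get)


if __name__ == "__main__":
    signals = [1.0, 2.0, 3.0, 4.0]
    motif_scores = {
        "motif_A": [0.1, 0.2, 0.3, 0.4],  # tracks the signal closely
        "motif_B": [0.4, 0.1, 0.3, 0.2],  # unrelated to the signal
    }
    print(rank_motifs_by_regression(motif_scores, signals))
```

Because all regions enter the fit with their measured signal, no partition into "positive" and "negative" sets is required.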

Figures

Figure 1

Accuracy of MEA methods using fixed Y partitions. The ability of different MEA methods to correctly rank the known TF motif in 237 yeast ChIP-chip experiments is shown. Each point corresponds to the mean (Panel a) or the median (Panel b) percentile rank accuracy (PRA) of an MEA method on all ChIP-chip datasets that contain at least one sequence with a fluorescence _p_-value less than the value of t_y (_X_-axis). Increasing _X_ values correspond to relaxing the threshold for a sequence to be considered bound by a TF. To the right of the vertical line, all 237 sets are included; to the left, increasingly fewer sets are included at stricter t_y thresholds.

Figure 2

Accuracy of MEA methods using unconstrained-_Y_-partition-maximization. The ability of different MEA methods to correctly rank the known TF motif in 237 yeast ChIP-chip experiments is shown. The mean percentile rank accuracy of unconstrained-_Y_-partition-maximization (YUPM, blue bars) and fixed-partition (YFP, red bars, t_y = 0.001) variants of four MEA methods is shown. Error bars show standard error.

Figure 3

Accuracy of the mHG method constrained to at most 300 positive sequences. The ability of three variants of the mHG method to correctly rank the known TF motif in 237 yeast ChIP-chip experiments is shown. Each bar represents the mean PRA of one variant of the method. The bar labeled mHG-YDRIM shows accuracy using partition maximization, limited to partitions with a maximum of 300 "positive" sequences. The other two bars show accuracy using the fixed partition method with t_y = 0.001 (mHG-YFP) and unconstrained partition maximization (mHG-YUPM), respectively.

Figure 4

Accuracy of MEA methods using constrained partition-maximization. The ability of different MEA methods to correctly rank the known TF motif in 237 yeast ChIP-chip experiments is shown. Each panel shows the accuracy of the Y constrained partition maximization (YCPM) of a method, along with the fixed partition (YFP) variant's accuracy for comparison. Each point shows the mean or median PRA (_Y_-axis) of the MEA method. For YCPM methods, the _X_-axis of the plot is the maximum value, b, that t_y may assume; for YFP methods, it is the method's fixed threshold, t_y.

Figure 5

Accuracy of a partition-free MEA method. The ability of different MEA methods to correctly rank the known TF motif in 237 yeast ChIP-chip experiments is shown. Each bar shows the mean PRA of the given MEA method on all 237 ChIP-chip datasets. Error bars show standard error. The LR method is partition free. PASTAA uses X and Y constrained partition maximization with a maximum of 1000 sequences in the "positive" sets. All fixed-partition (YFP) methods use a threshold of t_y = 0.001.
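The difference between unconstrained and constrained partition maximization described in the captions above can be sketched as a scan over candidate thresholds. This is a hedged illustration only: it assumes a one-sided Fisher exact test as the underlying statistic, takes each observed signal value as a candidate t_y, and uses an optional upper bound `b` on t_y for the constrained variant. The function names and the use of `<=` (so that every candidate yields a non-empty "positive" set) are choices made for this example, not details taken from the paper.

```python
from math import comb


def hypergeom_upper_tail(k, n_pos, n_hit, n_total):
    """One-sided Fisher exact test: P(X >= k) motif hits among n_pos
    positives, given n_hit hits in n_total regions overall."""
    upper = min(n_pos, n_hit)
    favourable = sum(
        comb(n_hit, i) * comb(n_total - n_hit, n_pos - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_pos)


def partition_maximization(signals, has_motif, b=None):
    """Try every observed signal value as the threshold t_y and keep the
    one with the smallest p-value. If b is given, only thresholds
    t_y <= b are considered (constrained variant).
    Returns (best_p, best_t_y); best_t_y is None if no candidate exists."""
    n_total = len(signals)
    n_hit = sum(has_motif)
    candidates = sorted(set(signals))
    if b is not None:
        candidates = [t for t in candidates if t <= b]
    best_p, best_t = 1.0, None
    for t in candidates:
        positive = [s <= t for s in signals]
        n_pos = sum(positive)
        k = sum(1 for p, m in zip(positive, has_motif) if p and m)
        p_value = hypergeom_upper_tail(k, n_pos, n_hit, n_total)
        if p_value < best_p:
            best_p, best_t = p_value, t
    return best_p, best_t


if __name__ == "__main__":
    signals = [0.0001, 0.0002, 0.0005, 0.3, 0.5, 0.9]
    has_motif = [True, True, True, False, False, False]
    print(partition_maximization(signals, has_motif))            # unconstrained
    print(partition_maximization(signals, has_motif, b=0.0002))  # constrained
```

Note that the minimum over many thresholds is an optimized score rather than a calibrated p-value, since it is selected from multiple tests.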
