Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks

David A Nix et al. BMC Bioinformatics. 2008.

Abstract

Background: High-throughput signature sequencing holds many promises, one of which is the ready identification of in vivo transcription factor binding sites, histone modifications, changes in chromatin structure and patterns of DNA methylation across entire genomes. In these experiments, chromatin immunoprecipitation is used to enrich for particular DNA sequences of interest and signature sequencing is used to map the regions to the genome (ChIP-Seq). Elucidation of these sites of DNA-protein binding/modification is proving instrumental in reconstructing networks of gene regulation and chromatin remodelling that direct development, response to cellular perturbation, and neoplastic transformation.

Results: Here we present a package of algorithms and software that makes use of control input data to reduce false positives and estimate confidence in ChIP-Seq peaks. Several different methods were compared using two simulated spike-in datasets. Use of control input data and a normalized difference score was found to more than double the recovery of ChIP-Seq peaks at a 5% false discovery rate (FDR). Moreover, both a binomial p-value/q-value and an empirical FDR were found to predict the true FDR within 2- to 3-fold and are more reliable estimators of confidence than a global Poisson p-value. These methods were then used to reanalyze Johnson et al.'s neuron-restrictive silencer factor (NRSF) ChIP-Seq data without relying on extensive qPCR-validated NRSF sites and the presence of NRSF binding motifs for setting thresholds.

Conclusion: The methods developed and tested here show considerable promise for reducing false positives and estimating confidence in ChIP-Seq data without any prior knowledge of the chIP target. They are part of a larger open source package freely available from http://useq.sourceforge.net/.

Figures

Figure 1

Systematic bias. Integrated Genome Browser display of chromosome 1 showing four window read count tracks derived from Johnson et al.'s NRSF H. sapiens ChIP-Seq data (A). The datasets were subsampled to contain matching numbers of reads, and the number of reads falling within a sliding 100 bp window was plotted across each chromosome. Both the unamplified and amplified control input datasets show obvious as well as subtle regions with more mapped reads than expected at random. Expanded views of the input data tracks for pericentric heterochromatic regions on chromosomes 1 (B) and 7 (C), along with UCSC's RepeatMasker tracks, show that satellite (red) repeats overlap some but not all regions of apparent mapped sequence enrichment. This systematic bias is also apparent within genes and at transcription start sites (D). The degree of bias varies by dataset. For example, panel E, derived from Valouev et al.'s GABP ChIP-Seq data, shows very pronounced transcription start site read enrichment in both the control input and the chIP sample.

Figure 2

Impact of systematic bias on the number of false positives. The number of false positives due to systematic bias can be substantial. Figure 2 plots the number of false positives in the control unamplified input data from Johnson et al.'s NRSF study as a function of the number of window reads and the Bonferroni-corrected global Poisson p-values.
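
As a rough illustration of the global Poisson test referred to in this caption, the sketch below computes an upper-tail Poisson p-value for a window read count and applies a Bonferroni correction. It is not the USeq implementation: the function names, the uniform-background assumption behind `expected_window_reads`, and the numbers in the usage lines are all hypothetical.

```python
import math


def expected_window_reads(total_reads, window_bp, genome_bp):
    """Expected reads per window if reads were spread uniformly (assumption)."""
    return total_reads * window_bp / genome_bp


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    # Probability mass at k, computed in log space for stability.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Add successive terms until they no longer change the sum.
    while term > 0.0 and term > total * 1e-16:
        total += term
        i += 1
        term *= lam / i
    return min(1.0, total)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 2 million reads, 100 bp windows, 3 Gbp genome.
lam = expected_window_reads(2_000_000, 100, 3_000_000_000)
p = poisson_upper_tail(5, lam)
print(bonferroni(p, n_tests=30_000_000))
```

Because the test assumes a single genome-wide background rate, any region where the input itself is enriched will receive a small p-value, which is the source of the false positives plotted here.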

Figure 3

Performance of different window summary statistics. A comparison of the TPR against the FDR for four different window summary statistics associated with the low (A) and high (B) spike-in datasets. Each of the window scanning statistics was used to score the mapped spike-in data (Sum: sum of the chIP reads within the window, no input; Diff: difference between the number of chIP reads and input reads; NormDiff: normalized difference, the difference divided by the square root of the sum; BinPVal: binomial p-value; BinPVal Min 10: binomial p-value using a prefiltered dataset where only windows with 10 or more total reads from the chIP and input datasets were scored). Multiple sets of enriched regions were generated over a range of thresholds. Each set was then intersected with the appropriate spike-in key and the TPR and FDR plotted.
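
The window statistics named in this caption can be sketched as follows. This is a minimal illustration rather than the package's code: the function names are hypothetical, and the binomial p-value is assumed here to be a one-sided test with success probability 0.5, which presumes the chIP and input datasets contain matching numbers of reads.

```python
import math


def diff(chip, inp):
    """Diff: chIP reads minus input reads in the window."""
    return chip - inp


def norm_diff(chip, inp):
    """NormDiff: the difference divided by the square root of the sum."""
    total = chip + inp
    if total == 0:
        return 0.0  # empty window: no evidence either way (assumption)
    return (chip - inp) / math.sqrt(total)


def binomial_pvalue(chip, inp, p=0.5):
    """One-sided P(X >= chip) for X ~ Binomial(chip + inp, p).

    p = 0.5 assumes equal sequencing depth in chIP and input (assumption).
    """
    n = chip + inp
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(chip, n + 1)
    )


def passes_min_reads(chip, inp, min_reads=10):
    """Prefilter in the spirit of 'BinPVal Min 10': keep windows with enough reads."""
    return chip + inp >= min_reads


print(diff(9, 7), norm_diff(9, 7), binomial_pvalue(9, 7), passes_min_reads(9, 7))
```

Dividing by the square root of the sum puts windows with very different read depths on a comparable scale, so a difference of 2 reads counts for more in a window of 4 reads than in a window of 400.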

Figure 4

Performance of two confidence estimations. A comparison of two FDR estimations against the real FDR for the two spike-in datasets with low and high numbers of reads. Empirical FDRs (eFDR) were calculated and plotted (A) for a variety of thresholds by dividing the number of control enriched regions (input1 vs. input2) by the number of chIP enriched regions (chIP vs. input2). The actual FDR was calculated by intersecting each enriched region set with the spike-in key. The q-value FDR estimation (B) is made by calculating binomial p-values for each window and applying the Storey q-value FDR approximation. For each of the spike-in datasets, windows were generated using two different minimum numbers of reads, either 1 or 10. The latter represents a filtered dataset that improves the q-value estimation at the cost of test sensitivity (see also Figure 2A).
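
The two confidence estimates described in this caption can be sketched as below. The eFDR function follows the ratio given in the caption; capping it at 1 and returning `None` when no chIP regions pass are choices made for this sketch. The q-value function is a simplification, not Storey's full procedure: it fixes the null proportion pi0 at 1, which reduces to the Benjamini-Hochberg adjustment. All names and the example scores are hypothetical.

```python
def empirical_fdr(chip_scores, control_scores, threshold):
    """eFDR: control regions (input1 vs. input2) passing the threshold,
    divided by chIP regions (chIP vs. input2) passing the same threshold."""
    n_chip = sum(1 for s in chip_scores if s >= threshold)
    n_control = sum(1 for s in control_scores if s >= threshold)
    if n_chip == 0:
        return None  # undefined when nothing is called (sketch choice)
    return min(1.0, n_control / n_chip)


def qvalues_pi0_one(pvalues):
    """Simplified q-values with pi0 fixed at 1 (Benjamini-Hochberg).

    Storey's method additionally estimates pi0 from the p-value
    distribution; that step is omitted here.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        q[idx] = running_min
    return q


# Hypothetical region scores and window p-values.
print(empirical_fdr([5, 4, 3, 2, 1], [3, 1], threshold=2))
print(qvalues_pi0_one([0.01, 0.04, 0.03, 0.5]))
```

Fixing pi0 at 1 makes the q-values conservative relative to Storey's estimate, since any pi0 below 1 would scale them down.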

References

    1. Collas P, Dahl JA. Chop it, ChIP it, check it: the current status of chromatin immunoprecipitation. Front Biosci. pp. 929–43.
    2. Bhinge AA, Kim J, Euskirchen GM, Snyder M, Iyer VR. Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Res. 2007;17(6):910–6. doi: 10.1101/gr.5574907.
    3. Ng P, Wei CL, Ruan Y. Paired-end diTagging for transcriptome and genome analysis. Curr Protoc Mol Biol. 2007;Chapter 21
    4. Bulyk ML. DNA microarray technologies for measuring protein-DNA interactions. Curr Opin Biotechnol. 2006;17(4):422–30. doi: 10.1016/j.copbio.2006.06.015.
    5. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. pp. 1497–502.
