Design and analysis of ChIP-seq experiments for DNA-binding proteins

Nat Biotechnol. 2008 Dec;26(12):1351-9.

doi: 10.1038/nbt.1508. Epub 2008 Nov 16.

Peter V Kharchenko et al. Nat Biotechnol. 2008 Dec.

Abstract

Recent progress in massively parallel sequencing platforms has enabled genome-wide characterization of DNA-associated proteins using the combination of chromatin immunoprecipitation and sequencing (ChIP-seq). Although a variety of methods exist for analysis of the established alternative ChIP microarray (ChIP-chip), few approaches have been described for processing ChIP-seq data. To fill this gap, we propose an analysis pipeline specifically designed to detect protein-binding positions with high accuracy. Using previously reported data sets for three transcription factors, we illustrate methods for improving tag alignment and correcting for background signals. We compare the sensitivity and spatial precision of three peak detection algorithms with published methods, demonstrating gains in spatial precision when an asymmetric distribution of tags on positive and negative strands is considered. We also analyze the relationship between the depth of sequencing and characteristics of the detected binding positions, and provide a method for estimating the sequencing depth necessary for a desired coverage of protein binding sites.

Figures

Figure 1

a. Main steps of the proposed ChIP-seq processing pipeline. b. A schematic illustration of ChIP-seq measurements. DNA is fragmented or digested, and fragments cross-linked to the protein of interest are selected with IP. The 5’ ends (squares) of the selected fragments are sequenced, typically forming groups of positive- and negative-strand tags on the two sides of the protected region. The dashed red line illustrates a fragment generated from a long cross-link that may account for the tag patterns observed in the CTCF and STAT1 datasets. c. Tag distribution around a stable NRSF binding position. Vertical lines show the number of tags (right axis) whose 5’ position maps to a given location on the positive (red) or negative (blue) strand. Positive and negative values on the y-axis denote tags mapping to the positive and negative strands, respectively. The solid curves show tag density for each strand (left axis, based on a Gaussian kernel with σ = 15 bp). d. Strand cross-correlation for the NRSF data. The y-axis shows the Pearson linear correlation coefficient between genome-wide tag density profiles of the positive and negative strands, shifted relative to each other by the distance specified on the x-axis. The peak position (red vertical line) indicates the typical distance separating positive- and negative-strand peaks associated with stable binding positions.
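The strand cross-correlation in panel d can be illustrated with a minimal sketch. It assumes per-base tag counts held in plain Python lists rather than the kernel-smoothed, genome-wide densities the caption describes, and the function names (`pearson`, `strand_cross_correlation`, `peak_shift`) and the toy data are hypothetical, not taken from the paper's software.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile; report 0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(pos, neg, max_shift):
    """Map each shift d in 0..max_shift to the correlation of pos[i] with neg[i + d]."""
    profile = {}
    for d in range(max_shift + 1):
        x = pos[: len(pos) - d]  # positive-strand profile, trimmed to the overlap
        y = neg[d:]              # negative-strand profile moved upstream by d
        profile[d] = pearson(x, y)
    return profile


def peak_shift(profile):
    """Shift at which the cross-correlation is highest."""
    return max(profile, key=profile.get)


# Toy profiles (made up for illustration): each negative-strand peak
# sits 7 bp downstream of its positive-strand partner.
pos = [0] * 60
neg = [0] * 60
pos[10], pos[40] = 5, 3
neg[17], neg[47] = 5, 3

profile = strand_cross_correlation(pos, neg, max_shift=20)
print(peak_shift(profile))
```

On these toy profiles the correlation peaks at the 7 bp offset built into the data, which plays the role of the red vertical line in panel d.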

Figure 2. Selecting informative tag classes based on the change in strand cross-correlation magnitude

For each class of tag alignment quality listed in Table 1, the plots show the change in the mean strand cross-correlation profile when this class of tags is considered together with the base class of perfectly aligned tags (25 bp, no mismatches). The three plots correspond to tag classes (a) without mismatches, (b) with a single mismatch, and (c) with two mismatches. Informative tag classes improve cross-correlation (marked by *) and are incorporated into the final tag set. The y-axis gives the mean change in the cross-correlation profile within 40 bp around the cross-correlation peak (Figure 1d).

Figure 3. Examples of anomalies in background tag distributions

a. Singular positions with extremely high tag count. b. Larger, non-uniform regions of increased background tag density. c. Background tag density patterns resembling true protein binding positions. Each plot shows density of tags from ChIP and input samples. The tag histograms give combined tag counts.

Figure 4. Binding position detection methods and their relative sensitivity

a. Schematic illustration of the Window Tag Density (WTD) method. To identify positions with the tag pattern expected from strong binding, the method calculates the difference between the geometric average of the tag counts within the regions marked in orange (p1 and n2) and the average tag count within the regions marked in green (n1 and p2). b. The Matching Strand Peaks (MSP) method first identifies local maxima on the positive and negative strands (open circles) and then determines positions where two such peaks are present in the right order, with the expected separation and comparable magnitude. c. The Mirror Tag Correlation (MTC) method is based on the mirror correlation of positive- and negative-strand tag densities. The mirror image of the negative-strand tag density is shown by the dashed blue line. Tags within 15 bp of the center position are omitted. d. Coverage of high-confidence NRSF motif matches by top peaks. The plot shows the fraction of motif instances that coincide (within 50 bp) with identified binding positions, as a function of the increasing number of top binding positions identified by different methods. Most methods, except for MSP and CSP, achieve similarly high coverage.
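The WTD score in panel a can be sketched as follows. The caption does not spell out the window layout, so this sketch assumes that p1 and n1 are the positive- and negative-strand windows immediately upstream of the candidate position, that p2 and n2 are the corresponding downstream windows, and that all four share one width `w`. The function names and toy data are hypothetical.

```python
from math import sqrt


def window_sum(counts, start, end):
    """Sum of per-base tag counts in [start, end), clipped to the array bounds."""
    start, end = max(start, 0), min(end, len(counts))
    return sum(counts[start:end]) if start < end else 0


def wtd_score(pos, neg, center, w):
    """Geometric average of (p1, n2) minus arithmetic average of (n1, p2).

    Assumed layout: window 1 is [center - w, center), window 2 is
    [center, center + w); 'p' is the positive strand, 'n' the negative.
    """
    p1 = window_sum(pos, center - w, center)   # expected signal, upstream
    n2 = window_sum(neg, center, center + w)   # expected signal, downstream
    n1 = window_sum(neg, center - w, center)   # tags on the unexpected side
    p2 = window_sum(pos, center, center + w)   # tags on the unexpected side
    return sqrt(p1 * n2) - (n1 + p2) / 2


# Toy data (made up): 10 positive-strand tags just upstream of position 50
# and 10 negative-strand tags just downstream of it.
pos = [0] * 100
neg = [0] * 100
for i in range(40, 50):
    pos[i] = 1
for i in range(50, 60):
    neg[i] = 1

scores = [wtd_score(pos, neg, c, w=10) for c in range(100)]
print(scores.index(max(scores)))
```

Using the geometric average for the expected windows means a position scores well only when both strands contribute, so a pile of tags on a single strand yields no signal.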

Figure 5. Accuracy of determined binding positions

a. Distribution of distances between high-confidence NRSF motif instances and locations of binding positions identified by different methods. The standard deviation of the resulting distribution (σ) is shown for each method. Only motifs containing a binding position within 100 bp were considered. b–d. The fraction of the identified binding positions within 10 bp of the motif position is shown for an increasing number of top binding positions identified by different methods. Only binding positions occurring within 300 bp of a sequence motif instance are included in the analysis. The median distance to the motif center was subtracted for each method to account for the non-central position of the sequence motif relative to the center of the protected binding region (see Methods). The MTC method achieves the highest accuracy for CTCF and STAT1; however, WTD gives more accurate positions for NRSF binding.

Figure 6. Analysis of sequencing depth

a. Given the NRSF binding positions determined using the complete dataset (y-axis), the black curve shows the fraction of positions that can be predicted (within 50 bp) using smaller portions of the tag data (x-axis). All of the binding predictions are generated at an FDR of 0.01 using the WTD method. The curve does not reach a horizontal asymptote, indicating that the set of detected NRSF binding sites has not stabilized at the current sequencing depth. The additional curves limit the analysis to binding positions whose fold enrichment ratio over the background is significantly (P < 0.05) higher than 7.5 (MSER: Minimal Saturated Enrichment Ratio, dashed line) and 30 (dotted line). The observed enrichment ratios are evaluated independently for each tag subsample (x-axis). b. Distribution of tag counts around high-confidence NRSF motif positions. Positions with zero tags were not included. c. The relationship between the MSER of the detected binding positions and sequencing depth (expressed as a fraction of the complete dataset). The dashed gray line shows a log-log model that can be used to estimate the sequencing depth required to saturate detection of binding positions with a lower fold-enrichment ratio. By that estimate, 1.2×10⁶ more sequence tags would be necessary to saturate detection of binding positions that are two-fold enriched over background (MSER = 2 corresponds to y = 0, at which the red line crosses the x-axis: x = 2.8×10⁶).
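The extrapolation in panel c can be sketched as a straight-line fit in log-log space. The caption does not give the model's exact form. This sketch assumes y = log10(MSER − 1) and x = log10(tag count), one reading consistent with the statement that MSER = 2 corresponds to y = 0. The function names and the input values are hypothetical and are not the paper's measurements.

```python
from math import log10


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def depth_for_mser(tag_counts, msers, target_mser=2.0):
    """Extrapolate the tag count at which the fitted line reaches target_mser.

    Assumes the transform y = log10(MSER - 1), so every MSER value and the
    target must be greater than 1.
    """
    xs = [log10(n) for n in tag_counts]
    ys = [log10(m - 1.0) for m in msers]
    slope, intercept = fit_line(xs, ys)
    y_target = log10(target_mser - 1.0)
    return 10 ** ((y_target - intercept) / slope)


# Made-up (depth, MSER) pairs that follow an exact power law,
# MSER - 1 = (depth / 1e6) ** -2, chosen only to exercise the fit.
depths = [1e5, 2e5, 5e5]
msers = [101.0, 26.0, 5.0]
print(depth_for_mser(depths, msers, target_mser=2.0))
```

Subtracting the number of tags already sequenced from the extrapolated depth gives the additional tags needed, which is the kind of figure the caption reports.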
