Design and analysis of ChIP-seq experiments for DNA-binding proteins
Nat Biotechnol. 2008 Dec;26(12):1351-9.
doi: 10.1038/nbt.1508. Epub 2008 Nov 16.
- PMID: 19029915
- PMCID: PMC2597701
Peter V Kharchenko et al. Nat Biotechnol. 2008 Dec.
Abstract
Recent progress in massively parallel sequencing platforms has enabled genome-wide characterization of DNA-associated proteins using the combination of chromatin immunoprecipitation and sequencing (ChIP-seq). Although a variety of methods exist for analysis of the established alternative ChIP microarray (ChIP-chip), few approaches have been described for processing ChIP-seq data. To fill this gap, we propose an analysis pipeline specifically designed to detect protein-binding positions with high accuracy. Using previously reported data sets for three transcription factors, we illustrate methods for improving tag alignment and correcting for background signals. We compare the sensitivity and spatial precision of three peak detection algorithms with published methods, demonstrating gains in spatial precision when an asymmetric distribution of tags on positive and negative strands is considered. We also analyze the relationship between the depth of sequencing and characteristics of the detected binding positions, and provide a method for estimating the sequencing depth necessary for a desired coverage of protein binding sites.
Figures
Figure 1
a. Main steps of the proposed ChIP-seq processing pipeline. b. A schematic illustration of ChIP-seq measurements. DNA is fragmented or digested, and fragments cross-linked to the protein of interest are selected with IP. The 5’ ends (squares) of the selected fragments are sequenced, typically forming groups of positive- and negative-strand tags on the two sides of the protected region. The dashed red line illustrates a fragment generated from a long cross-link that may account for the tag patterns observed in the CTCF and STAT1 datasets. c. Tag distribution around a stable NRSF binding position. Vertical lines show the number of tags (right axis) whose 5’ position maps to a given location on the positive (red) or negative (blue) strand. Positive and negative values on the y-axis are used to illustrate tags mapping to the positive and negative strands, respectively. The solid curves show tag density for each strand (left axis, based on a Gaussian kernel with σ = 15 bp). d. Strand cross-correlation for the NRSF data. The y-axis shows the Pearson linear correlation coefficient between genome-wide profiles of tag density on the positive and negative strands, shifted relative to each other by the distance specified on the x-axis. The peak position (red vertical line) indicates the typical distance separating positive- and negative-strand peaks associated with stable binding positions.
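The strand cross-correlation described in panel d can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it correlates raw per-base tag counts rather than kernel-smoothed densities, and the function names (`tag_profile`, `pearson`, `strand_cross_correlation`) and the toy tag positions are all assumptions made for the example.

```python
import math

def tag_profile(positions, length):
    """Count 5' tag ends at each base of a region of the given length."""
    profile = [0] * length
    for p in positions:
        if 0 <= p < length:
            profile[p] += 1
    return profile

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus[i] with minus[i + d] for each shift d in 0..max_shift."""
    result = {}
    for d in range(max_shift + 1):
        x = plus[:len(plus) - d]   # positive-strand profile
        y = minus[d:]              # negative-strand profile moved upstream by d
        result[d] = pearson(x, y)
    return result

# Toy data (hypothetical): three binding sites whose positive-strand tags lie
# 30 bp upstream and negative-strand tags 30 bp downstream of the site.
plus_tags = [170] * 3 + [470] * 2 + [770] * 4
minus_tags = [230] * 3 + [530] * 2 + [830] * 4
plus = tag_profile(plus_tags, 1000)
minus = tag_profile(minus_tags, 1000)

cc = strand_cross_correlation(plus, minus, 150)
best_shift = max(cc, key=cc.get)   # shift with the highest correlation
print(best_shift, round(cc[best_shift], 3))
```

In this toy case the correlation peaks at the 60 bp separation built into the data; on real data the profile is typically smoothed first, as the caption notes.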
Figure 2. Selecting informative tag classes based on the change in strand cross-correlation magnitude
For each class of tag alignment quality listed in Table 1, the plots show the change in the mean strand cross-correlation profile when this class of tags is considered together with the base class of perfectly aligned tags (25 bp, no mismatches). The three plots correspond to tag classes (a) without mismatches, (b) with a single mismatch, and (c) with two mismatches. Informative tag classes improve cross-correlation (marked by *) and are incorporated into the final tag set. The y-axis gives the mean change in the cross-correlation profile within 40 bp around the cross-correlation peak (Figure 1d).
Figure 3. Examples of anomalies in background tag distributions
a. Singular positions with extremely high tag count. b. Larger, non-uniform regions of increased background tag density. c. Background tag density patterns resembling true protein binding positions. Each plot shows density of tags from ChIP and input samples. The tag histograms give combined tag counts.
Figure 4. Binding position detection methods and their relative sensitivity
a. Schematic illustration of the Window Tag Density (WTD) method. To identify positions with a tag pattern expected from strong binding, the method calculates the difference between the geometric average of the tag counts within the regions marked in orange (p1 and n2) and the average tag count within the regions marked in green (n1 and p2). b. The Matching Strand Peaks (MSP) method first identifies local maxima on the positive and negative strands (open circles) and then determines positions where two such peaks are present in the right order, with the expected separation and comparable magnitude. c. The Mirror Tag Correlation (MTC) method is based on the mirror correlation of positive- and negative-strand tag densities. The mirror image of the negative-strand tag density is shown by the dashed blue line. Tags within 15 bp of the center position are omitted. d. Coverage of high-confidence NRSF motif matches by top peaks. The plot shows the fraction of motif instances that coincide (within 50 bp) with identified binding positions, as a function of the increasing number of top binding positions identified by different methods. Most methods, except for MSP and CSP, are able to achieve similarly high coverage.
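A rough sketch of a WTD-style score, following the wording of panel a. The window layout is an assumption, since the figure itself is not reproduced here: p1 and n1 are taken to be the positive- and negative-strand tag counts in a window upstream of the candidate position, and p2 and n2 the corresponding counts downstream. The helper names, the window size and the toy tags are hypothetical, and the published method may scale or normalize the terms differently.

```python
import math
from bisect import bisect_left

def count_in(sorted_positions, start, end):
    """Number of tags with start <= position < end (input must be sorted)."""
    return bisect_left(sorted_positions, end) - bisect_left(sorted_positions, start)

def wtd_score(plus, minus, center, window):
    """Geometric mean of the expected-pattern counts (p1, n2) minus the
    arithmetic mean of the opposing counts (n1, p2)."""
    p1 = count_in(plus, center - window, center)    # + strand, upstream
    n1 = count_in(minus, center - window, center)   # - strand, upstream
    p2 = count_in(plus, center, center + window)    # + strand, downstream
    n2 = count_in(minus, center, center + window)   # - strand, downstream
    return math.sqrt(p1 * n2) - (n1 + p2) / 2.0

# Toy tags (hypothetical) around a binding position near 200.
plus_tags = sorted([160, 165, 170, 175, 180])
minus_tags = sorted([220, 225, 230, 235, 240])

score_at_site = wtd_score(plus_tags, minus_tags, center=200, window=50)
score_off_site = wtd_score(plus_tags, minus_tags, center=120, window=50)
print(score_at_site, score_off_site)
```

The geometric mean rewards positions where both strands contribute: if either p1 or n2 is zero, the first term vanishes, so a one-sided pile of tags does not score well.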
Figure 5. Accuracy of determined binding positions
a. Distribution of distances between high-confidence NRSF motif instances and locations of binding positions identified by different methods. The standard deviation of the resulting distribution (σ) is shown for each method. Only motifs containing a binding position within 100 bp were considered. b–d. The fraction of the identified binding positions within 10 bp of the motif position is shown for an increasing number of top binding positions identified by different methods. Only binding positions occurring within 300 bp of a sequence motif instance are included in the analysis. The median distance to the motif center was subtracted for each method to account for the non-central position of the sequence motif relative to the center of the protected binding region (see Methods). The MTC method achieves the highest accuracy for CTCF and STAT1; however, WTD gives more accurate positions for NRSF binding.
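The accuracy measure in panels b–d (median-centered distances, then the fraction within 10 bp) can be illustrated as follows. This is a sketch under stated assumptions: `offsets` is a hypothetical list of signed distances, in bp, from each called binding position to its nearest motif instance, and the function name is made up for the example.

```python
import statistics

def fraction_within(offsets, tolerance=10):
    """Subtract the median offset, then report the share of calls whose
    centered distance to the motif is at most `tolerance` bp."""
    median_offset = statistics.median(offsets)
    centered = [o - median_offset for o in offsets]
    hits = sum(1 for c in centered if abs(c) <= tolerance)
    return hits / len(centered)

# Hypothetical signed distances (bp) between called positions and motifs.
offsets = [12, 15, 18, 40, -20, 14, 16]

accuracy = fraction_within(offsets, tolerance=10)
spread = statistics.pstdev(offsets)   # population standard deviation of the distances
print(round(accuracy, 3), round(spread, 1))
```

Subtracting the median removes a systematic offset shared by all calls, so the fraction reflects scatter around the typical position rather than a constant displacement of the motif from the center of the protected region.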
Figure 6. Analysis of sequencing depth
a. Given the NRSF binding positions determined using the complete dataset (y-axis), the black curve shows the fraction of positions that can be predicted (within 50 bp) using smaller portions of the tag data (x-axis). All of the binding predictions are generated at an FDR of 0.01 using the WTD method. The curve does not reach a horizontal asymptote, indicating that the set of detected NRSF binding sites has not stabilized at the current sequencing depth. The additional curves limit the analysis to binding positions whose fold enrichment ratio over the background is significantly (P<0.05) higher than 7.5 (MSER: Minimal Saturated Enrichment Ratio, dashed line) and 30 (dotted line). The observed enrichment ratios are evaluated independently for each tag subsample (x-axis). b. Distribution of tag counts around high-confidence NRSF motif positions. Positions with zero tags were not included. c. The relationship between the MSER of the detected binding positions and sequencing depth (expressed as a fraction of the complete dataset). The dashed gray line shows a log-log model that can be used to estimate the sequencing depth required to saturate detection of binding positions with a lower fold-enrichment ratio. By that estimate, 1.2×10^6 more sequence tags would be necessary to saturate detection of binding positions that are two-fold enriched over background (MSER=2 corresponds to y=0, at which the red line crosses the x-axis: x=2.8×10^6).
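The log-log extrapolation in panel c amounts to fitting a straight line to log(MSER) against log(depth) and solving for the depth at a target MSER. The sketch below uses an ordinary least-squares fit on made-up numbers; the depths, MSER values and function names are illustrative assumptions, not data or code from the paper.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def depth_for_mser(depths, msers, target_mser):
    """Fit log10(MSER) = slope * log10(depth) + intercept, then solve
    for the depth at which the fitted line reaches target_mser."""
    log_depth = [math.log10(d) for d in depths]
    log_mser = [math.log10(m) for m in msers]
    slope, intercept = fit_line(log_depth, log_mser)
    return 10 ** ((math.log10(target_mser) - intercept) / slope)

# Made-up subsampling results: MSER falls as sequencing depth grows.
depths = [100, 400, 1600]      # tags in each subsample (arbitrary units)
msers = [10.0, 5.0, 2.5]       # MSER observed at each depth

needed = depth_for_mser(depths, msers, target_mser=2.0)
print(round(needed))           # depth at which the fitted line reaches MSER = 2
```

Because the fit is extrapolated beyond the observed depths, the estimate is only as reliable as the assumption that the log-log trend continues.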
Grants and funding
- UL1 RR024920/RR/NCRR NIH HHS/United States
- R01 GM082798/GM/NIGMS NIH HHS/United States
- U01 HG004258-01/HG/NHGRI NIH HHS/United States
- R01 GM082798-03/GM/NIGMS NIH HHS/United States
- UL1 RR024920-01/RR/NCRR NIH HHS/United States
- U01 HG004258/HG/NHGRI NIH HHS/United States