AREM: aligning short reads from ChIP-sequencing by expectation maximization
Daniel Newkirk et al. J Comput Biol. 2011 Nov.
Abstract
High-throughput sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) is widely used in characterizing genome-wide binding patterns of transcription factors, cofactors, chromatin modifiers, and other DNA binding proteins. A key step in ChIP-Seq data analysis is to map short reads from high-throughput sequencing to a reference genome and identify peak regions enriched with short reads. Although several methods have been proposed for ChIP-Seq analysis, most existing methods only consider reads that can be uniquely placed in the reference genome, and therefore have low power for detecting peaks located within repeat sequences. Here, we introduce a probabilistic approach for ChIP-Seq data analysis that utilizes all reads, providing a truly genome-wide view of binding patterns. Reads are modeled using a mixture model corresponding to K enriched regions and a null genomic background. We use maximum likelihood to estimate the locations of the enriched regions, and implement an expectation-maximization (E-M) algorithm, called AREM (aligning reads by expectation maximization), to update the alignment probabilities of each read to different genomic locations. We apply the algorithm to identify genome-wide binding events of two proteins: Rad21, a component of cohesin and a key factor involved in chromatid cohesion, and Srebp-1, a transcription factor important for lipid/cholesterol homeostasis. Using AREM, we were able to identify 19,935 Rad21 peaks and 1,748 Srebp-1 peaks in the mouse genome with high confidence, including 1,517 (7.6%) Rad21 peaks and 227 (13%) Srebp-1 peaks that were missed using only uniquely mapped reads. The open source implementation of our algorithm is available at http://sourceforge.net/projects/arem.
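The idea of updating alignment probabilities by expectation maximization can be illustrated with a toy sketch. This is not the AREM implementation: it assumes the genome is pre-divided into bins, that every candidate alignment of a read is equally good a priori, and that a flat pseudocount stands in for the null genomic background. The function name `em_allocate` and its parameters are hypothetical.

```python
from collections import defaultdict


def em_allocate(reads, n_iter=50, background=0.1):
    """Toy EM-style allocation of multireads (illustrative only, not AREM's code).

    reads: list of lists; each inner list holds the candidate bin ids
           to which one read can align.
    Returns, for each read, a list of alignment probabilities in the
    same order as its candidate bins.
    """
    # Start from uniform alignment probabilities for every read.
    probs = [[1.0 / len(cands)] * len(cands) for cands in reads]
    for _ in range(n_iter):
        # M-step: expected read count per bin, plus a flat pseudocount
        # that plays the role of a background component (an assumption).
        weight = defaultdict(lambda: background)
        for cands, p in zip(reads, probs):
            for bin_id, p_j in zip(cands, p):
                weight[bin_id] += p_j
        # E-step: re-weight each read's candidate alignments in
        # proportion to the current enrichment of the bins they hit.
        probs = []
        for cands in reads:
            raw = [weight[bin_id] for bin_id in cands]
            total = sum(raw)
            probs.append([r / total for r in raw])
    return probs


# Three reads map uniquely to bin 0; one multiread maps to bins 0 and 1.
example = em_allocate([[0], [0], [0], [0, 1]])
```

In this example the multiread is assigned mostly to bin 0, because the uniquely mapped reads make that bin look enriched. A real analysis would also weigh alignment quality and model peak shape, which this sketch leaves out.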
Figures
FIG. 1.
(A) AREM workflow diagram. (B–E) De novo discovery of motifs. From top to bottom: (B) CTCF in MACS peaks from uniquely mapping reads, (C) CTCF in AREM peaks with multireads, (D) Srebp-1 in MACS peaks from uniquely mapping reads, and (E) Srebp-1 in AREM peaks with multireads.
FIG. 2.
Graphs displaying varying parameters and number of possible alignments per read. (A) Total number of peaks discovered. (B) Percentage of peaks with repetitive sequences. (C) False discovery rate. (D) Percentage of peaks with motif.
Similar articles
- Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data. Chung D, Kuan PF, Li B, Sanalkumar R, Liang K, Bresnick EH, Dewey C, Keleş S. PLoS Comput Biol. 2011 Jul;7(7):e1002111. doi: 10.1371/journal.pcbi.1002111. Epub 2011 Jul 14. PMID: 21779159 Free PMC article.
- CNV-guided multi-read allocation for ChIP-seq. Zhang Q, Keleş S. Bioinformatics. 2014 Oct 15;30(20):2860-7. doi: 10.1093/bioinformatics/btu402. Epub 2014 Jun 24. PMID: 24966364 Free PMC article.
- PICS: probabilistic inference for ChIP-seq. Zhang X, Robertson G, Krzywinski M, Ning K, Droit A, Jones S, Gottardo R. Biometrics. 2011 Mar;67(1):151-63. doi: 10.1111/j.1541-0420.2010.01441.x. PMID: 20528864
- Computation for ChIP-seq and RNA-seq studies. Pepke S, Wold B, Mortazavi A. Nat Methods. 2009 Nov;6(11 Suppl):S22-32. doi: 10.1038/nmeth.1371. PMID: 19844228 Free PMC article. Review.
- Handling multi-mapped reads in RNA-seq. Deschamps-Francoeur G, Simoneau J, Scott MS. Comput Struct Biotechnol J. 2020 Jun 12;18:1569-1576. doi: 10.1016/j.csbj.2020.06.014. eCollection 2020. PMID: 32637053 Free PMC article. Review.
Cited by
- Disregarding multimappers leads to biases in the functional assessment of NGS data. Almeida da Paz M, Warger S, Taher L. BMC Genomics. 2024 May 8;25(1):455. doi: 10.1186/s12864-024-10344-9. PMID: 38720252 Free PMC article.
- Bioinformatics Approaches for Determining the Functional Impact of Repetitive Elements on Non-coding RNAs. Zeng C, Takeda A, Sekine K, Osato N, Fukunaga T, Hamada M. Methods Mol Biol. 2022;2509:315-340. doi: 10.1007/978-1-0716-2380-0_19. PMID: 35796972
- Sequence deeper without sequencing more: Bayesian resolution of ambiguously mapped reads. Shah RN, Ruthenburg AJ. PLoS Comput Biol. 2021 Apr 19;17(4):e1008926. doi: 10.1371/journal.pcbi.1008926. eCollection 2021 Apr. PMID: 33872311 Free PMC article.
- CSA: a web service for the complete process of ChIP-Seq analysis. Li M, Tang L, Wu FX, Pan Y, Wang J. BMC Bioinformatics. 2019 Dec 24;20(Suppl 15):515. doi: 10.1186/s12859-019-3090-0. PMID: 31874601 Free PMC article.
- Fast and efficient short read mapping based on a succinct hash index. Zhang H, Chan Y, Fan K, Schmidt B, Liu W. BMC Bioinformatics. 2018 Mar 9;19(1):92. doi: 10.1186/s12859-018-2094-5. PMID: 29523083 Free PMC article.