Joint modelling of ChIP-seq data via a Markov random field model

2013

Chromatin ImmunoPrecipitation-sequencing (ChIP-seq) experiments have now become routine in biology for the detection of protein binding sites. In this paper, we present a Markov random field model for the joint analysis of multiple ChIP-seq experiments. The proposed model naturally accounts for spatial dependencies in the data, by assuming first order Markov properties, and for the large proportion of zero counts, by using zero-inflated mixture distributions. In contrast to all other available implementations, the model allows for the joint modelling of multiple experiments, by incorporating key aspects of the experimental design. In particular, the model uses the information about replicates and about the different antibodies used in the experiments. An extensive simulation study shows a lower false non-discovery rate for the proposed method, compared to existing methods, at the same false discovery rate. Finally, we present an analysis on real data for the detection of histone modifications of two chromatin modifiers from eight ChIP-seq experiments, including technical replicates with different IP efficiencies.
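To make the zero-inflated mixture idea concrete, here is a minimal sketch of how a zero-inflated Poisson background component could be combined with a Poisson enriched component to score a single bin. It is not the authors' Markov random field model: it treats bins as independent and ignores the spatial dependence, replicates and antibody structure described above. All function names and parameter values (`pi_zero`, `lam_bg`, `lam_enr`, `prior_enriched`) are hypothetical and chosen for illustration only.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of observing count k with mean lam (lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def zip_pmf(k: int, pi_zero: float, lam: float) -> float:
    """Zero-inflated Poisson: extra mass pi_zero at zero, Poisson otherwise."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base


def enrichment_posterior(k: int, prior_enriched: float = 0.1,
                         pi_zero: float = 0.6, lam_bg: float = 1.0,
                         lam_enr: float = 8.0) -> float:
    """Posterior probability that a bin with count k is enriched.

    Assumes independent bins (no Markov structure); parameters are
    illustrative values, not estimates from any data set.
    """
    p_bg = (1.0 - prior_enriched) * zip_pmf(k, pi_zero, lam_bg)
    p_enr = prior_enriched * poisson_pmf(k, lam_enr)
    return p_enr / (p_bg + p_enr)
```

Under these assumed parameters, a bin with a count of zero receives a posterior close to 0 and a bin with a count of 10 receives a posterior close to 1. A Markov random field would additionally let the state of neighbouring bins shift these probabilities.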

Hierarchical hidden Markov model with application to joint analysis of ChIP-chip and ChIP-seq data

Bioinformatics, 2009

Motivation: Chromatin immunoprecipitation (ChIP) followed by array hybridization, or ChIP-chip, is a powerful approach for identifying transcription factor binding sites (TFBS) and has been widely used. Recently, massively parallel sequencing coupled with ChIP experiments (ChIP-seq) has been increasingly used as an alternative to ChIP-chip, offering cost-effective genome-wide coverage and resolution up to a single base pair. For many well-studied transcription factors, both ChIP-seq and ChIP-chip experiments have been applied and their data are publicly available. Previous analyses have revealed substantial technology-specific binding signals despite strong correlation between the two sets of results. Therefore, it is of interest to see whether the two data sources can be combined to enhance the detection of TFBS. Results: In this work, a hierarchical hidden Markov model (HHMM) is proposed for combining data from ChIP-seq and ChIP-chip. In the HHMM, inference results from individual HMMs in ChIP-seq and ChIP-chip experiments are summarized by a higher-level HMM. Simulation studies show the advantage of the HHMM when data from both technologies co-exist. Analysis of two well-studied transcription factors, NRSF and CTCF, also suggests that the HHMM yields improved TFBS identification in comparison to analyses using individual data sources or a simple merger of the two. Availability: Source code for the software ChIPmeta is freely available for download at

Mixture models for analysing high throughput sequencing data

2011

The goal of my thesis is to develop methods and software for analysing high-throughput sequencing data, emphasizing sonicated ChIP-seq. For this goal, we developed a few variants of mixture models for genome-wide profiling of transcription factor binding sites and nucleosome positions. Our methods have been implemented in Bioconductor packages, which are freely available to other researchers. For profiling transcription factor binding sites, we developed a method, PICS, and implemented it in a Bioconductor package. We used a simulation study to confirm that PICS compares favourably to rival methods, such as MACS, QuEST, CisGenome, and USeq. Using published GABP and FOXA1 data from human cell lines, we then show that PICS-predicted binding sites were more consistent with computationally predicted binding motifs than those of the alternative methods.

Genome-Wide Localization of Protein-DNA Binding and Histone Modification by a Bayesian Change-Point Method with ChIP-seq Data

PLoS Computational Biology, 2012

Next-generation sequencing (NGS) technologies have matured considerably since their introduction and a focus has been placed on developing sophisticated analytical tools to deal with the amassing volumes of data. Chromatin immunoprecipitation sequencing (ChIP-seq), a major application of NGS, is a widely adopted technique for examining protein-DNA interactions and is commonly used to investigate epigenetic signatures of diffuse histone marks. These datasets have notoriously high variance and subtle levels of enrichment across large expanses, making them exceedingly difficult to define. Windows-based, heuristic models and finite-state hidden Markov models (HMMs) have been used with some success in analyzing ChIP-seq data but with lingering limitations. To improve the ability to detect broad regions of enrichment, we developed a stochastic Bayesian Change-Point (BCP) method, which addresses some of these unresolved issues. BCP makes use of recent advances in infinite-state HMMs by obtaining explicit formulas for posterior means of read densities. These posterior means can be used to categorize the genome into enriched and unenriched segments, as is customarily done, or examined for more detailed relationships since the underlying subpeaks are preserved rather than simplified into a binary classification. BCP performs a near exhaustive search of all possible change points between different posterior means at high-resolution to minimize the subjectivity of window sizes and is computationally efficient, due to a speed-up algorithm and the explicit formulas it employs. In the absence of a well-established "gold standard" for diffuse histone mark enrichment, we corroborated BCP's island detection accuracy and reproducibility using various forms of empirical evidence. We show that BCP is especially suited for analysis of diffuse histone ChIP-seq data but also effective in analyzing punctate transcription factor ChIP datasets, making it widely applicable for numerous experiment types.

PICS: Probabilistic Inference for ChIP-seq

Biometrics, 2011

ChIP-seq, which combines chromatin immunoprecipitation with massively parallel short-read sequencing, can profile in vivo genome-wide transcription factor-DNA association with higher sensitivity, specificity and spatial resolution than ChIP-chip. While it presents new opportunities for research, ChIP-seq poses new challenges for statistical analysis that derive from the complexity of the biological systems characterized and the variability and biases in its digital sequence data. We propose a method called PICS (Probabilistic Inference for ChIP-seq) for extracting information from ChIP-seq aligned-read data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. Its per-event fragment length estimates also allow it to remove from analysis regions that have atypical lengths. PICS uses pre-calculated, whole-genome read mappability profiles and a truncated t-distribution to adjust binding event models for reads that are missing due to local genome repetitiveness. It estimates uncertainties in model parameters that can be used to define confidence regions on binding event locations and to filter estimates. Finally, PICS calculates a per-event enrichment score relative to a control sample, and can use a control sample to estimate a false discovery rate. We compared PICS to the alternative methods MACS, QuEST, and CisGenome, using published GABP and FOXA1 data sets from human cell lines, and found that PICS' predicted binding sites were more consistent with computationally predicted binding motifs.

Probabilistic partitioning methods to find significant patterns in ChIP-Seq data

Bioinformatics (Oxford, England), 2014

We have witnessed an enormous increase in ChIP-Seq data for histone modifications in the past few years. Discovering significant patterns in these data is an important problem for understanding biological mechanisms. We propose probabilistic partitioning methods to discover significant patterns in ChIP-Seq data. Our methods take into account signal magnitude, shape, strand orientation and shifts. We compare our methods with some current methods and demonstrate significant improvements, especially with sparse data. Besides pattern discovery and classification, probabilistic partitioning can serve other purposes in ChIP-Seq data analysis. Specifically, we exemplify its merits in the context of peak finding and partitioning of nucleosome positioning patterns in human promoters. The software and code are available in the supplementary material. Supplementary data are available at Bioinformatics online.

Model-Free Inference for ChIP-Seq Data

Journal of Data Mining in Genomics & Proteomics, 2014

Due to its higher resolution mapping and stronger ChIP enrichment signals, ChIP-seq tends to replace ChIP-chip technology in studying genome-wide protein-DNA interactions, while the massive digital ChIP-seq data present new challenges to statisticians. To date, most methods proposed in the literature for ChIP-seq data analysis are model based; however, finding a single model workable for all datasets is impossible, given the complexity of biological systems and variations generated in the sequencing process. In this paper, we present a model-free approach, the so-called MICS (Model-free Inference for ChIP-Seq), for ChIP-seq data analysis. MICS has a few advantages over the existing methods. Firstly, MICS avoids assumptions about the data distribution, and thus it maintains high power even when model assumptions for the data are violated. Secondly, MICS employs a simulation-based method for estimating the false discovery rate. Since the simulation-based method works independently of ChIP samples, MICS performs robustly across a variety of ChIP samples; it can produce accurate identification of peak regions, even for those where the enrichment is weak. Thirdly, MICS is computationally very efficient, taking only a few seconds on a personal computer for a reasonably large dataset. In this paper, we also present a simple semi-empirical method for simulating ChIP-seq data, which allows a better assessment of the performance of different approaches for ChIP-seq data analysis. MICS is compared with several existing methods, including MACS, CCAT, PICS, BayesPeak and QuEST, based on real and simulated datasets. The numerical results indicate that MICS can outperform others. Availability: An R package called MICS is available at http://www.stat.tamu.edu/~mqwu.
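The general idea of a simulation-based false discovery rate estimate can be sketched as follows. This is a generic empirical estimator, not the MICS procedure itself, which the abstract does not specify: the function name `empirical_fdr` and the assumption that a pool of scores simulated under the null is already available are both hypothetical.

```python
def empirical_fdr(observed, null_scores, threshold):
    """Estimate the FDR of calling every observed score >= threshold.

    observed    -- scores for candidate regions in the ChIP sample
    null_scores -- scores simulated under the null (assumed given)
    The expected number of false calls is the null tail probability
    scaled to the number of observed regions.
    """
    n_called = sum(s >= threshold for s in observed)
    if n_called == 0:
        return 0.0  # nothing called, so no false discoveries
    null_tail = sum(s >= threshold for s in null_scores) / len(null_scores)
    expected_false = null_tail * len(observed)
    return min(1.0, expected_false / n_called)
```

For example, with six observed scores of which three exceed the threshold, and one in ten null scores exceeding it, the estimate is (0.1 × 6) / 3 = 0.2.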

ChIP-BIT: Bayesian inference of target genes using a novel joint probabilistic model of ChIP-seq profiles

Nucleic acids research, 2015

Chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq) has greatly improved the reliability with which transcription factor binding sites (TFBSs) can be identified from genome-wide profiling studies. Many computational tools have been developed to detect binding events or peaks; however, the robust detection of weak binding events remains a challenge for current peak calling tools. We have developed a novel Bayesian approach (ChIP-BIT) to reliably detect TFBSs and their target genes by jointly modeling binding signal intensities and binding locations of TFBSs. Specifically, a Gaussian mixture model is used to capture both binding and background signals in sample data. As a unique feature of ChIP-BIT, background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. Extensive simulation studies showed a significantly improved performance of ChIP-BIT in target gene prediction, particularly for detecting weak binding sig...
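As an illustration of how a Gaussian mixture can separate binding from background signal, the sketch below fits a plain two-component one-dimensional mixture by expectation-maximisation. It is a textbook EM routine, not the ChIP-BIT model: it does not estimate a local background from input data and does not model binding locations. The function names and the initialisation scheme are assumptions made for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(xs, n_iter=50):
    """EM for a 1-D mixture: component 0 = background, 1 = binding.

    Returns (weights, means, sigmas, posterior of component 1 per point).
    """
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]                        # crude start at the extremes
    sigma = [(hi - lo) / 4 or 1.0] * 2   # fall back to 1.0 if all equal
    weight = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [weight[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(2)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and standard deviations
        for j in range(2):
            nj = sum(r[j] for r in resp) or 1e-300
            weight[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigma[j] = max(math.sqrt(var), 1e-3)  # guard against collapse
    return weight, mu, sigma, [r[1] for r in resp]
```

The returned per-point posterior for component 1 plays the role of a binding probability. Thresholding it gives a simple binding call under this simplified model.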

Zero-Inflated Models to Identify Transcription Factor Binding Sites in ChIP-seq Experiments

2015

Sameera Dhananjaya Viswakula, Old Dominion University, 2015. Director: Dr. Norou Diawara. It is essential to determine the protein-DNA binding sites to understand many biological processes. A transcription factor is a particular type of protein that binds to DNA and controls gene regulation in living organisms. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is considered the gold standard in locating these binding sites, and programs used to identify DNA-transcription factor binding sites are known as peak-callers. ChIP-seq data are known to exhibit considerable background noise and other biases. In this study, we propose a negative binomial model (NB), a zero-inflated Poisson model (ZIP) and a zero-inflated negative binomial model (ZINB) for peak-calling. Using real ChIP-seq datasets, we show that the ZINB model is the best model for ChIP-seq data. Then we incorp...
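The count models named above can be written down compactly. The sketch below gives the negative binomial and zero-inflated negative binomial probability mass functions in the mean/dispersion parameterisation (variance = mean + mean²/size), plus a helper for comparing log-likelihoods. The parameterisation, function names and parameter values are assumptions for illustration; the thesis may use a different form, and no parameter fitting is shown.

```python
import math


def nb_pmf(k, mean, size):
    """Negative binomial pmf with given mean (> 0) and dispersion size (> 0)."""
    p = size / (size + mean)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def zinb_pmf(k, pi_zero, mean, size):
    """Zero-inflated NB: extra mass pi_zero at zero, NB otherwise."""
    base = (1.0 - pi_zero) * nb_pmf(k, mean, size)
    return pi_zero + base if k == 0 else base


def log_likelihood(counts, pmf):
    """Sum of log-probabilities of the counts under a one-argument pmf."""
    return sum(math.log(pmf(k)) for k in counts)
```

With counts dominated by zeros, `log_likelihood(counts, lambda k: zinb_pmf(k, 0.3, 3.0, 2.0))` can be compared against the same call with `nb_pmf`. In practice the parameters would be estimated first and the models compared with a penalised criterion such as AIC.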

ChIP-GMM: A Gaussian Mixture Model for Inferring Binding Regions in ChIP-seq Profiles

2017

Chromatin immunoprecipitation (ChIP), followed by high-throughput DNA sequencing (ChIP-seq), enables genome-wide mapping of transcription-factor binding sites (TFBS). Several transcription factors (TFs) have been known to be able to differentiate tumor sub-types in diseases like cancer. For instance, the Luminal A and Luminal B sub-types of breast cancer tumors are high in estrogen receptor (ER) while human epidermal growth factor receptor 2 (HER2) tumors are high in HER2 protein. The accurate mapping of the DNA-protein loci is important in determining the causality of epigenetic regulation of gene expression under both normal and disease conditions in order to promote the development of targeted drug therapy. In this paper, we leverage the popular variational Bayes framework for Gaussian mixture models to demonstrate its effectiveness in identifying transcription-factor binding sites (TFBS) and common regions co-regulated by multiple TFs. We show that our method performs favorably w...