Association testing of bisulfite-sequencing methylation data via a Laplace approximation
Omer Weissbrod et al. Bioinformatics. 2017.
Abstract
Motivation: Epigenome-wide association studies can provide novel insights into the regulation of genes involved in traits and diseases. The rapid emergence of bisulfite-sequencing technologies enables performing such genome-wide studies at the resolution of single nucleotides. However, analysis of data produced by bisulfite-sequencing poses statistical challenges owing to low and uneven sequencing depth, as well as the presence of confounding factors. The recently introduced Mixed model Association for Count data via data AUgmentation (MACAU) can address these challenges via a generalized linear mixed model when confounding can be encoded via a single variance component. However, MACAU cannot be used in the presence of multiple variance components. Additionally, MACAU uses a computationally expensive Markov Chain Monte Carlo (MCMC) procedure, which cannot directly approximate the model likelihood.
Results: We present a new method, Mixed model Association via a Laplace ApproXimation (MALAX), that is more computationally efficient than MACAU and can model multiple variance components. MALAX uses a Laplace approximation rather than an MCMC-based approximation, which enables it to approximate the model likelihood directly. Through an extensive analysis of simulated and real data, we demonstrate that MALAX successfully addresses the statistical challenges introduced by bisulfite sequencing while controlling for complex sources of confounding, and can be over 50% faster than the state of the art.
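As a rough illustration of what a Laplace approximation of a count-data likelihood involves, the sketch below treats a deliberately simplified case that is not the MALAX model itself: a single site and a single sample with `y` methylated reads out of `r` total reads, a fixed linear predictor `mu`, and one scalar Gaussian random effect with variance `s2`. All function and parameter names are assumptions made for this example. The random effect is integrated out by fitting a Gaussian at the mode of the log joint density, and the result is compared with brute-force numerical quadrature.

```python
import math


def log1pexp(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def log_joint(u, y, r, mu, s2):
    """log Binomial(y | r, sigmoid(mu + u)) + log Normal(u | 0, s2)."""
    eta = mu + u
    log_binom = (math.lgamma(r + 1) - math.lgamma(y + 1) - math.lgamma(r - y + 1)
                 + y * eta - r * log1pexp(eta))
    log_prior = -0.5 * math.log(2.0 * math.pi * s2) - u * u / (2.0 * s2)
    return log_binom + log_prior


def laplace_log_marginal(y, r, mu, s2, max_iter=100, tol=1e-12):
    """Laplace approximation of log of the integral of exp(log_joint) over u."""
    u = 0.0
    for _ in range(max_iter):
        p = sigmoid(mu + u)
        grad = y - r * p - u / s2                  # first derivative of log_joint
        neg_hess = r * p * (1.0 - p) + 1.0 / s2    # minus the second derivative (> 0)
        step = grad / neg_hess                     # Newton step toward the mode
        u += step
        if abs(step) < tol:
            break
    p = sigmoid(mu + u)
    neg_hess = r * p * (1.0 - p) + 1.0 / s2
    # Gaussian integral around the mode: exp(h(u*)) * sqrt(2*pi / -h''(u*))
    return log_joint(u, y, r, mu, s2) + 0.5 * math.log(2.0 * math.pi / neg_hess)


def quadrature_log_marginal(y, r, mu, s2, half_width=10.0, n=20000):
    """Reference value from the trapezoid rule over +/- half_width prior SDs."""
    sd = math.sqrt(s2)
    a, b = -half_width * sd, half_width * sd
    h = (b - a) / n
    total = 0.5 * (math.exp(log_joint(a, y, r, mu, s2))
                   + math.exp(log_joint(b, y, r, mu, s2)))
    for i in range(1, n):
        total += math.exp(log_joint(a + i * h, y, r, mu, s2))
    return math.log(total * h)


if __name__ == "__main__":
    # Hypothetical read counts: 7 methylated reads out of 20.
    print(laplace_log_marginal(7, 20, 0.0, 1.0))
    print(quadrature_log_marginal(7, 20, 0.0, 1.0))
```

In this one-dimensional toy case quadrature is cheap, so the approximation buys nothing. The appeal of the Laplace approach is that the same mode-plus-curvature recipe gives a direct likelihood value when the random effects are high-dimensional and correlated across samples, where quadrature is infeasible.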
Availability and implementation: The full source code of MALAX is available at https://github.com/omerwe/MALAX .
Contact: omerw@cs.technion.ac.il or ehalperin@cs.ucla.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Figures
Fig. 1
QQ plots of the evaluated methods, computed using only sites not directly associated with the phenotype, under simulated datasets with population structure and with various proportions of differentially methylated (DM) sites. Each figure aggregates the results of 10 simulated datasets. The 95% CI of the expected null distribution is shaded in gray. The mean and SD of the genomic control inflation factor of each method are shown next to its name. All methods suffer from some degree of inflation in the presence of severe confounding, but MALAX-2 always controls type I error as well as or better than the alternative methods. The three methods that do not control for confounding due to methylation similarity become increasingly less calibrated as the percentage of DM sites increases.
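The genomic control inflation factor mentioned in this caption is conventionally defined as the median of the observed 1-degree-of-freedom chi-squared statistics divided by the median of the null chi-squared distribution (about 0.455). The sketch below uses that standard definition. Whether the authors computed it in exactly this way is an assumption, and the function name `genomic_inflation` is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Genomic control inflation factor from P-values in (0, 1].

    A 1-df chi-squared statistic is the square of a standard normal
    quantile, so each P-value p maps to inv_cdf(p / 2) ** 2.
    A P-value of exactly 0 is not supported (inv_cdf raises an error).
    """
    std_normal = NormalDist()
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    chi2_null_median = std_normal.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2_stats) / chi2_null_median


if __name__ == "__main__":
    n = 1001
    uniform = [(i + 0.5) / n for i in range(n)]   # well-calibrated P-values
    inflated = [p ** 2 for p in uniform]          # P-values skewed toward 0
    print(genomic_inflation(uniform))
    print(genomic_inflation(inflated))
```

A value near 1 indicates well-calibrated test statistics, while values above 1 indicate inflation of the kind the QQ plots display.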
Fig. 2
The detection power of the evaluated methods under simulated datasets with various proportions of DM sites. All results are averaged over 10 simulated datasets. The three methods that control only for genetic confounding are more powerful than the others in the absence of DM sites, but MALAX-2 and MALAX-1m become increasingly more powerful as the percentage of DM sites increases.
Fig. 3
Box plots of the running times of the evaluated methods on simulated datasets with 10 000 sites and with varying proportions of DM sites and sample sizes (n). The flat boxes at the bottom represent the BB method.
Fig. 4
The correlation between the _P_-values computed by MALAX-1g and by MACAU across simulated datasets. The sites are sorted according to the _P_-values computed by MALAX-1g. The values of ρ shown are the Pearson correlations between the _P_-values (in log scale).
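A Pearson correlation between _P_-values in log scale can be computed as in the following minimal sketch. The function name `pearson_log_p` and the example values are hypothetical, and base-10 logarithms are an arbitrary choice: the caption does not state the base, and the Pearson correlation is unaffected by it.

```python
import math


def pearson_log_p(p_a, p_b):
    """Pearson correlation between two lists of P-values after a log10 transform."""
    if len(p_a) != len(p_b) or len(p_a) < 2:
        raise ValueError("need two equally long lists with at least 2 P-values")
    xs = [math.log10(p) for p in p_a]
    ys = [math.log10(p) for p in p_b]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    method_a = [1e-1, 1e-2, 1e-3, 1e-4]      # made-up P-values from one method
    method_b = [p ** 2 for p in method_a]    # log P-values are exactly doubled
    print(pearson_log_p(method_a, method_b))
```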
Fig. 5
A QQ plot of the _P_-values obtained by the evaluated methods in the analysis of the baboon data.
Fig. 6
A Manhattan plot of the _P_-values obtained by the evaluated methods in the analysis of the baboon data. The axis labels for several chromosomes are omitted to improve clarity.