Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions

Sarah M Urbut et al. Nat Genet. 2019 Jan.

Abstract

We introduce new statistical methods for analyzing genomic data sets that measure many effects in many conditions (for example, gene expression changes under many treatments). These new methods improve on existing methods by allowing for arbitrary correlations in effect sizes among conditions. This flexible approach increases power, improves effect estimates and allows for more quantitative assessments of effect-size heterogeneity compared to simple shared or condition-specific assessments. We illustrate these features through an analysis of locally acting variants associated with gene expression (cis expression quantitative trait loci (eQTLs)) in 44 human tissues. Our analysis identifies more eQTLs than existing approaches, consistent with improved power. We show that although genetic effects on expression are extensively shared among tissues, effect sizes can still vary greatly among tissues. Some shared eQTLs show stronger effects in subsets of biologically related tissues (for example, brain-related tissues), or in only one tissue (for example, testis). Our methods are widely applicable, computationally tractable for many conditions and available online.

Conflict of interest statement

Competing interests: The authors declare no competing financial interests.

Figures

Figure 1. Overview of fitting procedure in mash, which estimates the multivariate distribution of effects present in the data.

The data consist of a matrix of summary data (e.g., Z scores) for a large number of units (e.g., gene-SNP pairs) in multiple conditions (e.g., tissues), and, optionally, their standard errors (not shown). Color indicates the sign (positive, negative) of an effect (blue, yellow) or covariance (blue, red), with shading intensity indicating size. After selecting rows containing the strongest signals (1), in this example the top 6 rows, we apply covariance estimation techniques to estimate candidate “data-driven” covariance matrices U_k (2). To these, we add “canonical” covariance matrices U_k, including the identity matrix, and matrices representing condition-specific effects. Each covariance matrix represents a pattern of effects that may occur in the data. We scale each covariance matrix by a grid of scaling factors, ω_l, varying from “very small” to “very large”, which allows for a priori effect sizes to range from very small to very large. Using the entire data set, we compute maximum-likelihood estimates of the weights (relative frequencies) π_{k,l} for each (U_k, ω_l) combination (3), thereby learning how commonly each pattern-effect size combination occurs in the data. Finally, we compute posterior statistics using the fitted model (4); the posterior mean estimates shown in the bottom-right illustrate that effect estimates are “shrunk” adaptively using the fitted mash model.
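The four steps in this caption can be sketched in NumPy. This is a minimal illustration, not the mash implementation. It rests on several assumptions: the observations are Z scores with identity error covariance V, the data-driven covariance is a crude empirical covariance of the strongest rows (a stand-in for the covariance estimation techniques mentioned above), the weights are fitted with a plain EM loop, and the null (zero-effect) component is omitted. All function names and the toy data are hypothetical.

```python
import numpy as np

def log_mvn_zero_mean(X, cov):
    """Log density of each row of X under N(0, cov)."""
    R = X.shape[1]
    _, logdet = np.linalg.slogdet(cov)
    quad = np.sum(X * np.linalg.solve(cov, X.T).T, axis=1)
    return -0.5 * (R * np.log(2 * np.pi) + logdet + quad)

def build_prior_covs(Us, omegas):
    """Scale every covariance pattern U_k by every grid value omega_l."""
    return [w * U for U in Us for w in omegas]

def fit_mixture_weights(Bhat, prior_covs, V, n_iter=200):
    """EM for the weights pi, with the component covariances held fixed."""
    # Marginally, each row is N(0, omega_l * U_k + V) under component (k, l).
    loglik = np.column_stack(
        [log_mvn_zero_mean(Bhat, S + V) for S in prior_covs])
    # Row-wise rescaling avoids underflow and cancels in the responsibilities.
    lik = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    pi = np.full(len(prior_covs), 1.0 / len(prior_covs))
    for _ in range(n_iter):
        resp = lik * pi
        resp /= resp.sum(axis=1, keepdims=True)
        pi = resp.mean(axis=0)
    return pi, lik

def posterior_mean(Bhat, prior_covs, V, pi, lik):
    """Responsibility-weighted average of per-component posterior means."""
    resp = lik * pi
    resp /= resp.sum(axis=1, keepdims=True)
    post = np.zeros_like(Bhat)
    for c, S in enumerate(prior_covs):
        # E[b | bhat, component c] = S (S + V)^{-1} bhat
        shrink = S @ np.linalg.inv(S + V)
        post += resp[:, [c]] * (Bhat @ shrink.T)
    return post

# Toy data (hypothetical): 500 units, 4 conditions, 100 units with shared effects.
rng = np.random.default_rng(0)
J, R = 500, 4
V = np.eye(R)
true_b = np.zeros((J, R))
true_b[:100] = 3.0 * rng.normal(size=(100, 1))  # same effect in all conditions
Bhat = true_b + rng.normal(size=(J, R))

# (1) strongest rows, (2) crude data-driven covariance plus canonical matrices
top = Bhat[np.argsort(-np.abs(Bhat).max(axis=1))[:50]]
U_data = top.T @ top / len(top)
canonical = [np.eye(R), np.ones((R, R))] + [np.outer(e, e) for e in np.eye(R)]
omegas = [0.1, 0.5, 1.0, 2.0, 4.0]
prior_covs = build_prior_covs([U_data] + canonical, omegas)

# (3) weights fitted on the entire data set, (4) posterior means
pi, lik = fit_mixture_weights(Bhat, prior_covs, V)
post = posterior_mean(Bhat, prior_covs, V, pi, lik)
```

Because each per-component posterior mean multiplies the observation by S (S + V)^{-1}, every row of `post` is pulled toward zero relative to the corresponding row of `Bhat`, and the amount of shrinkage depends on which patterns receive weight for that row.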

Figure 2. Comparison of methods on simulated data.

Results are shown for two simulation scenarios: “shared, structured effects”, in which the non-zero effects are shared among conditions in complex, structured ways similar to patterns of eQTL sharing in the GTEx data; and “shared, unstructured effects”, in which the non-zero effects are shared among conditions but independent. Each simulation result involves n = 20,000 independent units observed in R = 44 conditions, with 400 non-null units. Panels a and b show ROC curves for detecting significant units (n = 20,000 discoveries), based on unit-specific measures of significance (as in traditional meta-analyses). Panels c and d show ROC curves for detecting significant effects (n × R = 44 × 20,000 = 880,000 discoveries), which requires effect-specific measures of significance. In c and d, we also require the estimated sign (+/–) of each significant effect to be correct for it to be considered a “true positive”. Panels e and f summarize the error in the estimated effects relative to the error from a simple condition-by-condition analysis (Relative Root Mean Squared Error, or RRMSE for short). Our new method (mash) outperformed other methods, particularly in the “shared, structured effects” scenario.
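The RRMSE summary in panels e and f can be written down directly. The sketch below assumes RRMSE is the RMSE of a method's estimates divided by the RMSE of the condition-by-condition estimates, both taken over all units and conditions; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def rrmse(estimates, baseline, truth):
    """RMSE of `estimates` relative to the RMSE of `baseline`.

    Values below 1 mean the method is more accurate than the baseline.
    """
    rmse_est = np.sqrt(np.mean((estimates - truth) ** 2))
    rmse_base = np.sqrt(np.mean((baseline - truth) ** 2))
    return rmse_est / rmse_base

# Toy check (hypothetical data): all true effects are zero, so halving the
# noisy baseline estimates halves the error.
rng = np.random.default_rng(1)
truth = np.zeros((1000, 44))
baseline = truth + rng.normal(size=truth.shape)
halved = 0.5 * baseline
ratio = rrmse(halved, baseline, truth)
```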

Figure 3. Summary of primary patterns identified by mash in GTEx data.

Shown are the heatmap of the correlation matrix (a) and bar plots of the first three eigenvectors (b, c, d) of the covariance matrix U_k corresponding to the dominant mixture component identified by mash (n = 16,069 independent gene-SNP pairs). This component accounts for 34% of all weight in the GTEx data. Tissues are color-coded as indicated by the tissue labels in the heatmap. The first eigenvector (b) reflects broad sharing among all tissues, with all effects in the same direction; the second eigenvector (c) captures differences between brain (and, to a lesser extent, testis and pituitary) and other tissues; the third eigenvector (d) primarily captures effects that are stronger in whole blood.
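The two summaries in this figure, a correlation matrix and leading eigenvectors of a covariance matrix, can be computed as follows. This is a sketch on a made-up 4 × 4 covariance matrix, not the GTEx component; the function names and the sign convention (largest-magnitude entry made positive) are assumptions chosen for display.

```python
import numpy as np

def cov_to_corr(U):
    """Convert a covariance matrix to a correlation matrix."""
    d = np.sqrt(np.diag(U))
    return U / np.outer(d, d)

def top_eigenvectors(U, k=3):
    """Return the k largest eigenvalues, their eigenvectors (as columns),
    and the fraction of total variance each one explains."""
    vals, vecs = np.linalg.eigh(U)            # eigh: for symmetric matrices
    order = np.argsort(vals)[::-1][:k]        # largest eigenvalues first
    vals_k, vecs_k = vals[order], vecs[:, order]
    # Eigenvectors are defined up to sign; flip each one so its
    # largest-magnitude entry is positive (display convention only).
    for i in range(vecs_k.shape[1]):
        j = np.argmax(np.abs(vecs_k[:, i]))
        if vecs_k[j, i] < 0:
            vecs_k[:, i] *= -1
    return vals_k, vecs_k, vals_k / vals.sum()

# Hypothetical covariance with strong sharing across four conditions.
U_toy = 0.8 * np.ones((4, 4)) + 0.2 * np.eye(4)
corr = cov_to_corr(U_toy)
vals_k, vecs_k, frac = top_eigenvectors(U_toy, k=3)
```

For this toy matrix the first eigenvector has equal positive loadings on all four conditions, which is the kind of "broad sharing, same direction" pattern described for panel b.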

Figure 4. Examples illustrating that mash uses learned patterns of sharing to inform effect estimates in the GTEx data.

In panel a, each colored dot shows the original (“raw”) effect estimate for a single tissue (color-coded as in Fig. 3), with grey bars indicating ±2 standard errors. These are the data provided to mash. Panel b shows the corresponding mash estimates. In each case, mash combines information across all tissues, using the background information (patterns of sharing) learned from data on all eQTLs to produce more precise estimates. Panel c shows, for contrast, the corresponding estimates from mash-bmalite, which, due to its more restricted model, fails to capture features clearly apparent in the original data, such as strong brain effects in MCPH1. In b and c, colored dots are posterior means, and error bars depict ±2 posterior standard deviations. For all estimates, n = 83–430 individuals, depending on the tissue (Supplementary Table 3).

Figure 5. Number of tissues shared by sign and magnitude.

Histograms show estimated number of tissues in which top eQTLs are “shared,” considering all tissues (n = 12,171 gene-SNP pairs with a significant eQTL in at least one tissue), non-brain tissues (n = 12,117), and brain tissues only (n = 8,474), and using two different sharing definitions, by sign (a) and by magnitude (b). Sharing by sign means that the eQTLs have the same sign in the estimated effect; sharing by magnitude means that they also have similar effect sizes (within a factor of 2).
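The two sharing definitions can be made concrete with a short sketch. One assumption goes beyond the caption: each tissue's effect is compared against the strongest (largest absolute) effect for that eQTL, so the strongest tissue always counts as shared. The function name and toy effect estimates are hypothetical, and rows are assumed to have a non-zero strongest effect.

```python
import numpy as np

def n_shared(effects, by="sign", factor=2.0):
    """Count, per row, the conditions that share the strongest effect.

    by="sign":      same sign as the strongest effect.
    by="magnitude": same sign and within `factor` of the strongest effect.
    """
    effects = np.asarray(effects, dtype=float)
    idx = np.argmax(np.abs(effects), axis=1)
    strongest = effects[np.arange(len(effects)), idx][:, None]
    ratio = effects / strongest   # in [-1, 1]; positive when signs agree
    if by == "sign":
        shared = ratio > 0
    else:
        shared = ratio >= 1.0 / factor
    return shared.sum(axis=1)

# Two hypothetical eQTLs measured in five tissues.
toy_effects = [[1.0, 0.9, 0.6, 0.2, -0.1],
               [2.0, -1.5, 0.1, 1.0, 0.99]]
by_sign = n_shared(toy_effects, by="sign")
by_magnitude = n_shared(toy_effects, by="magnitude")
```

In the first row, four tissues agree in sign with the strongest effect but only three are within a factor of 2 of it, which shows why the magnitude histograms are shifted toward fewer tissues than the sign histograms.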

Figure 6. Pairwise sharing by magnitude of eQTLs among tissues.

For each pair of tissues, we considered the top eQTLs that were significant (lfsr < 0.05) in at least one of the two tissues, and plotted the proportion of these that are “shared in magnitude”—that is, have effect estimates that are the same sign and within a factor of 2 in size of one another (n = 5,605–9,811 gene-SNP pairs, depending on pair of tissues compared). Brackets surrounding tissue labels highlight groups of biologically related tissues mentioned in the text as showing particularly high levels of sharing.
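The pairwise calculation described in this caption can be sketched as below. It takes a matrix of effect estimates and a matrix of lfsr values (units × tissues), keeps the units significant in at least one tissue of each pair, and reports the fraction with the same sign and sizes within a factor of 2. The function name and toy inputs are hypothetical.

```python
import numpy as np

def pairwise_sharing(post_mean, lfsr, thresh=0.05, factor=2.0):
    """Proportion of effects shared in magnitude for every pair of conditions."""
    post_mean = np.asarray(post_mean, dtype=float)
    lfsr = np.asarray(lfsr, dtype=float)
    R = post_mean.shape[1]
    out = np.eye(R)
    for i in range(R):
        for j in range(i + 1, R):
            # Keep units significant in at least one of the two conditions.
            keep = (lfsr[:, i] < thresh) | (lfsr[:, j] < thresh)
            if not keep.any():
                out[i, j] = out[j, i] = np.nan
                continue
            a, b = post_mean[keep, i], post_mean[keep, j]
            same_sign = a * b > 0
            within = (np.abs(a) <= factor * np.abs(b)) & \
                     (np.abs(b) <= factor * np.abs(a))
            out[i, j] = out[j, i] = np.mean(same_sign & within)
    return out

# Hypothetical estimates for four units in three tissues.
toy_mean = [[1.0, 0.8, 0.1],
            [0.5, 1.5, 0.4],
            [-1.0, -0.9, 1.0],
            [0.2, 0.1, 0.05]]
toy_lfsr = [[0.01, 0.02, 0.5],
            [0.2, 0.01, 0.3],
            [0.01, 0.01, 0.04],
            [0.6, 0.7, 0.8]]
sharing = pairwise_sharing(toy_mean, toy_lfsr)
```

Comparing absolute values in both directions, rather than dividing one estimate by the other, avoids division by zero when an estimate is exactly zero.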
