Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals

doi: 10.1101/gr.155192.113. Epub 2013 Oct 3.

Sara Mostafavi, Xiaowei Zhu, James B Potash, Myrna M Weissman, Courtney McCormick, Christian D Haudenschild, Kenneth B Beckman, Jianxin Shi, Rui Mei, Alexander E Urban, Stephen B Montgomery, Douglas F Levinson, Daphne Koller

Alexis Battle et al. Genome Res. 2014 Jan.

Abstract

Understanding the consequences of regulatory variation in the human genome remains a major challenge, with important implications for understanding gene regulation and interpreting the many disease-risk variants that fall outside of protein-coding regions. Here, we provide a direct window into the regulatory consequences of genetic variation by sequencing RNA from 922 genotyped individuals. We present a comprehensive description of the distribution of regulatory variation--by the specific expression phenotypes altered, the properties of affected genes, and the genomic characteristics of regulatory variants. We detect variants influencing expression of over ten thousand genes, and through the enhanced resolution offered by RNA-sequencing, for the first time we identify thousands of variants associated with specific phenotypes including splicing and allelic expression. Evaluating the effects of both long-range intra-chromosomal and trans (cross-chromosomal) regulation, we observe modularity in the regulatory network, with three-dimensional chromosomal configuration playing a particular role in regulatory modules within each chromosome. We also observe a significant depletion of regulatory variants affecting central and critical genes, along with a trend of reduced effect sizes as variant frequency increases, providing evidence that purifying selection and buffering have limited the deleterious impact of regulatory variation on the cell. Further, generalizing beyond observed variants, we have analyzed the genomic properties of variants associated with expression and splicing and developed a Bayesian model to predict regulatory consequences of genetic variants, applicable to the interpretation of individual genomes and disease studies. Together, these results represent a critical step toward characterizing the complete landscape of human regulatory variation.

Figures

Figure 1.

_Cis_-regulatory variation and allelic effects. (A) Schematic illustration of aseQTL. Heterozygosity at a regulatory locus is linked to allelic imbalance detected from RNA-seq reads over a separate heterozygous coding SNP (a second, separate locus) in the corresponding gene. Conversely, individuals who are homozygous at the regulatory SNP will show balanced allelic expression at the coding SNP (still estimated among individuals who are heterozygous at the coding SNP). (B) Example of aseQTL, the most significant association in this analysis. Rs4950928, a known asthma risk variant SNP in the 5′ UTR of CHI3L1, is associated with allelic imbalance in the coding region of CHI3L1, with heterozygous individuals showing significantly increased allelic imbalance compared to individuals homozygous for either the reference or nonreference allele (P < 10^−71) (Methods). (C) Distribution of significant ASE by individual. In each individual, we evaluate the fraction of testable heterozygous loci (requiring sufficient read depth and other filters) (see Supplemental Material) with significant ASE (binomial P ≤ 10^−3). To evaluate the distribution of ASE not explained by heterozygosity for a common regulatory variant, we then evaluate the same set of testable loci, but only counting ASE when the individual is not heterozygous for a corresponding eQTL SNP. In this case, we consider SNPs that are significant at P ≤ 10^−3 for the corresponding eQTL gene.
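The per-individual ASE tally described in panel C can be illustrated with a minimal sketch. It assumes an exact two-sided binomial test against a 50/50 allelic ratio and skips the read-depth and other filters that the caption defers to the Supplemental Material. The function names and the read counts are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int) -> float:
    """Exact two-sided binomial P-value against a 50/50 allelic ratio."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    k = min(ref_reads, alt_reads)
    # Lower tail P(X <= k) under Binomial(n, 0.5). The null is symmetric,
    # so doubling the tail gives the two-sided P-value (capped at 1).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def fraction_with_ase(sites, threshold=1e-3):
    """Fraction of testable heterozygous sites with P <= threshold."""
    if not sites:
        return 0.0
    hits = sum(
        1 for ref, alt in sites if binomial_two_sided_p(ref, alt) <= threshold
    )
    return hits / len(sites)


# Made-up (reference, alternate) read counts for one individual.
sites = [(30, 28), (45, 5), (12, 14), (0, 20)]
print(fraction_with_ase(sites))
```

Restricting the tally to sites where the individual is not heterozygous for a corresponding eQTL SNP, as in the second analysis of panel C, would amount to filtering `sites` before the call.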

Figure 2.

Distant and modular intra-chromosomal regulation. (A) Q-Q plot of aseQTL _P_-values for intra-chromosomal eQTLs of varying distances. For eQTLs implicating SNPs beyond each distance threshold from the corresponding TSS (0 kb, 20 kb, 100 kb, 300 kb, 500 kb), we computed aseQTL association tests between the eQTL SNP and allelic ratios at all exonic loci available for the corresponding gene, taking the best association identified from these. The expected _P_-value distribution and 95% CIs were computed empirically from repeated random draws of SNPs similarly tested against exonic loci within each eQTL gene. We observe that distant eQTLs show more ASE than expected by chance, although the enrichment declines with distance. (B) Schematic of a genomic region on chromosome 16 containing coregulated genes, along with nearby genes and SNPs having an impact on each gene. Rs11644386 affects a discontinuous group of genes, with the farthest association (CYLD) being >400 kb away, and does not have significant associations with two intermediate genes SNX20 and NOD2. Another SNP, rs8047222, is associated with expression of NKD1 and a nearby gene NOD2, but has no influence on the more distant genes BRD7 and ADCY7.
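The empirical expectation used for the Q-Q plot in panel A can be sketched roughly as follows. The sketch assumes that the null _P_-values from each random SNP draw are already available as a list of the same length as the observed set. It does not reproduce the association tests themselves, and the function names and percentile indexing are illustrative choices rather than the paper's procedure.

```python
import math
import random


def neg_log10_sorted(p_values):
    """Return -log10(P) sorted from most to least significant."""
    return sorted((-math.log10(p) for p in p_values), reverse=True)


def empirical_qq(observed, null_draws, lower=0.025, upper=0.975):
    """Pair each observed order statistic with a null median and 95% band.

    Every entry of null_draws must contain as many P-values as observed.
    """
    obs = neg_log10_sorted(observed)
    nulls = [neg_log10_sorted(draw) for draw in null_draws]
    points = []
    for rank, value in enumerate(obs):
        # Distribution of the rank-th order statistic across null draws.
        at_rank = sorted(draw[rank] for draw in nulls)
        m = len(at_rank)
        points.append({
            "observed": value,
            "expected": at_rank[m // 2],
            "ci_low": at_rank[int(lower * (m - 1))],
            "ci_high": at_rank[int(upper * (m - 1))],
        })
    return points


# Toy data: uniform null P-values in (0, 1] stand in for random SNP draws.
random.seed(0)
observed = [1e-6, 1e-4, 0.02, 0.3, 0.7]
null_draws = [[1.0 - random.random() for _ in observed] for _ in range(200)]
for point in empirical_qq(observed, null_draws):
    print(point)
```

Observed points lying above `ci_high` correspond to more allelic signal than the random draws produce, which is the pattern the caption reports for distant eQTLs.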

Figure 3.

_Trans_-regulatory variation and mediation through proximal genes. (A) Example subnetwork of significant associations centered on expression of IKZF1. The SNP rs10251980 is associated with expression of the nearby gene IKZF1 (P < 10^−8), along with eight distant genes (P < 10^−12). IKZF1 is also coexpressed with six of the eight genes (P < 0.05). (B) Prevalence of candidate regulatory network structures for _trans_-eQTLs including the SNP, the corresponding distant genes, and any genes proximal to the SNP that are also associated with its genotype. For each _trans_-eQTL gene, we analyzed its relationship with the most strongly associated SNP, along with all genes within 1 Mb of that SNP. Network structures best fitting each set were identified using likelihood ratio tests (Methods; Supplemental Material). (C) Association between rs2759386 and isoform ratio of FYB, potentially mediated through expression of splicing factor QKI, which is proximal to the SNP. Rs2759386 is associated with total expression levels of QKI, and both this SNP and QKI are associated with isoform ratio of FYB (P < 10^−14 and P < 10^−16, respectively).

Figure 4.

Distribution of _cis_-regulatory variation and selective pressure. (A) Effect size of _cis_-eQTLs compared to minor allele frequency of the most significant SNP per eQTL gene (computed using subsampling) (Methods). We find a strong inverse relationship (Spearman's r = −0.13, P < 10^−7). If we normalize by the observed variance of each gene, the observed relationship becomes stronger (P < 10^−39). (B) A depletion of _cis_-eQTLs is evident (P < 0.05) among genes with many protein–protein interactions (PPI); additionally, a strong negative correlation exists between the number of interactions and eQTL effect size (P < 10^−35). Protein coding genes were put into quantile buckets according to the number of known PPI relationships (Methods). The fraction of genes in each bucket having a significant _cis_-eQTL was computed along with the average effect size for the observed eQTLs. Fewer eQTLs are observed among genes with the most interactions (hub genes). Genes in the bottom 20% may be moderately depleted as well, although confidence intervals (95%) are overlapping with the intermediate deciles. (C) The fraction of genes with a significant _cis_-eQTL and average eQTL effect size are shown according to an estimate of the genes' regulatory impact. Known regulatory genes were put into quantiles according to the strength of correlation observed between their expression profile and the expression of all nonregulatory genes. Nonregulatory genes are shown in the leftmost bar for comparison. Strong regulatory genes show significant depletion of eQTLs (P < 10^−2) compared to nonregulators and weak regulatory genes and, similarly, reduced eQTL effect sizes (P < 10^−100).
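The rank correlation reported in panel A can be illustrated with a small, dependency-free sketch of Spearman's coefficient, computed as the Pearson correlation of average ranks. The minor allele frequencies and effect sizes below are invented for illustration, and the subsampling and per-gene variance normalization mentioned in the caption are not reproduced.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical minor allele frequencies and eQTL effect sizes.
maf = [0.05, 0.10, 0.20, 0.40, 0.45]
effect = [1.9, 1.2, 1.4, 0.6, 0.3]
print(spearman(maf, effect))
```

A negative value indicates that effect size tends to fall as allele frequency rises, which is the direction of the relationship the caption describes.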

Figure 5.

Genomic properties of regulatory variation and prediction of eQTLs. (A) Enrichment of proximal eQTLs and sQTLs is shown as a function of distance to the TSS. Enrichment is computed here as the log odds multiplier on likelihood of association (Methods). In the zoomed, intragenic view, enrichment (log odds multiplier) of proximal eQTLs and sQTLs is shown within gene boundaries for UTR, intronic, and exonic loci. We aggregate SNPs within all exons except the first (the closest to TSS) together, and likewise for introns. (B) Enrichment of _cis_-eQTLs and sQTLs for functional and genomic annotations, controlling for distance. In each case, (log) odds multipliers were computed for each category after conditioning on SNP location (Methods) shown in A and B. ChIP-seq and DNase I annotation enrichments are shown here for SNPs falling within 20 kb upstream of TSS; for full enrichment statistics see Supplemental Data S2. (C) Enrichment of _cis_-eQTLs stratified by LRVM score (restricting to genes and SNPs excluded from training LRVM). Each SNP-gene pair was scored by LRVM for likelihood of association, and twenty quantiles were computed for the resulting scores. Finally, enrichment was computed for each quantile, using log odds estimation after correcting for position (Methods). (D) Predicted regulatory impact of trait-associated (GWAS) SNPs according to LRVM for 263 unique disease variants not available during LRVM training. We compute the score of each SNP for each of its proximal genes. Known trait-associated SNPs score more highly than expected at random (P < 10^−9), indicating enrichment for properties that match those of observed regulatory variants. (E) LRVM scores are predictive of allelic effects, indicative of _cis_-regulatory impact. We correlate allelic imbalance (Methods) observed among heterozygous individuals for each SNP with the score assigned by each predictive model to the corresponding SNP. Significance is estimated using the Wilcoxon rank sum test. Again, analysis is restricted to SNPs not used to train LRVM.
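The quantile-stratified enrichment in panel C can be sketched as below. This is a simplified stand-in, not the paper's estimator: it computes a plain log odds of association within each score quantile relative to the overall log odds, with a pseudocount to avoid division by zero, and omits the correction for SNP position described in the Methods. The function name, the pseudocount, and the toy data are assumptions.

```python
import math


def log_odds_by_quantile(scores, is_eqtl, n_bins=20, pseudocount=0.5):
    """Log odds of association per score quantile, relative to overall odds.

    scores  : one model score per SNP-gene pair
    is_eqtl : 1 if the pair is an observed eQTL, else 0
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    total_hits = sum(is_eqtl)
    total_miss = n - total_hits
    base = math.log((total_hits + pseudocount) / (total_miss + pseudocount))
    enrichment = []
    for b in range(n_bins):
        # Equal-sized bins over the score-sorted pairs, lowest scores first.
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        hits = sum(is_eqtl[i] for i in members)
        miss = len(members) - hits
        odds = (hits + pseudocount) / (miss + pseudocount)
        enrichment.append(math.log(odds) - base)
    return enrichment


# Toy example: 40 pairs in which only the 10 highest-scoring are eQTLs.
scores = [i / 40 for i in range(40)]
is_eqtl = [0] * 30 + [1] * 10
print(log_odds_by_quantile(scores, is_eqtl, n_bins=4))
```

Positive values mark quantiles where associations are more common than in the full set. A score that tracks regulatory impact would be expected to show rising enrichment toward the top quantiles.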
