RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease

Science. 2015 Jan 9;347(6218):1254806.

doi: 10.1126/science.1254806. Epub 2014 Dec 18.

Hui Y Xiong, Babak Alipanahi, Leo J Lee, Hannes Bretschneider, Daniele Merico, Ryan K C Yuen, Yimin Hua, Serge Gueroussov, Hamed S Najafabadi, Timothy R Hughes, Quaid Morris, Yoseph Barash, Adrian R Krainer, Nebojsa Jojic, Stephen W Scherer, Benjamin J Blencowe, Brendan J Frey

Hui Y Xiong et al. Science. 2015.

Abstract

To facilitate precision medicine and whole-genome annotation, we developed a machine-learning technique that scores how strongly genetic variants affect RNA splicing, whose alteration contributes to many diseases. Analysis of more than 650,000 intronic and exonic variants revealed widespread patterns of mutation-driven aberrant splicing. Intronic disease mutations that are more than 30 nucleotides from any splice site alter splicing nine times as often as common variants, and missense exonic disease mutations that have the least impact on protein function are five times as likely as others to alter splicing. We detected tens of thousands of disease-causing mutations, including those involved in cancers and spinal muscular atrophy. Examination of intronic and exonic variants found using whole-genome sequencing of individuals with autism revealed misspliced genes with neurodevelopmental phenotypes. Our approach provides evidence for causal variants and should enable new discoveries in precision medicine.

Copyright © 2015, American Association for the Advancement of Science.

Figures

Figure 1. The human splicing code

(a) For a given cell type, the computational model extracts the regulatory code from a test DNA sequence and predicts the percent of transcripts with the exon spliced in, Ψ. (b) Predictions were made for 10,689 test exons profiled in 16 tissues. Exons and tissues were binned according to their RNA-seq-assessed values of Ψ, and for each bin (column) the distribution of code-predicted Ψ is plotted (_n_=56,104).

Figure 2. Accounting for RNA-binding proteins

(a) The splicing code accounts for the affinities of RNA-binding proteins assayed in 98 in vitro experiments (13). (b) When code-predicted Ψ values are subtracted from RNA-seq assessed values of Ψ, their correlations with the binding affinities mostly vanish.

Figure 3. Genome-wide analysis of genetic variations

(a) To assess the effect of a single nucleotide variation (SNV), the computational model is applied to the reference sequence and the variant. Then, the maximum difference ΔΨ across tissues is computed, along with a ‘regulatory score’ that also accounts for prediction confidence (Sec. S7). (b) The effect on Ψ of 658,420 intronic and exonic SNVs. (c) Locations and predicted ΔΨ of 81,608 disease annotated intronic SNVs and synonymous or missense exonic SNVs. In different sequence regions, the scores of disease SNVs tend to be larger than those of SNPs (Ansari-Bradley tests for equal dispersion, n includes both types).
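The ΔΨ step in panel (a) can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `max_delta_psi` and the Ψ values are hypothetical, and it assumes that "maximum difference" means the per-tissue difference of largest magnitude, with its sign kept. The regulatory score is defined in Sec. S7, which is not reproduced here, so it is left out.

```python
from typing import Sequence


def max_delta_psi(psi_ref: Sequence[float], psi_var: Sequence[float]) -> float:
    """Return the per-tissue difference (variant minus reference) of largest magnitude.

    Assumption: the sign is kept, so a negative result means decreased inclusion.
    """
    if len(psi_ref) != len(psi_var) or len(psi_ref) == 0:
        raise ValueError("need one predicted PSI value per tissue for both sequences")
    diffs = [v - r for r, v in zip(psi_ref, psi_var)]
    return max(diffs, key=abs)


# Hypothetical predicted PSI values (percent) in three tissues.
psi_reference = [80.0, 75.0, 90.0]
psi_variant = [78.0, 40.0, 88.0]
delta = max_delta_psi(psi_reference, psi_variant)  # -35.0, driven by the second tissue
```

Keeping the sign matters for later steps that look specifically for decreased exon inclusion.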

Figure 4. Regulatory scores of GWAS SNPs

(a) Distributions of regulatory scores for GWAS-implicated SNPs (_n_=457), non-GWAS-implicated SNPs (_n_=262,347) and disease SNVs (_n_=18,291) in introns. (b) Regulatory scores of disease annotated intronic SNVs that are causal (_n_=17,631), supported by in vitro/vivo data (_n_=224), only associated (_n_=324), or associated but have additional functional evidence (_n_=112). t-test _P_-values.

Figure 5. The mutational landscape of spinal muscular atrophy

(a) Spinal muscular atrophy arises when there is homozygous loss of SMN1 function, but functional protein can be produced by modifying the regulation of SMN2, which differs from SMN1 in four nucleotides (red lightning bolts) and exhibits decreased inclusion of exon 7. (b) Three mutations that the splicing code predicts will increase exon 7 inclusion in SMN2 (green lightning bolts) were selected from predictions for all possible single-nucleotide substitutions 150 nt into the intron. These were validated using RT-PCR (c), along with the predicted differences in SMN1 and SMN2 regulation due to three individual substitutions and all four substitutions. Predictions and RT-PCR data have a Spearman correlation of 0.82 (_P_=0.017, one-sided permutation test). (d) Predicted ΔΨ for 85 individual mutations located in four regions are plotted against RT-PCR-assessed values; the Spearman correlation is 0.74 (_P_=5.7e-16, one-sided permutation test).
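The statistic reported in panels (c) and (d), a Spearman correlation with a one-sided permutation _P_-value, can be sketched as follows. The caption does not describe the permutation scheme, so this sketch assumes the simplest one: shuffle the measured values and count how often the shuffled correlation is at least as large as the observed one. All function names and the data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_test(x: Sequence[float], y: Sequence[float],
                     n_perm: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Return (observed rho, one-sided P) for the alternative rho > 0."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical predicted and measured delta-PSI values for seven mutations.
predicted = [-20.0, -5.0, 0.0, 3.0, 8.0, 15.0, 30.0]
measured = [-18.0, -2.0, 1.0, 2.0, 10.0, 12.0, 25.0]
rho, p_value = permutation_test(predicted, measured)
```

With a finite number of shuffles the smallest reportable _P_-value is 1/(`n_perm` + 1), so a very small value such as the one in panel (d) would need either far more permutations or an analytic approximation.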

Figure 6. The mutational landscape of nonpolyposis colorectal cancer

(a) Predicted ΔΨ for mutations in MLH1 and MSH2 arising in patients with nonpolyposis colorectal cancer, or Lynch syndrome. Coding sequence (CDS) numbering is based on GenBank NM_000249.3 and NM_000251.2 and starts at A of the ATG translation initiation codon. (b) Validation using 134 MLH1 variations tested by RT-PCR (AUC=92.4%, _P_=2.8e-24, one-sided permutation test) and 73 MSH2 variations (AUC=93.8%, _P_=8.7e-15, one-sided permutation test).
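The AUC values in panel (b) summarize how well a score separates two classes of variants. A minimal sketch, assuming each RT-PCR-tested variant carries a binary label (splicing altered or not) and a numeric prediction score; the labels, scores, and the function name `auc` are hypothetical, and the caption does not say exactly which score was ranked.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive outscores a randomly chosen negative (ties count as 0.5)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four variants; the first two altered splicing in RT-PCR.
example_auc = auc([0.9, 0.2, 0.6, 0.1], [True, True, False, False])  # 0.75
```

The pairwise form is quadratic in the number of variants, which is fine for sets of the size in this figure (134 and 73 variants).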

Figure 7. Splicing misregulation in individuals with autism

(a) Genes containing at least one SNV that the computational model predicts will cause decreased exon inclusion were identified in five autism spectrum disorder (ASD) cases and twelve controls, by thresholding ΔΨ using either the 2nd or 3rd percentile of ΔΨ for SNPs. (b) Genes that our method predicts are misregulated in ASD cases more frequently have high expression in brain tissues than genes predicted to be misregulated in controls. (c) The effect of varying the threshold on ΔΨ, and thus the number of case and control genes, on the odds ratio for the enrichment of central nervous system development genes (GO:0007417); in all cases, _P_<0.05.
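The thresholding in panel (a) and the odds ratio in panel (c) can be illustrated with a short sketch. It assumes a nearest-rank percentile and that a gene is flagged when its ΔΨ is at or below the threshold taken from the SNP distribution; the helper names, gene labels, and counts are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Sequence, Set


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: the smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    k = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[k - 1]


def flag_decreased(delta_psi_by_gene: Dict[str, float], threshold: float) -> Set[str]:
    """Genes whose predicted delta-PSI is at or below the threshold (decreased inclusion)."""
    return {gene for gene, d in delta_psi_by_gene.items() if d <= threshold}


def odds_ratio(case_in: int, case_out: int, ctrl_in: int, ctrl_out: int) -> float:
    """Odds ratio of a 2x2 table; undefined (division by zero) if case_out or ctrl_in is 0."""
    return (case_in * ctrl_out) / (case_out * ctrl_in)


# Hypothetical delta-PSI values for 100 common SNPs, from -49 to 50.
snp_delta_psi = [float(i) for i in range(-49, 51)]
threshold = percentile(snp_delta_psi, 2)  # 2nd percentile of the SNP distribution

# Hypothetical per-gene predictions for one individual.
flagged = flag_decreased({"GENE_A": -60.0, "GENE_B": -10.0, "GENE_C": -48.0}, threshold)

# Hypothetical counts: flagged genes inside / outside a GO category, cases vs controls.
ratio = odds_ratio(case_in=6, case_out=4, ctrl_in=3, ctrl_out=7)
```

Moving from the 2nd to the 3rd percentile loosens the threshold, so more genes are flagged in both cases and controls, which is the trade-off panel (c) explores.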

Figure 8. Misregulated genes and functional categories enriched in individuals with autism

Gene Ontology and pathway categories that are enriched (_P_≤0.01, Fisher's exact test) in misregulated genes from ASD cases compared to controls were identified (_n_=18), along with the corresponding set of genes from ASD cases. Each gene set is shown as a red or pink dot, depending on whether the 2nd or 3rd percentile threshold was used for detection (Fig. 7a), and size is proportional to the number of genes in the set. Edge thickness indicates the fraction of genes shared between two sets. Groups of functionally related gene sets are highlighted by blond discs. The names of novel genes that are not already implicated in ASD and have neural-related phenotypes are printed in black, the names of genes already implicated in ASD are printed in red, and otherwise gene names are printed in pale blue. If a gene is in multiple categories, the number of categories is written in superscript and genes in which a stop codon is introduced by the SNV are labeled ‘s’.
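The enrichment test named in the caption can be sketched with the hypergeometric distribution. The caption does not say whether the test was one- or two-sided; this sketch assumes a one-sided test for over-representation in cases, and the 2x2 counts and function name are hypothetical.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a), where X is the top-left cell under the hypergeometric
    null with all row and column totals fixed.
    """
    row1 = a + b        # e.g. misregulated genes in cases
    col1 = a + c        # e.g. genes belonging to the category
    n = a + b + c + d
    total = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / total


# Hypothetical counts: rows are cases/controls, columns are in/out of a category.
p_enrich = fisher_one_sided(3, 1, 1, 3)  # equals 17/70
```

`math.comb` returns 0 when the lower argument exceeds the upper one, so table configurations that are impossible under the fixed margins contribute nothing to the sum.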

References

1. Lindblad-Toh K, et al. Nature. 2011;478:476–82.
2. Bernstein BE, et al. Nature. 2012;489:57–74.
3. Barash Y, et al. Nature. 2010;465:53–9.
4. Zhang C, et al. Science. 2010;329:439–43.
5. Barbosa-Morais NL, et al. Science. 2012;338:1587–93.
