MutPred Splice: machine learning-based prediction of exonic variants that disrupt splicing

Matthew Mort et al. Genome Biol. 2014.

Abstract

We have developed a novel machine-learning approach, MutPred Splice, for the identification of coding region substitutions that disrupt pre-mRNA splicing. Applying MutPred Splice to human disease-causing exonic mutations suggests that 16% of mutations causing inherited disease and 10 to 14% of somatic mutations in cancer may disrupt pre-mRNA splicing. For inherited disease, the main mechanism responsible for the splicing defect is splice site loss, whereas for cancer the predominant mechanism of splicing disruption is predicted to be exon skipping via loss of exonic splicing enhancers or gain of exonic splicing silencer elements. MutPred Splice is available at http://mutdb.org/mutpredsplice.

Figures

Figure 1

Feature ranking for Disease negative set versus SNP negative set (Iter. 1), shown by means of the average AUC using 10-fold cross-validation. The linear support vector machine (SVM) classifier was trained with only the specific feature (or feature subset) that was being tested. As a control, a randomly generated numerical value was assigned to each training example. The AUC values for all features were then compared with the AUC produced by a classifier trained on only the randomly generated attribute, using a Bonferroni-corrected _t_-test (P < 0.05). AUC values that differ significantly from that of the random attribute are indicated by asterisks in parentheses for the respective data sets (significant Disease negative set feature, significant SNP negative set feature). Features are ranked by reference to the Disease negative set.
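The idea of ranking each feature by AUC alongside a random control attribute can be sketched in a few lines of Python. This is a simplified illustration, not the authors' pipeline: it uses the raw feature value as the score instead of training a linear SVM with 10-fold cross-validation, and it only computes the Bonferroni-corrected threshold rather than running the _t_-test. The feature names, values, and labels are hypothetical.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(columns, labels, seed=0):
    """Score every feature on its own, add a random control, sort by AUC (descending)."""
    rng = random.Random(seed)
    table = dict(columns)
    # Control attribute: one random number per training example.
    table["random_control"] = [rng.random() for _ in labels]
    scored = [(name, auc(values, labels)) for name, values in table.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy data: 1 = splice-altering, 0 = splice-neutral.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "feature_a": [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
    "feature_b": [0.6, 0.4, 0.9, 0.5, 0.3, 0.7],
}

ranking = rank_features(columns, labels)
# Bonferroni correction: divide the significance level by the number of features tested.
corrected_alpha = 0.05 / len(columns)

for name, value in ranking:
    print(f"{name}: AUC = {value:.3f}")
print(f"Bonferroni-corrected threshold: {corrected_alpha}")
```

A feature whose AUC is not distinguishable from that of `random_control` carries little signal on its own, which is the comparison the asterisks in the figure summarize.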

Figure 2

Model performance evaluation using ROC curves when applied to the same unseen test set of 352 variants (238 positive and 114 negative). For each of the four training sets (Table 2), three different random forest (RF) classification models were built (Iter. 1, Iter. 2 and Iter. 3). The percentage AUC for each training set and specific iteration is shown in parentheses.

Figure 3

Case study illustrating the semi-supervised approach employed in this study. The disease-causing (DM) missense mutation CM080465 in the OPA1 gene (NM_015560.2: c.1199C > T; NP_056375.2: p.P400L) was not originally reported to disrupt splicing but was later shown in vitro to disrupt pre-mRNA splicing [25]. CM080465 was included in the negative set in the first iteration (Iter. 1). The Iter. 1 model, however, predicted CM080465 to disrupt pre-mRNA splicing (SAV). In the next iteration (Iter. 2), CM080465 was excluded from the negative set. The Iter. 2 model still predicted CM080465 to be a SAV and so, in the final iteration (Iter. 3), this variant was included in the positive set. This demonstrated that a semi-supervised approach can, at least in some instances, correctly re-label an incorrectly labeled training example. SAV, splice-altering variant; SNV, splice neutral variant.
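The three-iteration relabeling path described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `train` is a hypothetical callable that returns a predictor, the toy trainer ignores its training sets and thresholds fixed made-up scores, and every variant ID other than CM080465 is invented. The caption does not say what happens to variants flagged in Iter. 1 but not confirmed in Iter. 2; here they are assumed to stay excluded.

```python
def build_training_sets(positives, negatives, train):
    """Return the (positive, negative) training sets used in Iter. 1, 2 and 3.

    train(pos, neg) must return a function variant_id -> True if predicted SAV.
    """
    pos1, neg1 = set(positives), set(negatives)

    # Iter. 1: train on the original labels, then flag negatives predicted as SAVs.
    model1 = train(pos1, neg1)
    flagged = {v for v in neg1 if model1(v)}

    # Iter. 2: exclude the flagged variants from the negative set and retrain.
    pos2, neg2 = pos1, neg1 - flagged
    model2 = train(pos2, neg2)
    confirmed = {v for v in flagged if model2(v)}

    # Iter. 3: variants still predicted as SAVs move to the positive set.
    # Assumption: flagged-but-unconfirmed variants remain excluded.
    pos3, neg3 = pos2 | confirmed, neg2
    return [(pos1, neg1), (pos2, neg2), (pos3, neg3)]

# Hypothetical scores standing in for a real classifier's output.
toy_scores = {"pos_1": 0.9, "pos_2": 0.8, "CM080465": 0.7, "neg_1": 0.2, "neg_2": 0.1}

def toy_train(pos, neg):
    """Toy trainer: ignores the training sets and thresholds the fixed scores."""
    return lambda variant: toy_scores[variant] >= 0.5

iterations = build_training_sets(
    positives={"pos_1", "pos_2"},
    negatives={"CM080465", "neg_1", "neg_2"},
    train=toy_train,
)

for number, (pos, neg) in enumerate(iterations, start=1):
    print(f"Iter. {number}: positives={sorted(pos)} negatives={sorted(neg)}")
```

With a real classifier, `train` would fit a new model on each pair of sets, so the Iter. 2 prediction for a flagged variant is made by a model that never saw it labeled as negative.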

Figure 4

Role of exonic variants in aberrant mRNA processing for Inherited disease and Cancer data sets. The somatic Cancer variants were derived from COSMIC and include both driver and passenger mutations. For all mutation types and the combined total, the proportions of predicted SAVs in both Inherited disease and Cancer were significantly enriched (Fisher’s exact test with Bonferroni correction applied; P < 0.05) when compared to exonic variants identified in the 1000 Genomes Project (unlike the SNP negative training set, in this instance no MAF filter was applied, that is, all rare and common variants were included).
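The enrichment comparison described here can be illustrated with a small pure-Python Fisher's exact test followed by a Bonferroni adjustment. This is a sketch under assumptions: the caption does not state whether the test was one- or two-sided, so a one-sided (enrichment) test is used, and all counts and category names below are hypothetical, not values from the paper.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability of
    seeing at least `a` predicted SAVs in the first group by chance.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(total - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(total, col1)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    n_tests = len(p_values)
    return {name: min(1.0, p * n_tests) for name, p in p_values.items()}

# Hypothetical counts per mutation type:
# (SAVs in disease set, non-SAVs in disease set, SAVs in control set, non-SAVs in control set)
tables = {
    "type_1": (30, 70, 10, 90),
    "type_2": (12, 88, 9, 91),
    "type_3": (25, 75, 8, 92),
}

raw = {name: fisher_exact_greater(*counts) for name, counts in tables.items()}
adjusted = bonferroni(raw)

for name in tables:
    verdict = "enriched" if adjusted[name] < 0.05 else "not significant"
    print(f"{name}: raw p = {raw[name]:.3g}, adjusted p = {adjusted[name]:.3g} ({verdict})")
```

Bonferroni is conservative: each adjusted p-value is compared against the same 0.05 level, which is equivalent to testing each raw p-value against 0.05 divided by the number of tests.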

Figure 5

Confident hypotheses of the underlying splicing mechanism disrupted for predicted exonic SAVs in Inherited disease and somatic variants in Cancer. Significant enrichment (+) or depletion (-) for a specific hypothesis is shown for the Cancer versus Inherited disease datasets (Fisher’s exact test with a Bonferroni-corrected threshold of P < 0.05).

Figure 6

Proportion of exonic variants involved in aberrant mRNA processing for a set of tumor suppressor genes (71 genes) and a set of oncogenes (54 genes), from three different data sets (Inherited disease, somatic mutations in Cancer, and variants identified in the 1000 Genomes Project with no MAF filter applied, that is, all rare and common variants included). Disease-causing substitutions in tumor suppressor (TS) genes tend to be recessive loss-of-function mutations, in contrast to disease-causing substitutions in oncogenes, which are usually dominant gain-of-function mutations. Inherited disease and Cancer are significantly enriched in the TS gene set (denoted by an asterisk), when compared with the equivalent set of oncogenes, for mutations that are predicted to result in aberrant mRNA processing (SAVs). _P_-values were calculated using a Fisher’s exact test with a Bonferroni-corrected threshold of P < 0.05.

References

1. Stenson PD, Mort M, Ball EV, Shaw K, Phillips AD, Cooper DN. The human gene mutation database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet. 2014;133:1–9. doi: 10.1007/s00439-013-1358-4.
2. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814. doi: 10.1093/nar/gkg509.
3. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
4. Ferrer-Costa C, Gelpi JL, Zamakola L, Parraga I, de la Cruz X, Orozco M. PMUT: a web-based tool for the annotation of pathological mutations on proteins. Bioinformatics. 2005;21:3176–3178. doi: 10.1093/bioinformatics/bti486.
5. Ryan M, Diekhans M, Lien S, Liu Y, Karchin R. LS-SNP/PDB: annotated non-synonymous SNPs mapped to Protein Data Bank structures. Bioinformatics. 2009;25:1431–1432. doi: 10.1093/bioinformatics/btp242.
