A statistical method for the detection of alternative splicing using RNA-seq

Liguo Wang et al. PLoS One. 2010.

Abstract

Deep sequencing of the transcriptome (RNA-seq) provides an unprecedented opportunity to interrogate plausible mRNA splicing patterns by mapping RNA-seq reads to exon junctions (hereafter, junction reads). In most previous studies, exon junctions were detected using the quantitative information of junction reads. This quantitative criterion (e.g. a minimum of two junction reads), although straightforward and widely used, usually results in high false positive and false negative rates, owing to the complexity of the transcriptome. Here, we introduced a new metric, Minimal Match on Either Side of exon junction (MMES), to measure the quality of each junction read, and subsequently implemented an empirical statistical model to detect exon junctions. When applied to a large dataset (>200M reads) consisting of mouse brain, liver and muscle mRNA sequences, with independent transcript databases as positive controls, our method proved considerably more accurate than previous ones, especially for detecting junctions originating from low-abundance transcripts. Our results were also confirmed by real-time RT-PCR assays. The MMES metric can be used either in this empirical statistical model or in other, more sophisticated classifiers, such as logistic regression.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. MMES metric and its skewed distribution over ESJ and ERJ.

(A) Calculation of Minimal Match on Either Side (MMES) of a junction. Each square represents a nucleotide. The figure shows eight 25-mer reads aligned to a 42-mer exon junction (bottom). The exon junction is composed of two equal parts: the left part (filled in red) is the last 21 bp of the upstream exon, and the right part (filled in blue) is the first 21 bp of the downstream exon. Within aligned reads, matched nucleotides are colored either red (left) or blue (right), and mismatched nucleotides are blank. The MMES score is shown to the right of each read. (B) Distribution of MMES scores (see main text) for Exon Splicing Junctions (ESJ, red lines) and Exon Random Junctions (ERJ, blue lines). For both ESJ and ERJ, mapped reads are divided into 3 categories: 0 mismatches (circle), 1 mismatch (triangle) and 2 mismatches (cross).
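
The calculation in panel (A) can be sketched in a few lines of Python. This is only an illustration of one reading of the caption, in which MMES is the smaller of the two per-side counts of matched nucleotides for an ungapped read alignment; the function name `mmes`, its parameters, and the toy sequences are hypothetical and are not taken from the authors' software.

```python
def mmes(read: str, junction: str, offset: int, left_len: int = 21) -> int:
    """Minimal Match on Either Side for one ungapped read alignment.

    Assumption: MMES = min(matched bases left of the splice point,
    matched bases right of it). `offset` is the 0-based position in
    `junction` where the read starts; `left_len` is the length of the
    upstream-exon part (21 bp in the figure).
    """
    left = right = 0
    for i, base in enumerate(read):
        pos = offset + i
        if base == junction[pos]:      # count matches only
            if pos < left_len:
                left += 1              # upstream-exon side
            else:
                right += 1             # downstream-exon side
    return min(left, right)


# Toy 42-mer junction: 21 bp upstream part + 21 bp downstream part.
junction = "A" * 21 + "C" * 21
# A 25-mer read with 10 bases on the left and 15 on the right, no mismatch.
print(mmes("A" * 10 + "C" * 15, junction, offset=11))  # 10
```

Under this reading, a read that barely overhangs the junction gets a low score even when it aligns perfectly, which is the property the metric uses to separate real junction reads from spurious ones.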

Figure 2. Performance of MMES based empirical method.

(A) Relationship between “percent of splicing junctions detected” and the p-value cutoff. Splicing junctions are grouped by the number of covering reads R. The pink line indicates the FDR incurred at the corresponding p-value cutoff, and the vertical dashed line indicates the p-value = 0.01 cutoff (incurred FDR = 4.8%). (B) At a p-value threshold of 0.01, all junctions are divided into two classes: junctions with p-value≤0.01 are predicted to be real, while junctions with p-value>0.01 are predicted to be false. Each class is further divided into 5 sub-classes according to the number of covering reads. For each sub-class, the percent of junctions verified (PPV) is calculated by cross-validating predicted junctions against the combined alternative splicing database.
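
The bookkeeping behind panel (B) might look like the sketch below. It assumes each junction is a tuple of (identifier, p-value, number of covering reads) and that the reference database is a set of known junction identifiers. The sub-class boundaries are not given in the caption, so capping the read count at 5 is purely an assumption, and the name `ppv_by_class` is hypothetical.

```python
from collections import defaultdict


def ppv_by_class(junctions, known, p_cutoff=0.01, max_bin=5):
    """Fraction of junctions found in `known`, per (class, read-count bin).

    junctions: iterable of (junction_id, p_value, n_reads)
    known:     set of junction ids from a reference database
    Assumption: read counts >= max_bin share one bin (real bins unknown).
    """
    total = defaultdict(int)
    verified = defaultdict(int)
    for jid, p_value, n_reads in junctions:
        predicted = "real" if p_value <= p_cutoff else "false"
        key = (predicted, min(n_reads, max_bin))
        total[key] += 1
        if jid in known:
            verified[key] += 1
    return {key: verified[key] / total[key] for key in total}


# Made-up toy data, for illustration only.
toy = [("j1", 0.001, 1), ("j2", 0.005, 1), ("j3", 0.5, 3), ("j4", 0.2, 7)]
print(ppv_by_class(toy, known={"j1", "j4"}))
```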

Figure 3. Comparison of MMES based empirical approach with read-counting method and logistic regression model.

(A) All splicing junctions predicted by either method are divided into 3 non-overlapping categories: “P0.01_uniq” refers to junctions with only 1 covering read but with p-value≤0.01 (green); “R2_uniq” refers to junctions with at least 2 covering reads but with p-value>0.01 (red); “Common” refers to junctions with at least 2 covering reads and with p-value≤0.01 (blue). (B) Validation rate (PPV) for “P0.01_uniq”, “R2_uniq”, and “Common”, respectively. (C) “P0.01_uniq” refers to junctions detected by the MMES based empirical method only (green), “LR_uniq” refers to junctions identified by logistic regression only (red), “BothSig” refers to junctions identified by both models (blue) and “BothUnsig” refers to junctions rejected by both methods. (D) Validation rate (PPV) for “P0.01_uniq”, “LR_uniq”, and “Common”, respectively.
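
The three categories in panel (A) follow from two yes/no decisions per junction: whether it passes the p-value cutoff and whether it passes the read-counting criterion. A minimal sketch, assuming the cutoffs stated in the caption (p-value≤0.01, at least 2 covering reads) and a hypothetical function name `categorize`:

```python
from typing import Optional


def categorize(p_value: float, n_reads: int,
               p_cutoff: float = 0.01, min_reads: int = 2) -> Optional[str]:
    """Assign a junction to one of the panel (A) categories.

    Returns None for junctions predicted by neither method, which
    panel (A) does not show.
    """
    significant = p_value <= p_cutoff      # MMES based empirical method
    counted = n_reads >= min_reads         # read-counting method
    if significant and counted:
        return "Common"
    if significant:
        return "P0.01_uniq"
    if counted:
        return "R2_uniq"
    return None


print(categorize(0.003, 1))  # P0.01_uniq
print(categorize(0.200, 4))  # R2_uniq
```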
