Inference of isoforms from short sequence reads

Jianxing Feng et al. J Comput Biol. 2011 Mar.

Abstract

Due to alternative splicing events in eukaryotic species, the identification of mRNA isoforms (or splicing variants) is a difficult problem. Traditional experimental methods for this purpose are time-consuming and not cost-effective. The emerging RNA-Seq technology provides a potentially effective way to address this problem. Although the advantages of RNA-Seq over traditional methods in transcriptome analysis have been confirmed by many studies, the inference of isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) has remained computationally challenging. In this work, we propose a method to calculate the expression levels of isoforms and infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS), and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can calculate the expression levels of isoforms accurately if all the isoforms are known, and it can infer novel isoforms from scratch. Our experimental tests on known mouse isoforms with both simulated expression levels and reads demonstrate that IsoInfer calculates the expression levels of isoforms with an accuracy comparable to the state-of-the-art statistical method while running 60 times faster. Moreover, our tests on both simulated and real reads show that it achieves good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS, and PAS information, especially for isoforms whose expression levels are significantly high. The software is publicly available for free at http://www.cs.ucr.edu/~jianxing/IsoInfer.html.
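To make the quadratic-programming idea in the abstract concrete, here is a minimal sketch, not the IsoInfer formulation itself. It assumes a simplified, unweighted objective: each segment's observed read density is modeled as the sum of the expression levels of the isoforms that contain it, and the levels are fit by least squares under a non-negativity constraint. The function name `estimate_expression`, the membership matrix, and the toy densities are all hypothetical and chosen only for illustration.

```python
def estimate_expression(membership, densities, iterations=5000):
    """Fit non-negative isoform expression levels by projected gradient descent.

    membership[j][i] is 1 if (hypothetical) isoform i contains segment j, else 0.
    densities[j] is the observed read density on segment j.
    Minimizes sum_j (sum_i membership[j][i] * x[i] - densities[j])^2 with x >= 0.
    """
    n_seg = len(membership)
    n_iso = len(membership[0])
    # 2 * (squared Frobenius norm) upper-bounds the gradient's Lipschitz
    # constant, so its reciprocal is a safe fixed step size.
    frob_sq = sum(a * a for row in membership for a in row)
    step = 1.0 / (2.0 * frob_sq)
    x = [0.0] * n_iso
    for _ in range(iterations):
        residual = [
            sum(row[i] * x[i] for i in range(n_iso)) - d
            for row, d in zip(membership, densities)
        ]
        grad = [
            2.0 * sum(membership[j][i] * residual[j] for j in range(n_seg))
            for i in range(n_iso)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xi - step * g) for xi, g in zip(x, grad)]
    return x


# Toy gene (invented data): isoform 0 uses segments 0, 1, 2; isoform 1 skips segment 1.
membership = [[1, 1], [1, 0], [1, 1]]
# Densities generated from expression levels [3.0, 5.0] with no noise.
densities = [8.0, 3.0, 8.0]
estimates = estimate_expression(membership, densities)
print(estimates)  # close to [3.0, 5.0]
```

Because the objective is convex and the feasible region is a simple orthant, a projected gradient loop is enough for a toy case like this; a real implementation would more likely hand the problem to a dedicated QP or non-negative least-squares solver.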

Figures

FIG. 1.

Expressed segments. Every exon-intron boundary introduces a boundary of some segment. Every expressed segment is a part of an exon.

FIG. 2.

(Left) A paired-end read consisting of two short reads of length _L_2 that are separated by a gap. (Right) Three consecutive intervals on an isoform.

FIG. 3.

The flow of data processing in algorithm IsoInfer.

FIG. 4.

Comparison of the accuracies of different methods in estimating isoform expression levels. The _y_-axis shows the percentage of isoforms whose estimated/calculated expression levels are within a certain relative difference range from the truth. 10 million reads (left) and 80 million reads (right) are sampled in each of the figures.

FIG. 5.

The sensitivity (top left), effective sensitivity (top right) and precision (bottom left) of IsoInfer on genes with a certain number of isoforms when different distributions of expression levels are generated. (Bottom right) Sensitivity of IsoInfer on different expression levels when different distributions of expression level are applied. In the graph, the expression levels are $\log_2$ transformed: expression level $x$ corresponds to $25 \cdot 2^{x}$ RPKM. The vertical line marks $x = -3$, that is, $25/8 = 3.125$ RPKM.
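As a quick check of the axis scaling described in this caption, the following sketch converts a log-scale expression level to RPKM. It assumes the mapping is 25 times 2 to the power x, which is the reading consistent with the stated 3.125 RPKM value; the function name `level_to_rpkm` is hypothetical.

```python
def level_to_rpkm(x, base_rpkm=25.0):
    """Convert a log2-scale expression level x to RPKM, assuming RPKM = 25 * 2**x."""
    return base_rpkm * 2.0 ** x


print(level_to_rpkm(0))   # 25.0
print(level_to_rpkm(-3))  # 3.125, the position of the vertical line
```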

FIG. 6.

The sensitivity (top left), effective sensitivity (top right) and precision (bottom left) of IsoInfer on genes with a certain number of isoforms when different combinations of type I, II, and III data are provided. (Bottom right) Sensitivity of IsoInfer on different expression levels when different combinations of type I, II, and III data are used. Again, the expression levels are $\log_2$ transformed: expression level $x$ corresponds to $25 \cdot 2^{x}$ RPKM. The vertical line marks $x = -3$, that is, $25/8 = 3.125$ RPKM.

FIG. 7.

The sensitivity and precision of IsoInfer when α is set to different values.
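For readers unfamiliar with the two metrics reported in these figures, the sketch below computes them under the usual definitions, which are assumed here rather than taken from the paper: sensitivity is the fraction of true isoforms that were recovered, and precision is the fraction of inferred isoforms that are true. Representing an isoform as a tuple of exon identifiers, and the name `sensitivity_precision`, are illustrative choices.

```python
def sensitivity_precision(true_isoforms, inferred_isoforms):
    """Return (sensitivity, precision) for two collections of hashable isoforms."""
    true_set = set(true_isoforms)
    inferred_set = set(inferred_isoforms)
    correct = len(true_set & inferred_set)
    sensitivity = correct / len(true_set) if true_set else 0.0
    precision = correct / len(inferred_set) if inferred_set else 0.0
    return sensitivity, precision


# Invented example: each isoform is a tuple of exon identifiers.
truth = [("e1", "e2", "e3"), ("e1", "e3"), ("e2", "e3"), ("e1", "e2")]
inferred = [("e1", "e2", "e3"), ("e1", "e3"), ("e3",)]
sens, prec = sensitivity_precision(truth, inferred)
print(sens)  # 0.5
print(prec)
```

Here two of the four true isoforms are recovered, and two of the three inferred isoforms are correct.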
