Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation - PubMed

Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation

Jingyi Jessica Li et al. Proc Natl Acad Sci U S A. 2011.

Abstract

Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to use RNA-Seq data to assemble full-length mRNA isoforms de novo and to estimate isoform abundance. However, for genes with more than a few exons, the problem is challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has greater power to detect novel isoforms. Another advantage of SLIDE is its flexibility in incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. After isoform discovery, SLIDE then uses the same linear model to estimate the abundance of the discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.
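
The two-stage idea described in the abstract (sparse fit for discovery, then a refit for abundance) can be illustrated with a toy example. The sketch below is not the SLIDE implementation. It assumes a gene with three subexons, treats every non-empty subset of subexons as a candidate isoform, and uses a simplified design matrix in which each isoform spreads its reads uniformly over its subexon and junction bins. It also substitutes a plain non-negative Lasso for SLIDE's modified Lasso, because the abstract does not say what the modification is. All function names, the penalty `lam`, and the `threshold` are hypothetical choices made for illustration.

```python
import itertools
import numpy as np


def candidate_isoforms(n_subexons):
    """Enumerate every non-empty subset of subexons as a candidate isoform."""
    return [c for k in range(1, n_subexons + 1)
            for c in itertools.combinations(range(n_subexons), k)]


def design_matrix(isoforms, n_subexons):
    """Toy design matrix: rows are bins, columns are candidate isoforms.

    Bins are the subexons plus every subexon pair (a possible junction).
    Each column spreads probability uniformly over the bins its isoform can
    produce. This is a simplifying assumption, not SLIDE's sampling model.
    """
    pairs = list(itertools.combinations(range(n_subexons), 2))
    bin_index = {(i,): i for i in range(n_subexons)}
    bin_index.update({p: n_subexons + r for r, p in enumerate(pairs)})
    F = np.zeros((len(bin_index), len(isoforms)))
    for j, iso in enumerate(isoforms):
        bins = [(i,) for i in iso] + list(zip(iso, iso[1:]))
        for b in bins:
            F[bin_index[b], j] = 1.0 / len(bins)
    return F


def nonneg_lasso(F, b, lam, n_sweeps=2000):
    """Minimise 0.5*||b - F beta||^2 + lam*sum(beta) subject to beta >= 0
    using cyclic coordinate descent."""
    beta = np.zeros(F.shape[1])
    col_sq = (F ** 2).sum(axis=0)
    resid = b - F @ beta
    for _ in range(n_sweeps):
        for j in range(F.shape[1]):
            resid += F[:, j] * beta[j]   # take column j out of the fit
            beta[j] = max(0.0, (F[:, j] @ resid - lam) / col_sq[j])
            resid -= F[:, j] * beta[j]   # put it back with the updated value
    return beta


def discover_and_quantify(F, b, isoforms, lam=1e-3, threshold=0.05):
    """Stage 1: sparse fit selects isoforms. Stage 2: unpenalised refit."""
    beta = nonneg_lasso(F, b, lam)
    keep = [j for j in range(len(isoforms)) if beta[j] > threshold]
    refit = nonneg_lasso(F[:, keep], b, lam=0.0)
    refit = refit / refit.sum()          # report proportions
    return {isoforms[j]: float(p) for j, p in zip(keep, refit)}


# Noise-free toy data: two true isoforms out of seven candidates.
isoforms = candidate_isoforms(3)
F = design_matrix(isoforms, 3)
truth = {(0, 1, 2): 0.6, (0, 2): 0.4}
p_true = np.array([truth.get(iso, 0.0) for iso in isoforms])
b = F @ p_true                           # expected bin proportions
result = discover_and_quantify(F, b, isoforms)
print(result)
```

With seven candidate isoforms and only six bins, the unconstrained least-squares problem has no unique solution, which is a small-scale version of the identifiability issue the abstract mentions. In this toy setting the non-negativity constraint and the penalty together single out a sparse solution; real data would add sampling noise and require a more careful choice of penalty and threshold.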

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.

(A) Definition of subexons: transcribed regions between adjacent alternative splicing sites. (B) A two-exon mRNA transcript. s_1, e_1, s_2, and e_2, genomic positions associated with a paired-end read; r, the read end length; L_1 and L_2, the exon lengths.

Fig. 2.

Isoform discovery results. (A) Precision and recall rates of SLIDE on 50 simulated datasets, with different colors for groups of genes with n subexons (n = 3, ..., 10) and each point representing the average precision and recall rates of one group on one dataset. (B) Precision and recall rates of SLIDE (using annotated genes/exons) and Cufflinks on dataset 1. Numbers, group indices of genes (i.e., numbers of subexons); squares/stars, SLIDE/Cufflinks results. (C) Precision and recall rates of SLIDE (using Cufflinks assembled genes/exons) and Cufflinks on dataset 1.

Fig. 3.

Abundance estimation results. (A) p vs. median p̂ of 798 isoforms on 50 simulated datasets. p, true isoform proportion; median p̂, median of the 50 estimated isoform proportions. (B) SLIDE vs. SIIER estimates of the 798 isoforms on dataset 1. (C) SLIDE vs. Cufflinks estimates of the 798 isoforms on dataset 1.

Fig. 4.

Miscellaneous effects. (A) Precision and recall rates of SLIDE on 37 bp and 76 bp paired-end RNA-Seq data (datasets 2–3). (B) Precision and recall rates of SLIDE on dataset 4 with paired-end data only (squares), single-end data only (stars), and both (diamonds).

Cited by

Transcriptomic and metabolic analysis unveils the mechanism behind leaf color development in Disanthus cercidifolius var. longipes.
Tian X, Xiang G, Lv H, Zhu L, Peng J, Li G, Mou C. Front Mol Biosci. 2024 Feb 6;11:1343123. doi: 10.3389/fmolb.2024.1343123. eCollection 2024. PMID: 38380429. Free PMC article.
A safety framework for flow decomposition problems via integer linear programming.
Dias FHC, Cáceres M, Williams L, Mumey B, Tomescu AI. Bioinformatics. 2023 Nov 1;39(11):btad640. doi: 10.1093/bioinformatics/btad640. PMID: 37862229. Free PMC article.
Efficient Minimum Flow Decomposition via Integer Linear Programming.
Dias FHC, Williams L, Mumey B, Tomescu AI. J Comput Biol. 2022 Nov;29(11):1252-1267. doi: 10.1089/cmb.2022.0257. Epub 2022 Oct 18. PMID: 36260412. Free PMC article.
Techniques for Profiling the Cellular Immune Response and Their Implications for Interventional Oncology.
Garg T, Weiss CR, Sheth RA. Cancers (Basel). 2022 Jul 26;14(15):3628. doi: 10.3390/cancers14153628. PMID: 35892890. Free PMC article. Review.
Modern Approaches for Transcriptome Analyses in Plants.
Riaño-Pachón DM, Espitia-Navarro HF, Riascos JJ, Margarido GRA. Adv Exp Med Biol. 2021;1346:11-50. doi: 10.1007/978-3-030-80352-0_2. PMID: 35113394.

