Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation
Jingyi Jessica Li et al. Proc Natl Acad Sci U S A. 2011.
Abstract
Since the inception of next-generation mRNA sequencing (RNA-Seq) technology, various attempts have been made to use RNA-Seq data to assemble full-length mRNA isoforms de novo and to estimate isoform abundance. However, for genes with more than a few exons, the problem tends to be challenging and often involves identifiability issues in statistical modeling. We have developed a statistical method called "sparse linear modeling of RNA-Seq data for isoform discovery and abundance estimation" (SLIDE) that takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to be present in an RNA-Seq sample. SLIDE is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with deterministic isoform assembly algorithms (e.g., Cufflinks), SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power to detect novel isoforms. Another advantage of SLIDE is its flexibility in incorporating other transcriptomic data such as RACE, CAGE, and EST into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE sequentially uses the same linear model to estimate the abundance of discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation. The SLIDE software package is available at https://sites.google.com/site/jingyijli/SLIDE.zip.
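As a rough illustration of why a sparsity penalty helps when a linear model is unidentifiable, the sketch below fits a plain non-negative L1-penalized least squares problem by coordinate descent. It is not SLIDE's modified Lasso and does not use SLIDE's design matrix: the function name `nonneg_lasso`, the toy 0/1 matrix `F` (subexons by candidate isoforms), the observation vector `b`, and the penalty `lam` are all assumptions made for this example.

```python
def nonneg_lasso(F, b, lam, n_iter=500):
    """Minimize 0.5*||b - F p||^2 + lam*sum(p) subject to p >= 0
    using cyclic coordinate descent (illustrative, pure Python)."""
    n_rows, n_cols = len(F), len(F[0])
    p = [0.0] * n_cols
    residual = list(b)  # equals b - F p for the starting point p = 0
    col_sq = [sum(F[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_iter):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(F[i][j] * residual[i] for i in range(n_rows)) + col_sq[j] * p[j]
            new_pj = max(0.0, (rho - lam) / col_sq[j])  # shrink, then clip at zero
            delta = new_pj - p[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= F[i][j] * delta
                p[j] = new_pj
    return p


# Hypothetical toy gene with 3 subexons and 4 candidate isoforms.
# Column j marks which subexons candidate isoform j contains:
# {1,2,3}, {1,3}, {1,2}, {2,3}.
F = [
    [1, 1, 1, 0],  # subexon 1
    [1, 0, 1, 1],  # subexon 2
    [1, 1, 0, 1],  # subexon 3
]
# Noise-free observations generated from proportions (0.6, 0.4, 0, 0).
b = [1.0, 0.6, 1.0]

p_hat = nonneg_lasso(F, b, lam=0.01)
print([round(x, 4) for x in p_hat])
```

With three observations and four candidate isoforms, ordinary least squares has infinitely many exact solutions here; the L1 penalty selects the one that uses the fewest isoforms, at the cost of a small downward bias in the estimated proportions.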
Conflict of interest statement
The authors declare no conflict of interest.
Figures
Fig. 1.
(A) Definition of subexons: transcribed regions between adjacent alternative splicing sites. (B) A two-exon mRNA transcript. $s_1$, $e_1$, $s_2$, and $e_2$, genomic positions associated with a paired-end read. $r$, the read end length; $L_1$ and $L_2$, the exon lengths.
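The subexon definition in panel A can be sketched in a few lines. This is a minimal illustration, assuming exons are given as half-open `(start, end)` genomic intervals pooled across all known isoforms of a gene; the function name `subexons` and the coordinates are hypothetical and not taken from the SLIDE package.

```python
def subexons(exons):
    """Split a gene's exons into subexons: the transcribed segments
    between adjacent exon boundaries (splice sites).

    exons: iterable of half-open (start, end) intervals.
    Returns a sorted list of half-open (start, end) subexon intervals.
    """
    exons = list(exons)
    # Every exon start or end is a potential subexon boundary.
    boundaries = sorted({pos for start, end in exons for pos in (start, end)})
    pieces = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep a segment only if at least one exon covers it entirely,
        # i.e. the segment is transcribed rather than intronic.
        if any(start <= left and right <= end for start, end in exons):
            pieces.append((left, right))
    return pieces


# Two alternative 5' splice sites for the first exon and
# two alternative 3' ends for the second exon (made-up coordinates).
example = [(100, 200), (150, 200), (300, 400), (300, 350)]
print(subexons(example))
# [(100, 150), (150, 200), (300, 350), (350, 400)]
```

Each isoform can then be written as a subset of subexon indices, which is the representation the later precision and recall figures group genes by.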
Fig. 2.
Isoform discovery results. (A) Precision and recall rates of SLIDE on 50 simulated datasets, with different colors for groups of genes with n subexons (n = 3,⋯,10) and every point representing the average precision and recall rates of every group on one dataset. (B) Precision and recall rates of SLIDE (using annotated genes/exons) and Cufflinks on dataset 1. Numbers, group indices of genes (i.e., numbers of subexons); squares/stars, SLIDE/Cufflinks results. (C) Precision and recall rates of SLIDE (using Cufflinks assembled genes/exons) and Cufflinks on dataset 1.
Fig. 3.
Abundance estimation results. (A) True isoform proportion p vs. the median estimated proportion for 798 isoforms on 50 simulated datasets; the median is taken over the 50 estimated isoform proportions. (B) SLIDE vs. SIIER estimates of the 798 isoforms on dataset 1. (C) SLIDE vs. Cufflinks estimates of the 798 isoforms on dataset 1.
Fig. 4.
Miscellaneous effects. (A) Precision and recall rates of SLIDE on 37 bp and 76 bp paired-end RNA-Seq data (datasets 2–3). (B) Precision and recall rates of SLIDE on dataset 4 with paired-end data only (squares), single-end data only (stars), and both (diamonds).
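Figs. 2 and 4 report precision and recall for isoform discovery. The sketch below shows one common way to compute these rates, assuming an isoform counts as correctly discovered only when its full set of subexons matches a true isoform exactly; the matching criterion used in the paper may differ, and the function name and toy isoforms are hypothetical.

```python
def precision_recall(discovered, truth):
    """Precision and recall of discovered isoforms against true isoforms.

    Each isoform is a tuple of subexon indices; a discovered isoform is a
    true positive only if it matches a true isoform exactly (assumption).
    """
    discovered, truth = set(discovered), set(truth)
    true_positives = len(discovered & truth)
    precision = true_positives / len(discovered) if discovered else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy gene with 3 subexons: two true isoforms, three discovered ones.
truth = [(1, 2, 3), (1, 3)]
discovered = [(1, 2, 3), (1, 2), (2, 3)]
precision, recall = precision_recall(discovered, truth)
print(precision, recall)  # one of three discoveries is correct; one of two true isoforms is found
```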
Similar articles
- NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data.
Ye Y, Li JJ. BMC Genomics. 2016 Jan 11;17 Suppl 1(Suppl 1):11. doi: 10.1186/s12864-015-2304-8. PMID: 26818007 Free PMC article.
- Freddie: annotation-independent detection and discovery of transcriptomic alternative splicing isoforms using long-read sequencing.
Orabi B, Xie N, McConeghy B, Dong X, Chauve C, Hach F. Nucleic Acids Res. 2023 Jan 25;51(2):e11. doi: 10.1093/nar/gkac1112. PMID: 36478271 Free PMC article.
- Improving RNA-Seq expression estimation by modeling isoform- and exon-specific read sequencing rate.
Liu X, Shi X, Chen C, Zhang L. BMC Bioinformatics. 2015 Oct 16;16:332. doi: 10.1186/s12859-015-0750-6. PMID: 26475308 Free PMC article.
- IsoformEx: isoform level gene expression estimation using weighted non-negative least squares from mRNA-Seq data.
Kim H, Bi Y, Pal S, Gupta R, Davuluri RV. BMC Bioinformatics. 2011 Jul 27;12:305. doi: 10.1186/1471-2105-12-305. PMID: 21794104 Free PMC article.
- Single-cell RNAseq for the study of isoforms-how is that possible?
Arzalluz-Luque Á, Conesa A. Genome Biol. 2018 Aug 10;19(1):110. doi: 10.1186/s13059-018-1496-z. PMID: 30097058 Free PMC article. Review.
Cited by
- Transcriptomic and metabolic analysis unveils the mechanism behind leaf color development in Disanthus cercidifolius var. longipes.
Tian X, Xiang G, Lv H, Zhu L, Peng J, Li G, Mou C. Front Mol Biosci. 2024 Feb 6;11:1343123. doi: 10.3389/fmolb.2024.1343123. eCollection 2024. PMID: 38380429 Free PMC article.
- A safety framework for flow decomposition problems via integer linear programming.
Dias FHC, Cáceres M, Williams L, Mumey B, Tomescu AI. Bioinformatics. 2023 Nov 1;39(11):btad640. doi: 10.1093/bioinformatics/btad640. PMID: 37862229 Free PMC article.
- Efficient Minimum Flow Decomposition via Integer Linear Programming.
Dias FHC, Williams L, Mumey B, Tomescu AI. J Comput Biol. 2022 Nov;29(11):1252-1267. doi: 10.1089/cmb.2022.0257. Epub 2022 Oct 18. PMID: 36260412 Free PMC article.
- Techniques for Profiling the Cellular Immune Response and Their Implications for Interventional Oncology.
Garg T, Weiss CR, Sheth RA. Cancers (Basel). 2022 Jul 26;14(15):3628. doi: 10.3390/cancers14153628. PMID: 35892890 Free PMC article. Review.
- Modern Approaches for Transcriptome Analyses in Plants.
Riaño-Pachón DM, Espitia-Navarro HF, Riascos JJ, Margarido GRA. Adv Exp Med Biol. 2021;1346:11-50. doi: 10.1007/978-3-030-80352-0_2. PMID: 35113394
Grants and funding
- IK6 RX003836/RX/RRD VA/United States
- HG004695/HG/NHGRI NIH HHS/United States
- S10 RR023707/RR/NCRR NIH HHS/United States
- EY019094/EY/NEI NIH HHS/United States
- R21 EY019094/EY/NEI NIH HHS/United States
- HG005639/HG/NHGRI NIH HHS/United States
- U01 HG004695/HG/NHGRI NIH HHS/United States
- RC2 HG005639/HG/NHGRI NIH HHS/United States