MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples

Jonas Behr et al. Bioinformatics. 2013.

Abstract

Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.

Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques leads to significant improvements over the state-of-the-art in transcriptome reconstruction.

Availability: MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.

Figures

Fig. 1.

Splicing graph generation from aligned RNA-Seq reads: 1. Segment identification: Given a genomic region, we construct splicing graphs by generating a list of segment boundaries. Boundaries are either splice sites (SS), depicted as dashed vertical lines, or potential transcription start sites (TSS) and termination sites (TTS), both depicted as solid vertical lines. Potential SS positions can originate from spliced reads (e.g. between segments 4 and 5) or annotated transcripts. Analogously, TSS and TTS can stem from annotated transcripts or from potential transcript end positions (e.g. between segments 2 and 3 as well as 13 and 14). See Supplementary Section B for more details. 2. Exon identification: We keep segments (i) that have >5% of their nucleotides covered, (ii) that are part of annotated transcripts or (iii) whose removal would leave no path between two segments connected by paired-end reads (if available). 3. Intron identification: We connect segments based on spliced reads and annotated introns
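The exon identification step in this caption can be illustrated with a minimal Python sketch. It covers only criteria (i) and (ii); the paired-end connectivity criterion (iii) is omitted. The function names, the per-nucleotide coverage list and the `annotated` flag are hypothetical choices for illustration and are not taken from the MITIE C++ implementation.

```python
def covered_fraction(coverage):
    """Fraction of nucleotides in a segment with at least one read (hypothetical helper)."""
    if not coverage:
        return 0.0
    return sum(1 for depth in coverage if depth > 0) / len(coverage)


def keep_segment(coverage, annotated=False, min_fraction=0.05):
    """Keep a segment if it is annotated or if more than 5% of its nucleotides are covered.

    Assumption: criterion (iii) from the caption (paired-end connectivity)
    is not modelled here.
    """
    return annotated or covered_fraction(coverage) > min_fraction
```

For example, `keep_segment([1] * 6 + [0] * 94)` keeps a 100-nucleotide segment with six covered positions, while a segment with exactly five covered positions is dropped unless it is annotated.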

Fig. 2.

Illustration of the core optimization problem of MITIE. The transcript matrix U (bottom left) and abundance matrix W (bottom center) are optimized such that the implied expected read coverage of the k valid transcripts (bottom right) matches the observed coverage (top right) well. Validity of the transcripts is ensured by appropriate constraints derived from the segment graph (top left). We illustrate the case of two samples. For each sample, we have abundance estimates W for each of the k = 4 transcripts. The identity of the transcripts, i.e. the rows of U, is shared among the samples. Following Occam's razor, we implement a trade-off between the loss between the observed and expected coverages and the number of transcripts used, i.e. the number of rows in W with non-zero abundances
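As a rough illustration of the trade-off described in this caption, the following Python sketch evaluates an objective for a fixed U and W. It is not the MITIE solver: it performs no mixed integer optimization and enforces no splicing-graph constraints. The negative binomial parameterization (mean plus a fixed `dispersion`), the `penalty` weight and all function names are assumptions made for illustration.

```python
import math


def expected_coverage(U, W):
    """Expected coverage per sample and segment.

    U: k x S matrix (1 if transcript t contains segment s, else 0).
    W: k x R matrix of non-negative abundances (transcript t in sample r).
    Returns an R x S matrix with entries sum_t W[t][r] * U[t][s].
    """
    k, n_segments, n_samples = len(U), len(U[0]), len(W[0])
    return [
        [sum(W[t][r] * U[t][s] for t in range(k)) for s in range(n_segments)]
        for r in range(n_samples)
    ]


def nb_neg_log_lik(observed, mean, dispersion=0.1):
    """Negative binomial negative log-likelihood (assumed parameterization:
    variance = mean + dispersion * mean**2)."""
    mu = max(mean, 1e-9)  # guard against log(0) for unexpressed segments
    size = 1.0 / dispersion
    log_p = (
        math.lgamma(observed + size) - math.lgamma(size) - math.lgamma(observed + 1)
        + size * math.log(size / (size + mu))
        + observed * math.log(mu / (size + mu))
    )
    return -log_p


def objective(U, W, observed, penalty=1.0):
    """Coverage loss plus a penalty per transcript with non-zero abundance in any sample."""
    expected = expected_coverage(U, W)
    loss = sum(
        nb_neg_log_lik(o, e)
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    used = sum(1 for row in W if any(w > 0 for w in row))
    return loss + penalty * used
```

Because the rows of U are shared across samples, a transcript incurs the penalty once regardless of how many samples express it, which is one way to read the caption's statement that transcript identity is shared among samples.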

Fig. 3.

(A) Example with four samples of simulated reads. All four samples express the same four transcripts (marked with asterisks) with different relative abundances. The different relative abundances lead to distinct coverage patterns in the alternative regions. (B) We randomly selected 2 (top), 3 (middle) and 4 (bottom) transcripts and simulated RNA-Seq reads for four samples each. For each sample, we uniformly redistributed the abundance between the selected transcripts. We then predicted transcripts with different methods. A prediction was counted as correct if all transcripts were exactly matched and no additional transcripts were predicted. To obtain more robust measurements, we repeated the whole procedure 50 times and report the mean number of correct predictions for each method
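The correctness criterion in panel (B) amounts to set equality between predicted and true transcripts. A minimal Python sketch follows, assuming each transcript is represented as a tuple of (start, end) exon intervals; this representation and the function names are hypothetical and not taken from the paper.

```python
def is_correct_prediction(predicted, truth):
    """True only if every true transcript is matched exactly and nothing extra is predicted."""
    return set(predicted) == set(truth)


def fraction_correct(runs):
    """Share of repetitions with a fully correct prediction.

    runs: list of (predicted, truth) pairs, one per repetition.
    """
    if not runs:
        return 0.0
    return sum(is_correct_prediction(p, t) for p, t in runs) / len(runs)
```

Under this criterion, a prediction that recovers all true transcripts but adds one spurious isoform counts as incorrect, which makes the measure stricter than per-transcript sensitivity.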

Fig. 4.

(A) MITIE quantification results for the three different loss functions formula image and -loss. We consider stringent (0 mismatches) and liberal read alignments (up to 5 mismatches), leading to fewer or more multi-mapping reads, respectively. (B) MITIE quantification results with -loss, when considering ground truth alignments, all multiple alignments, or after multi-mapper handling with MMO (see Section 3.9)

Fig. 5.

(A) Transcript-level F-score as a function of the number of samples for the simulated human dataset. (B) Transcript-level F-score as a function of the number of modENCODE samples for up to seven developmental stages of D. melanogaster
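For readers unfamiliar with the metric, a transcript-level F-score can be sketched in Python as below. The sketch assumes the common definition, the harmonic mean of precision and recall over exactly matched transcripts; the abstract does not spell out the matching rule or the exact formula used in the paper, so both are assumptions here.

```python
def transcript_f_score(predicted, truth):
    """Harmonic mean of precision and recall over exactly matching transcripts (assumed definition)."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting transcripts `{"a", "b", "c"}` against the truth `{"a", "b", "d"}` gives precision and recall of 2/3 each, so the F-score is also 2/3.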

Fig. 6.

Circles show a subset of Trinity model selection runs. We selected the best performing predictions for different trade-offs of sensitivity and specificity. We ran MITIE predictions on the de Bruijn graphs generated by Trinity. Dotted lines connect the corresponding predictions

Cited by

Accurate assembly of multiple RNA-seq samples with Aletsch.
Shi Q, Zhang Q, Shao M. Bioinformatics. 2024 Jun 28;40(Suppl 1):i307-i317. doi: 10.1093/bioinformatics/btae215. PMID: 38940157.
Inference of 3D genome architecture by modeling overdispersion of Hi-C data.
Varoquaux N, Noble WS, Vert JP. Bioinformatics. 2023 Jan 1;39(1):btac838. doi: 10.1093/bioinformatics/btac838. PMID: 36594573.
Jumper enables discontinuous transcript assembly in coronaviruses.
Sashittal P, Zhang C, Peng J, El-Kebir M. Nat Commun. 2021 Nov 18;12(1):6728. doi: 10.1038/s41467-021-26944-y. PMID: 34795232.
TransBorrow: genome-guided transcriptome assembly by borrowing assemblies from different assemblers.
Yu T, Mu Z, Fang Z, Liu X, Gao X, Liu J. Genome Res. 2020 Aug;30(8):1181-1190. doi: 10.1101/gr.257766.119. Epub 2020 Aug 17. PMID: 32817072.
AIDE: annotation-assisted isoform discovery with high precision.
Li WV, Li S, Tong X, Deng L, Shi H, Li JJ. Genome Res. 2019 Dec;29(12):2056-2072. doi: 10.1101/gr.251108.119. Epub 2019 Nov 6. PMID: 31694868.
