Near-optimal probabilistic RNA-seq quantification (original) (raw)

Change history

In the version of this article initially published, in the HTML version only, the equation “α t N > 0.01” was written as “α tN > 0.01.” In addition, in the Figure 1 legend, the formatting of the nodes was incorrect (v_1, etc., rather than _v_1). The errors have been corrected in the HTML and PDF versions of the article.

References

  1. Kim, D. et al. Genome Biol. 14, R36 (2013).
    Article Google Scholar
  2. Trapnell, C. et al. Nat. Biotechnol. 28, 511–515 (2010).
    Article CAS Google Scholar
  3. Roberts, A. & Pachter, L. Nat. Methods 10, 71–73 (2013).
    Article CAS Google Scholar
  4. Anders, S., Pyl, P.T. & Huber, W. Bioinformatics 31, 166–169 (2015).
    Article CAS Google Scholar
  5. Patro, R., Mount, S.M. & Kingsford, C. Nat. Biotechnol. 32, 462–464 (2014).
    Article CAS Google Scholar
  6. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Nat. Methods 5, 621–628 (2008).
    Article CAS Google Scholar
  7. Nicolae, M., Mangul, S., Măndoiu, I. & Zelikovsky, A. in Algorithms in Bioinformatics (eds. Moulton, V. & Singh, M.) 202–214 (Springer, 2010).
  8. Compeau, P.E.C., Pevzner, P.A. & Tesler, G. Nat. Biotechnol. 29, 987–991 (2011).
    Article CAS Google Scholar
  9. Li, B. & Dewey, C.N. BMC Bioinformatics 12, 323 (2011).
    Article CAS Google Scholar
  10. SEQC/MAQC-III Consortium. Nat. Biotechnol. 32, 903–914 (2014).
  11. Lappalainen, T. et al. Nature 501, 506–511 (2013).
    Article CAS Google Scholar
  12. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L. & Pachter, L. Genome Biol. 12, R22 (2011).
    Article CAS Google Scholar
  13. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M. & Gilad, Y. Genome Res. 18, 1509–1517 (2008).
    Article CAS Google Scholar
  14. Wold, B. & Myers, R.M. Nat. Methods 5, 19–21 (2008).
    Article CAS Google Scholar
  15. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. Nat. Genet. 44, 226–232 (2012).
    Article CAS Google Scholar
  16. Lee, S., Seo, C.H., Alver, B.H., Lee, S. & Park, P.J. BMC Bioinformatics 16, 278 (2015).
    Article Google Scholar
  17. Köster, J. & Rahmann, S. Bioinformatics 28, 2520–2522 (2012).
    Article Google Scholar

Download references

Acknowledgements

N.L.B., H.P. and L.P. were partially funded by NIH R01 HG006129. P.M. was partially funded by a Fulbright fellowship.

Author information

Authors and Affiliations

  1. Innovative Genomics Initiative, University of California, Berkeley, California, USA.,
    Nicolas L Bray
  2. Department of Computer Science, University of California, Berkeley, California, USA
    Harold Pimentel & Lior Pachter
  3. Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland
    Páll Melsted
  4. Department of Mathematics, University of California, Berkeley, California, USA
    Lior Pachter
  5. Department of Molecular & Cell Biology, University of California, Berkeley, California, USA.,
    Lior Pachter

Authors

  1. Nicolas L Bray
    You can also search for this author inPubMed Google Scholar
  2. Harold Pimentel
    You can also search for this author inPubMed Google Scholar
  3. Páll Melsted
    You can also search for this author inPubMed Google Scholar
  4. Lior Pachter
    You can also search for this author inPubMed Google Scholar

Contributions

N.L.B. and L.P. developed the concept of pseudoalignment and conceived the idea for applying it to RNA-seq quantification. P.M. conceived the implementation using De Bruijn graphs. N.L.B., H.P., P.M. and L.P. designed the kallisto software and N.L.B. implemented a prototype. H.P. and P.M. wrote the current kallisto implementation. N.B. and H.P. automated production of the results. N.L.B., H.P., P.M. and L.P. analyzed results and wrote the paper.

Corresponding author

Correspondence toLior Pachter.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Median relative difference for abundance estimates using varying values of k.

Median relative difference for abundance estimates using varying values of k on a dataset of 30 million 75bp paired-end reads that were simulated without errors. The “_k_-mers method” uses the _k_-compatibility of each _k_-mer independently and runs the EM algorithm on _k_-mers, whereas kallisto uses the intersection of _k_-compatibility classes across both ends of a read. Even for _k_=75, the full read length in the simulation, independent use of _k_-mers results in a significant drop in accuracy due to the loss of paired-end information.

Accuracy of kallisto, Cufflinks, Sailfish, eXpress and RSEM on 20 RSEM simulations of 30 million 75bp paired-end reads based on the TPM estimates and error profile of Geuvadis sample NA12716 (selected for its depth of sequencing). For each simulation we report the accuracy as the median relative difference in the estimated TPM value of each transcript. The values reported are means across the 20 simulations (the variance was too small for this plot). Relative difference is defined as the absolute difference between the estimated TPM values and the ground truth divided by the average of the two.

Supplementary Figure 3 Performance of different quantification programs on the set of paralogs in the human genome.

Performance of different quantification programs on the set of paralogs in the human genome supplied by the Duplicated Genes Database (http://dgd.genouest.org). This set includes 8,636 transcripts in 3,163 genes.

Supplementary Figure 4 Count distribution of one simulation.

Count distribution of one simulation. The left panel contains the transcripts used in Supplementary Figure 3. The right panel contains the remaining transcripts. The x-axis is on the log scale. Both distributions appear very similar, suggesting that the drop in performance in Supplementary Figure 3 is from sequence similarity and not oddities in the distribution such as very low counts.

Supplementary Figure 5 Comparison of technical variance in abundances.

The data comes from a single library with 216M, 101bp paired-end reads sequenced. Each point corresponds to a transcript and is colored by the decile of its expression level in the single bootstrapped subsample. The Y-axis represents variance of abundance estimates across 40 subsamples, with 30M reads in each subsample. The X-axis represents variance as computed from 40 bootstraps of a single subsampled dataset of 30M reads. The red lines emanating from the lower left corner consist of transcripts that have an estimated abundance of zero in the single bootstrapped experiment, but show expression in some of the subsamples (12968 transcripts), and vice versa (720 transcripts).

Supplementary Figure 6 Median relative error (with respect to 1,000 bootstraps) of inferred transcript variances.

Median relative error (with respect to 1000 bootstraps) of inferred transcript variances as a function of number of bootstrap samples performed. The relative error with 40 bootstraps (red line) is 7.8%.

Supplementary Figure 7 Relationship between the mean and variance of estimated counts from subsamples.

Relationship between the mean and variance of estimated counts for each transcript (x and y axes are on log scale) based on 40 subsamples of 30M reads from a dataset of 216M PE reads. The _x_-axis is the mean of each count estimate calculated across the subsamples. The _y-_axis is the variance of the count estimates calculated across subsamples.

Supplementary Figure 8 Relationship between the mean and variance of estimated counts from bootstraps.

Relationship between the mean and variance of estimated counts for each transcript (x and y axes are on log scale) based on 40 bootstraps of a single subsample of 30M reads from the same 216M PE read dataset. The _x_-axis is the mean of the count estimates calculated across the 40 bootstraps. The _y_-axis is the variance of the count estimates calculated across the 40 bootstraps.

Supplementary Figure 9 Median relative difference from 30 million 75-bp PE reads simulated with error for different values of k.

Median relative difference from 30M 75bp PE reads simulated with error for different values of k. The “_k_-mers method” uses the _k_-compatibility of each _k_-mer independently and runs the EM algorithm on _k_-mers, whereas kallisto uses the intersection of _k_-compatibility classes across both ends of reads. When there are errors in the reads, kallisto requires smaller _k-_mer lengths for robustness in pseudoalignment.

Supplementary Figure 10 Run time for index building and quantification

Run time for index building and quantification as a function of k-mer length for one of the simulated samples.

Supplementary Figure 11 The distribution of the number of _k_-mers hashed per read.

The distribution of the number of _k_-mers hashed per read for k=31. Note that for the majority of reads (61.35%) only two _k_-mers are hashed. This happens when the entire read pseudoaligns to a single contig of the T-DBG and we can skip to the end of the read. Since we also check the last _k_-mer we can skip over, the most common cases are checking 2, 4, 6, and 8 _k_-mers. Only 1.6% of reads required hashing every _k_-mer of the read.

Supplementary information

Rights and permissions

About this article

Cite this article

Bray, N., Pimentel, H., Melsted, P. et al. Near-optimal probabilistic RNA-seq quantification.Nat Biotechnol 34, 525–527 (2016). https://doi.org/10.1038/nbt.3519

Download citation