Predicting the molecular complexity of sequencing libraries

Timothy Daley et al. Nat Methods. 2013 Apr.

Abstract

Predicting the molecular complexity of a genomic sequencing library is a critical but difficult problem in modern sequencing applications. Methods to determine how deeply to sequence to achieve complete coverage, or to predict the benefits of additional sequencing, are lacking. We introduce an empirical Bayesian method to accurately characterize the molecular complexity of a DNA sample for almost any sequencing application on the basis of limited preliminary sequencing.

Figures

Figure 1

Two hypothetical libraries, each containing 10 million (M) distinct molecules. (a) In library 1, half of the molecules (5 M) are present at the same abundance and together make up 99% of the library. (b) In library 2, ten thousand molecules make up half of the material in the library. (c) Based on a shallow sequencing run (1 M reads), library 1 appears to contain a greater diversity of molecules. (d) After additional sequencing, library 2 yields more distinct observations. (e) Such situations do occur in practice. The initial observed complexity from 5 M reads for two BS-seq libraries indicates that the Human Sperm library is the more complex. The observed library complexity curves cross after additional sequencing, with the Chimp Sperm library yielding more distinct reads. Estimates from the rational function approximation (RF) and Euler’s transform (ET), fit to the initial experiments, predict the crossing (though ET becomes unstable), while the zero-truncated negative binomial (ZTNB) does not.
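
The crossing described in panels (c) and (d) can be checked with a rough calculation. The sketch below is an illustration only, not the authors' method: it assumes each read is drawn independently with probability proportional to a molecule's abundance (a Poisson approximation), and it assumes the abundance classes the caption leaves implicit, namely that the remaining 5 M molecules of library 1 share the other 1% of the material and the remaining 9.99 M molecules of library 2 share the other half. The function name and the read depths are hypothetical choices made for this example.

```python
import math


def expected_distinct(classes, reads):
    """Expected number of distinct molecules observed after `reads` reads.

    `classes` is a list of (n_molecules, fraction_of_library) pairs; every
    molecule within a class is assumed to be equally abundant.
    """
    total = 0.0
    for n_molecules, fraction in classes:
        # Expected number of reads landing on one molecule of this class.
        rate = reads * fraction / n_molecules
        # Poisson probability that a molecule is seen at least once:
        # 1 - exp(-rate), computed stably via expm1.
        total += n_molecules * (-math.expm1(-rate))
    return total


# Library 1: 5 M molecules carry 99% of the material; the other 5 M are
# assumed to share the remaining 1%.
LIBRARY_1 = [(5_000_000, 0.99), (5_000_000, 0.01)]
# Library 2: 10,000 molecules carry half of the material; the other
# 9.99 M are assumed to share the remaining half.
LIBRARY_2 = [(10_000, 0.50), (9_990_000, 0.50)]

if __name__ == "__main__":
    for reads in (1_000_000, 10_000_000, 20_000_000, 100_000_000):
        d1 = expected_distinct(LIBRARY_1, reads)
        d2 = expected_distinct(LIBRARY_2, reads)
        print(f"{reads:>11,d} reads: library 1 ~ {d1:,.0f}, library 2 ~ {d2:,.0f}")
```

Under these assumptions library 1 shows more distinct molecules at 1 M reads (roughly 0.9 M versus 0.5 M), while library 2 overtakes it somewhere between 10 M and 20 M reads, which is the behaviour panels (c) and (d) describe.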

Figure 2

Library complexity can be estimated both in terms of distinct molecules sequenced and in terms of distinct loci identified. (a) A ChIP-seq library (CTCF; mouse B cells) yields additional molecules after 100 million (M) reads have been sequenced; the rational function approximation (RF) remains accurate while the zero-truncated negative binomial (ZTNB) loses accuracy. (b) In the same library, the number of distinct mapped genomic 1 kb windows saturates after 25 M reads. The RF is accurate and forecasts saturation, while the ZTNB significantly overestimates the number of windows. (c) An RNA-seq library (human adipose-derived mesenchymal stem (ADS) cells) continues to yield additional molecules after 200 M reads; the RF remains accurate while the ZTNB predicts saturation. (d) In the same library, reads continue to map to new 300 bp windows after 200 M reads. The ZTNB incorrectly predicts saturation, while the RF does not.
