Predicting the molecular complexity of sequencing libraries - PubMed

Predicting the molecular complexity of sequencing libraries

Timothy Daley et al. Nat Methods. 2013 Apr.

Abstract

Predicting the molecular complexity of a genomic sequencing library is a critical but difficult problem in modern sequencing applications. Methods to determine how deeply to sequence to achieve complete coverage, or to predict the benefits of additional sequencing, are lacking. We introduce an empirical Bayesian method to accurately characterize the molecular complexity of a DNA sample for almost any sequencing application on the basis of limited preliminary sequencing.

Figures

Figure 1

Two hypothetical libraries, each containing 10 million (M) distinct molecules. (a) In library 1, half of the molecules (5 M) exist at the same level and make up 99% of the library. (b) In library 2, ten thousand molecules represent half the material in the library. (c) Based on a shallow sequencing run (1 M reads), library 1 appears to contain a greater diversity of molecules. (d) After additional sequencing, library 2 yields more distinct observations. (e) Such situations do occur in practice. The initial observed complexity from 5 M reads for two BS-seq libraries suggests that the Human Sperm library is the more complex of the two. The observed library complexity curves cross after additional sequencing, with the Chimp Sperm library yielding more distinct reads. Estimates from the rational function approximation (RF) and Euler's transform (ET), fit to the initial experiments, predict the crossing (though ET becomes unstable), while the zero-truncated negative binomial (ZTNB) does not.
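The crossing described in panels (a) to (d) can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not the method of the paper: it assumes that each molecule's read count is Poisson distributed, that the molecules within each abundance class are equally abundant, and that the remaining molecules in each library share the leftover material evenly (the caption does not state this). The function name `expected_distinct` and the class layout are hypothetical.

```python
import math

def expected_distinct(classes, n_reads):
    """Expected number of distinct molecules seen after n_reads reads.

    classes: list of (n_molecules, share_of_library) pairs; molecules in a
    class are assumed equally abundant (an assumption of this sketch).
    Each molecule's read count is modelled as Poisson with mean
    n_reads * share / n_molecules.
    """
    total = 0.0
    for n_molecules, share in classes:
        per_molecule_rate = n_reads * share / n_molecules
        # -expm1(-x) equals 1 - exp(-x): chance a molecule is read at least once
        total += n_molecules * -math.expm1(-per_molecule_rate)
    return total

# Library 1: 5 M molecules share 99% of the material, the other 5 M share 1%.
library_1 = [(5_000_000, 0.99), (5_000_000, 0.01)]
# Library 2: 10,000 molecules hold half the material, the other 9.99 M the rest.
library_2 = [(10_000, 0.50), (9_990_000, 0.50)]

for n_reads in (1_000_000, 10_000_000, 100_000_000):
    d1 = expected_distinct(library_1, n_reads)
    d2 = expected_distinct(library_2, n_reads)
    print(f"{n_reads:>11,d} reads: library 1 ~ {d1:,.0f}, library 2 ~ {d2:,.0f}")
```

Under these assumptions library 1 shows more distinct molecules at 1 M reads, while library 2 overtakes it by 100 M reads, which is the qualitative behaviour the caption describes.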

Figure 2

Library complexity can be estimated both in terms of distinct molecules sequenced and in terms of distinct loci identified. (a) A ChIP-seq library (CTCF; mouse B cells) continues to yield additional molecules after sequencing 100 million (M) reads; the rational function approximation (RF) remains accurate while the zero-truncated negative binomial (ZTNB) loses accuracy. (b) In the same library, the number of distinct mapped genomic 1 kb windows saturates after 25 M reads. RF is accurate and forecasts saturation, while ZTNB significantly overestimates. (c) An RNA-seq library (human adipose-derived mesenchymal stem (ADS) cells) continues to yield additional molecules after 200 M reads; RF remains accurate while ZTNB predicts saturation. (d) In the same library, reads continue to map to new 300 bp windows after 200 M reads. ZTNB incorrectly predicts saturation, while RF does not.
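The two notions of complexity in this caption, distinct molecules and distinct genomic windows, can be tallied from the same reads. The following sketch computes an observed complexity curve for both. It is a toy illustration under stated assumptions: reads are given as hypothetical `(chrom, start)` tuples, two reads with the same chromosome and start are treated as the same molecule, and the input here is simulated rather than real data. It shows only the observed counts, not the RF or ZTNB extrapolations.

```python
import random

def observed_complexity(reads, window_bp, step):
    """Cumulative distinct molecules and distinct windows along a read list.

    reads: list of (chrom, start) tuples in sequencing order.
    Returns a list of (reads_seen, distinct_molecules, distinct_windows),
    recorded every `step` reads and once more at the end of the list.
    """
    molecules, windows, curve = set(), set(), []
    for i, (chrom, start) in enumerate(reads, 1):
        molecules.add((chrom, start))               # same position = same molecule (assumption)
        windows.add((chrom, start // window_bp))    # fixed-width genomic bin
        if i % step == 0 or i == len(reads):
            curve.append((i, len(molecules), len(windows)))
    return curve

# Simulated toy input: 5,000 reads placed uniformly on a 200 kb region.
rng = random.Random(0)
reads = [("chr1", rng.randrange(200_000)) for _ in range(5_000)]
curve = observed_complexity(reads, window_bp=1_000, step=1_000)
for row in curve:
    print(row)
```

With a window much wider than the spacing between reads, the window count levels off long before the molecule count does, which mirrors the contrast between panels (a) and (b).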

Cited by

Chromosome-scale genome assembly of the tropical abalone (Haliotis asinina).
Barkan R, Cooke I, Watson SA, Lau SCY, Strugnell JM. Sci Data. 2024 Sep 12;11(1):999. doi: 10.1038/s41597-024-03840-w. PMID: 39266538. Free PMC article.
Temporally discordant chromatin accessibility and DNA demethylation define short and long-term enhancer regulation during cell fate specification.
Guerin LN, Scott TJ, Yap JA, Johansson A, Puddu F, Charlesworth T, Yang Y, Simmons AJ, Lau KS, Ihrie RA, Hodges E. bioRxiv [Preprint]. 2024 Aug 27:2024.08.27.609789. doi: 10.1101/2024.08.27.609789. PMID: 39253426. Free PMC article.
The CALERIE™ Genomic Data Resource.
Ryan CP, Corcoran DL, Banskota N, Eckstein IC, Floratos A, Friedman R, Kobor MS, Kraus VB, Kraus WE, MacIsaac JL, Orenduff MC, Pieper CF, White JP, Ferrucci L, Horvath S, Huffman KM, Belsky DW. bioRxiv [Preprint]. 2024 Aug 22:2024.05.17.594714. doi: 10.1101/2024.05.17.594714. PMID: 39229162. Free PMC article. Updated.
Loss of ARID3A perturbs intestinal epithelial proliferation-differentiation ratio and regeneration.
Angelis N, Baulies A, Hubl F, Kucharska A, Kelly G, Llorian M, Boeing S, Li VSW. J Exp Med. 2024 Oct 7;221(10):e20232279. doi: 10.1084/jem.20232279. Epub 2024 Aug 16. PMID: 39150450. Free PMC article.
Targeting PRMT5 enhances the radiosensitivity of tumor cells grown in vitro and in vivo.
Degorre C, Lohard S, Bobrek CN, Rawal KN, Kuhn S, Tofilon PJ. Sci Rep. 2024 Jul 27;14(1):17316. doi: 10.1038/s41598-024-68405-8. PMID: 39068290. Free PMC article.
