Editorial
Overcoming bias and systematic errors in next generation sequencing data
Margaret A Taub et al. Genome Med. 2010.
Abstract
Considerable time and effort have been spent in developing analysis and quality assessment methods to allow the use of microarrays in a clinical setting. As is the case for microarrays and other high-throughput technologies, data from new high-throughput sequencing technologies are subject to technological and biological biases and systematic errors that can impact downstream analyses. Only when these issues can be readily identified and reliably adjusted for will clinical applications of these new technologies be feasible. Although much work remains to be done in this area, we describe consistently observed biases that should be taken into account when analyzing high-throughput sequencing data. In this article, we review current knowledge about these biases, discuss their impact on analysis results, and propose solutions.
Figures
Figure 1
Effect of base-calling improvements on error bias. This figure is based on figures from Bravo and Irizarry [15]. For a site that MAQ [28] had called as a variant but that was a false positive, the authors examined how the nucleotide calls varied with the read cycle at which each call was made. (a) Results with the default base-calling software; (b) results after application of the base-calling method of Bravo and Irizarry. The x-axis shows read cycle and the colored points indicate the percentage of calls at each cycle that were made for a particular nucleotide. In (a), the letter T becomes much more frequent in reads that align to the SNP site only at later sequencing cycles, indicating a technical bias in base calls at this position, while the plot in (b) shows a strong reduction in this bias. In addition, the location is no longer called as a variant by MAQ after the improved base calling.
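The per-cycle percentages plotted in this kind of figure can be tabulated along the following lines. This is only a minimal sketch: the function name `call_percentages_by_cycle` and the toy input of `(cycle, base)` pairs are assumptions made for illustration, not the procedure or data of Bravo and Irizarry.

```python
from collections import Counter, defaultdict


def call_percentages_by_cycle(calls):
    """Tabulate base calls at one genomic site by read cycle.

    calls: iterable of (cycle, base) pairs, one per read covering the site
           (hypothetical input format).
    Returns {cycle: {base: percentage of calls at that cycle}}.
    """
    counts = defaultdict(Counter)
    for cycle, base in calls:
        counts[cycle][base] += 1

    percentages = {}
    for cycle, counter in sorted(counts.items()):
        total = sum(counter.values())
        percentages[cycle] = {
            base: 100.0 * n / total for base, n in counter.items()
        }
    return percentages


# Toy data (invented): T appears only when the site is read at late cycles.
calls = [(1, "G"), (1, "G"), (30, "G"), (30, "T"), (35, "T"), (35, "T")]
pct = call_percentages_by_cycle(calls)
```

A nucleotide whose share rises with cycle number, as T does in the toy data, is the pattern the caption describes as a technical bias rather than a true variant.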
Figure 2
Effect of mappability and GC content on coverage. (a) Mean tag counts in 50-bp bins, with error bars, from a naked DNA sample from a ChIP-Seq experiment, showing that they depend on mappability and GC content. (b) 97.4% of bins have GC content between 0.2 and 0.56 (that is, 20% to 56%), as marked by the vertical dashed lines. This figure is reproduced with permission from Kuan et al. [21].
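The GC content of fixed-width bins, as used on the horizontal axis of such a plot, could be computed roughly as follows. This is a sketch under stated assumptions: the helper names `gc_fraction` and `gc_by_bin` are hypothetical, ambiguous bases such as N are simply ignored, and mappability is not modeled at all.

```python
def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) in seq.

    Returns None if the sequence contains no unambiguous base.
    """
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


def gc_by_bin(sequence, bin_size=50):
    """GC fraction of consecutive, non-overlapping bins of bin_size bases."""
    return [
        gc_fraction(sequence[start:start + bin_size])
        for start in range(0, len(sequence), bin_size)
    ]


# Toy sequence (invented): one all-GC bin followed by one all-AT bin.
toy_sequence = "GC" * 25 + "AT" * 25
toy_gc = gc_by_bin(toy_sequence)
```

Mean tag counts can then be grouped by these per-bin values to look for the dependence on GC content that the caption describes.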
Figure 3
Batch effect for second-generation sequencing data from the 1000 Genomes Project. This figure is similar to one from Leek et al. [10]. Each row in the heat map is data from a different HapMap sample processed in the same facility with the same platform (see Leek et al. [10] for a description of the data), shown for a 3-Mb region on chromosome 16, with data summarized in 10-kb bins. Data from each bin were standardized across samples, with blue representing 3 standard deviations below average, and orange representing 3 standard deviations above average. The rows are ordered by date, with black lines separating different processing days. The largest batch effect can be seen in the alternating pattern of blue and orange on days 223 to 241 and days 244 to 251.
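The per-bin standardization described in this caption can be sketched as below. The function names, the use of the sample standard deviation, and the clipping to ±3 for the color scale are assumptions made for illustration; the exact processing is described in Leek et al. [10].

```python
from statistics import mean, stdev


def standardize_bins(coverage):
    """Standardize each bin (column) across samples (rows).

    coverage: list of rows, one per sample, each a list of per-bin values.
    Returns z-scores with the same shape; a bin with zero spread maps to 0.
    """
    n_bins = len(coverage[0])
    z = [[0.0] * n_bins for _ in coverage]
    for j in range(n_bins):
        column = [row[j] for row in coverage]
        mu = mean(column)
        sd = stdev(column)  # sample standard deviation (an assumption)
        for i, value in enumerate(column):
            z[i][j] = 0.0 if sd == 0 else (value - mu) / sd
    return z


def clip(value, limit=3.0):
    """Limit a z-score to [-limit, +limit] for a bounded color scale."""
    return max(-limit, min(limit, value))


# Toy matrix (invented): 3 samples by 2 bins.
toy_coverage = [[10, 5], [20, 5], [30, 5]]
toy_z = standardize_bins(toy_coverage)
```

Because each bin is centered and scaled across samples, a run of same-signed values along one row, or along rows processed on the same day, points to a sample-level or batch-level shift rather than a property of the genomic region.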
References
- Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martinez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W. Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005;2:345–350. doi: 10.1038/nmeth756.
- Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006;24:1151–1161. doi: 10.1038/nbt1239.