Editorial

Overcoming bias and systematic errors in next generation sequencing data

Margaret A Taub et al. Genome Med. 2010.

Abstract

Considerable time and effort have been spent developing analysis and quality assessment methods to allow the use of microarrays in a clinical setting. As is the case for microarrays and other high-throughput technologies, data from new high-throughput sequencing technologies are subject to technological and biological biases and systematic errors that can impact downstream analyses. Clinical applications of these new technologies will be feasible only when these issues can be readily identified and reliably adjusted for. Although much work remains to be done in this area, we describe consistently observed biases that should be taken into account when analyzing high-throughput sequencing data. In this article, we review current knowledge about these biases, discuss their impact on analysis results, and propose solutions.

Figures

Figure 1

Effect of base-calling improvements on error bias. This figure is based on figures from Bravo and Irizarry [15]. Choosing a site that MAQ [28] had called as a variant but that was a false positive, the authors examined the pattern of nucleotide calls according to the read cycle at which each call was made. (a) Results with the default base-calling software; (b) results after application of the base-calling method of Bravo and Irizarry. The _x_-axis shows read cycle, and the colored points indicate the percentage of calls at each cycle that were made for a particular nucleotide. In (a), T calls at the SNP site become much more frequent only at later sequencing cycles, indicating a technical bias in base calls at this position, while the plot in (b) shows a strong reduction in this bias. In addition, MAQ no longer calls the location as a variant after the improved base calling.
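The quantity plotted in Figure 1 (the percentage of calls for each nucleotide at each read cycle) can be tabulated with a few lines of Python. This is a minimal sketch, not the code used by Bravo and Irizarry: the function name `call_percentages_by_cycle`, the `(cycle, base)` input format, and the toy observations are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

BASES = "ACGT"


def call_percentages_by_cycle(observations):
    """Return {cycle: {base: percent}} for (cycle, base) calls at one site.

    `observations` is a hypothetical input format: one (read_cycle, called_base)
    pair for every read that covers the site of interest.
    """
    counts = defaultdict(Counter)
    for cycle, base in observations:
        counts[cycle][base] += 1

    percentages = {}
    for cycle in sorted(counts):
        total = sum(counts[cycle].values())
        percentages[cycle] = {
            base: 100.0 * counts[cycle][base] / total for base in BASES
        }
    return percentages


# Toy data (invented): the site is read as C early in reads,
# but mostly as T when it falls at a late cycle.
observations = [
    (1, "C"), (1, "C"), (1, "C"), (1, "C"),
    (35, "C"), (35, "T"), (35, "T"), (35, "T"),
]
percentages = call_percentages_by_cycle(observations)
```

A call distribution that shifts with read cycle, as in the toy data, is the kind of pattern the caption attributes to technical bias; a true variant would be expected to show similar proportions at every cycle.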

Figure 2

Effect of mappability and GC content on coverage. (a) Mean tag counts, with error bars, in 50-bp bins for a naked DNA sample from a ChIP-Seq experiment, showing that the counts depend on mappability and GC content. (b) 97.4% of bins have GC content between 0.2 and 0.56 (that is, 20% to 56%), as marked by the vertical dashed lines. This figure is reproduced with permission from Kuan et al. [21].
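The GC dependence summarized in Figure 2 can be explored by grouping bins by GC content and averaging the tag counts in each group. The sketch below is an illustration only, not the method of Kuan et al.: the helper names, the `(sequence, tag_count)` input format, the choice of ten equal-width strata, and the toy bins are assumptions. Mappability is not handled here, because it would require an external mappability track.

```python
from collections import defaultdict
from statistics import mean


def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C, or None if none are called."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return None
    return sum(base in "GC" for base in called) / len(called)


def mean_count_by_gc(bins, n_strata=10):
    """Return {stratum_index: mean tag count} for (sequence, tag_count) bins.

    Stratum k covers GC fractions in [k / n_strata, (k + 1) / n_strata);
    a GC fraction of exactly 1.0 is placed in the last stratum.
    """
    grouped = defaultdict(list)
    for sequence, count in bins:
        gc = gc_fraction(sequence)
        if gc is None:
            continue  # skip bins with no called bases
        stratum = min(int(gc * n_strata), n_strata - 1)
        grouped[stratum].append(count)
    return {stratum: mean(counts) for stratum, counts in sorted(grouped.items())}


# Toy bins (invented): two AT-only bins and two bins with 50% GC.
bins = [
    ("ATATATATAT", 2),
    ("ATATATATAA", 4),
    ("GCGCGATATA", 10),
    ("GGCCGATATA", 12),
]
gc_profile = mean_count_by_gc(bins)
```

If mean counts differ systematically between strata, as they do in the toy data, coverage should not be compared across regions without adjusting for GC content.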

Figure 3

Batch effect for second-generation sequencing data from the 1000 Genomes Project. This figure is similar to one from Leek et al. [10]. Each row in the heat map shows data from a different HapMap sample processed in the same facility with the same platform (see Leek et al. [10] for a description of the data), for a 3-Mb region on chromosome 16, with data summarized in 10-kb bins. Data from each bin were standardized across samples, with blue representing 3 standard deviations below average and orange representing 3 standard deviations above average. The rows are ordered by date, with black lines separating different processing days. The largest batch effect is visible as the alternating pattern of blue and orange on days 223 to 241 and days 244 to 251.
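The per-bin standardization described for Figure 3 amounts to computing a z-score for each bin across samples. The following is a minimal sketch under stated assumptions, not the code of Leek et al.: the function name `standardize_bins`, the use of the sample standard deviation, and the clipping to ±3 (chosen to match the color scale in the caption) are assumptions, and the coverage values are invented.

```python
from statistics import mean, stdev


def standardize_bins(coverage, limit=3.0):
    """Z-score each bin (column) across samples (rows), clipped to [-limit, limit].

    `coverage` is a list of rows, one per sample, each holding one summary
    value per genomic bin. At least two samples are required. Bins with no
    variation across samples are assigned 0.0.
    """
    n_bins = len(coverage[0])
    scores = [[0.0] * n_bins for _ in coverage]
    for j, column in enumerate(zip(*coverage)):
        center = mean(column)
        spread = stdev(column)  # sample standard deviation (assumption)
        for i, value in enumerate(column):
            z = 0.0 if spread == 0 else (value - center) / spread
            scores[i][j] = max(-limit, min(limit, z))
    return scores


# Toy data (invented): three samples, two bins.
coverage = [
    [10, 5],
    [20, 5],
    [30, 5],
]
z_scores = standardize_bins(coverage)
```

Ordering the rows of `z_scores` by processing date and plotting them as a heat map would then show batch effects as blocks of rows that share the same sign across many bins.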

References

    1. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264. doi: 10.1093/biostatistics/4.2.249.
    2. Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics. 2004;20:323–331. doi: 10.1093/bioinformatics/btg410.
    3. Irizarry RA, Wu Z, Jaffee HA. Comparison of Affymetrix GeneChip expression measures. Bioinformatics. 2006;22:789–794. doi: 10.1093/bioinformatics/btk046.
    4. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, Griffin C, Hilmer SC, Hoffman E, Jedlicka AE, Kawasaki E, Martinez-Murillo F, Morsberger L, Lee H, Petersen D, Quackenbush J, Scott A, Wilson M, Yang Y, Ye SQ, Yu W. Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005;2:345–350. doi: 10.1038/nmeth756.
    5. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006;24:1151–1161. doi: 10.1038/nbt1239.
