Substantial biases in ultra-short read data sets from high-throughput DNA sequencing

Juliane C Dohm et al. Nucleic Acids Res. 2008 Sep.

Abstract

Novel sequencing technologies permit the rapid production of large sequence data sets. These technologies are likely to revolutionize genetics and biomedical research, but a thorough characterization of the ultra-short read output is necessary. We generated and analyzed two Illumina 1G ultra-short read data sets, i.e. 2.8 million 27mer reads from a Beta vulgaris genomic clone and 12.3 million 36mers from the Helicobacter acinonychis genome. We found that error rates range from 0.3% at the beginning of reads to 3.8% at the end of reads. Wrong base calls are frequently preceded by base G. Base substitution error frequencies vary by 10- to 11-fold, with A > C transversions being among the most frequent and C > G transversions among the least frequent substitution errors. Insertions and deletions of single bases occur at very low rates. When simulating re-sequencing, we found a 20-fold sequencing coverage to be sufficient for correct reads to compensate for errors. The read coverage of the sequenced regions is biased; the highest read density was found in intervals with elevated GC content. High Solexa quality scores are over-optimistic, whereas low scores underestimate the data quality. Our results show different types of biases and ways to detect them. Such biases have implications for the use and interpretation of Solexa data in de novo sequencing, re-sequencing, the identification of single nucleotide polymorphisms and DNA methylation sites, and transcriptome analysis.

Figures

Figure 1.

Pie charts of the read analysis with ELAND. The ELAND categories are: QC, no matching done because of low quality of the read (more than two positions with quality score = −5); NM, no match found; U0, unique exact match found; U1, unique match with one error; U2, unique match with two errors; R0, multiple exact matches found; R1, multiple matches with one error; R2, multiple matches with two errors. The categories R0, R1 and R2 are shown as a single entity. (a) ELAND categorizations for 27mer reads from Beta vulgaris clone ZR-47B15 (2 788 286 in total). (b) ELAND categorizations for 32mer reads from Helicobacter acinonychis (12 288 791 in total; the last four base calls of the original 36mer data were trimmed).
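The grouping used for the pie charts can be sketched as a simple tally. This is a minimal illustration, not the authors' procedure: the function name `summarize_categories` and the toy input are hypothetical, and the only thing taken from the caption is the set of category codes and the merging of R0, R1 and R2.

```python
from collections import Counter

# Category codes as listed in the caption; R0/R1/R2 are pooled for display.
REPEAT_CODES = {"R0", "R1", "R2"}


def summarize_categories(codes):
    """Return the fraction of reads per category, merging R0/R1/R2 into 'R'."""
    merged = ["R" if code in REPEAT_CODES else code for code in codes]
    counts = Counter(merged)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


# Hypothetical toy input, not data from the paper.
example_codes = ["U0", "U0", "U1", "R0", "R2", "NM", "QC", "U0"]
fractions = summarize_categories(example_codes)
```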

Figure 2.

Correlation of the Solexa read coverage and GC content. (a) 27mer reads generated from Beta vulgaris BAC ZR-47B15. (b) 32mer data set from the Helicobacter acinonychis genome. Each data point corresponds to the number of reads recorded for a 1-kbp window (shift of 100 bp in Beta and 1 kbp in Helicobacter).
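A windowed comparison of this kind could be computed roughly as follows. The sketch assumes that a read is assigned to a window by its start position and that the relationship is summarized with a Pearson coefficient; neither detail is stated in the caption, and all function names are illustrative.

```python
from math import sqrt


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def window_stats(reference, read_starts, window=1000, shift=100):
    """Return (gc_fraction, read_count) for each sliding window.

    Assumption: a read belongs to the window that contains its start position.
    """
    stats = []
    for start in range(0, len(reference) - window + 1, shift):
        end = start + window
        n_reads = sum(1 for pos in read_starts if start <= pos < end)
        stats.append((gc_fraction(reference[start:end]), n_reads))
    return stats


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either variance is zero."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical toy reference: an AT-rich half followed by a GC-rich half.
toy_reference = "AT" * 10 + "GC" * 10
toy_starts = [1, 21, 22, 25, 31, 33, 38]
toy_stats = window_stats(toy_reference, toy_starts, window=10, shift=10)
toy_r = pearson([gc for gc, _ in toy_stats], [n for _, n in toy_stats])
```

With window=1000 and shift=100 the function mirrors the Beta vulgaris setting in the caption; shift=1000 mirrors the Helicobacter setting.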

Figure 3.

Distribution of Solexa reads along the reference sequences considering unique match positions reported by ELAND (zero, one or two mismatch bases) and reads with more than one match position (no mismatch bases) detected with a Perl script. (a) Read distribution along the Beta vulgaris BAC sequence (with cloning vector pBeloBACII). 2 166 892 27mer reads were matched against the finished sequence (enclosed by the cloning vector, ∼117 kbp in total). The read coverage was calculated in 200 consecutive 0.58 kbp windows. (b) Read distribution along the 1.55 Mbp Helicobacter genome, based on 8 700 113 32mer reads. The local coverage is shown in 200 consecutive windows of 7.77 kbp.

Figure 4.

Frequency of wrong base calls in Solexa reads depending on the position along the read (27mer reads from Beta vulgaris and 32mer reads from Helicobacter). (a) Error frequency per position calculated from considering wrong base calls only. The highest error frequency is observed at the read 3′ end. (b) Per-base error rates (overall error frequency per position considering all base calls).
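The two panels describe different normalizations of the same per-position error counts. The sketch below is one reading of the caption, not the authors' code: panel (a) is taken to be each position's share of all wrong base calls, and panel (b) the wrong calls at a position divided by all base calls at that position. It assumes ungapped read/reference pairs of equal length; `error_profiles` is a hypothetical name.

```python
def error_profiles(alignments):
    """Compute per-position error statistics.

    alignments: list of (read, reference_segment) pairs, ungapped and of
    equal length within each pair (an assumption of this sketch).
    Returns (share_of_errors, per_base_error_rate), one value per position.
    """
    read_len = max(len(read) for read, _ in alignments)
    errors = [0] * read_len
    calls = [0] * read_len
    for read, ref in alignments:
        for i, (called, true_base) in enumerate(zip(read, ref)):
            calls[i] += 1
            if called != true_base:
                errors[i] += 1
    total_errors = sum(errors)
    # Panel (a) reading: distribution of wrong calls over read positions.
    share = [e / total_errors if total_errors else 0.0 for e in errors]
    # Panel (b) reading: wrong calls relative to all calls at that position.
    rate = [e / c if c else 0.0 for e, c in zip(errors, calls)]
    return share, rate


# Hypothetical toy alignments with errors concentrated at the 3' end.
toy_alignments = [
    ("ACGT", "ACGT"),
    ("ACGA", "ACGT"),
    ("ACTA", "ACGT"),
    ("ACGT", "ACGT"),
]
toy_share, toy_rate = error_profiles(toy_alignments)
```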

Figure 5.

Compensation of sequencing errors by deep sequencing in re-sequencing projects. The average number of errors per kbp is shown for different levels of coverage. For coverages below 2, reads are unlikely to overlap and compensation of sequencing errors is rare (thus, sequencing errors accumulate when the coverage is increased). For coverages above 3-fold the number of uncompensated errors drops rapidly with the increase of coverage.
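The qualitative behaviour described here can be reproduced with a small simulation. This is a sketch under simplifying assumptions that are not taken from the paper: a random reference, uniformly placed reads, a single uniform per-base error rate, and a position counted as an uncompensated error when some wrong base is called at least as often as the correct one. All names and default values are illustrative.

```python
import random
from collections import Counter

BASES = "ACGT"


def uncompensated_errors_per_kbp(coverage, genome_len=5000, read_len=27,
                                 error_rate=0.02, seed=0):
    """Simulate re-sequencing and count positions where errors are not outvoted."""
    rng = random.Random(seed)
    reference = [rng.choice(BASES) for _ in range(genome_len)]
    piles = [Counter() for _ in range(genome_len)]
    n_reads = int(coverage * genome_len / read_len)
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for pos in range(start, start + read_len):
            base = reference[pos]
            if rng.random() < error_rate:
                # Substitute one of the three other bases uniformly (assumption).
                base = rng.choice([b for b in BASES if b != base])
            piles[pos][base] += 1
    uncompensated = 0
    for pos, pile in enumerate(piles):
        correct = pile[reference[pos]]
        wrong = max((n for b, n in pile.items() if b != reference[pos]), default=0)
        # A wrong base that ties with or beats the correct base is not compensated.
        if wrong > 0 and wrong >= correct:
            uncompensated += 1
    return 1000 * uncompensated / genome_len


low_coverage_errors = uncompensated_errors_per_kbp(1)
high_coverage_errors = uncompensated_errors_per_kbp(20)
```

Because the real error rate depends on the read position, the absolute numbers from a uniform-rate model like this one should not be compared with the figure; only the shape of the curve is meant to carry over.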

Figure 6.

Distance between two errors on a read in the Helicobacter and Beta vulgaris data sets. ‘0’ indicates that the erroneous base calls are next to each other.
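Under the convention in the caption, the distance is the number of correct base calls between the two errors. A minimal sketch, assuming ungapped read/reference pairs and considering only reads with exactly two wrong base calls (`error_distance` is a hypothetical helper):

```python
def error_distance(read, ref):
    """Number of bases between the two wrong calls of a read.

    Returns 0 for adjacent errors, and None when the read does not contain
    exactly two mismatches against the reference segment.
    """
    positions = [i for i, (called, true_base) in enumerate(zip(read, ref))
                 if called != true_base]
    if len(positions) != 2:
        return None
    return positions[1] - positions[0] - 1


adjacent = error_distance("ACGTAA", "ACGTTT")
far_apart = error_distance("TCGTAT", "ACGTAA")
single_error = error_distance("ACGA", "ACGT")
```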

Figure 7.

Sequence context of wrong base calls in Solexa reads from Helicobacter acinonychis and Beta vulgaris, considering one base upstream and downstream of the wrong base calls. An ‘e’ indicates the substituted base. The scatterplot shows the correlation of the relative frequencies (relating the frequency of 3-tuples at error positions to the frequency of all 3-tuples in the reads) for the two data sets.
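One way to compute such relative frequencies is sketched below. It reflects an interpretation rather than the authors' exact procedure: the middle base of every 3-tuple is masked as 'e', flanking bases are taken from the read, and the share of a context among error positions is divided by its share among all interior read positions. The name `context_enrichment` is hypothetical.

```python
from collections import Counter


def context_enrichment(alignments):
    """Relative frequency of flanking contexts at error positions.

    alignments: list of ungapped (read, reference_segment) pairs.
    Returns {context: share among errors / share among all positions}.
    """
    error_ctx, all_ctx = Counter(), Counter()
    for read, ref in alignments:
        # Only interior positions have both an upstream and a downstream base.
        for i in range(1, len(read) - 1):
            ctx = read[i - 1] + "e" + read[i + 1]
            all_ctx[ctx] += 1
            if read[i] != ref[i]:
                error_ctx[ctx] += 1
    n_err = sum(error_ctx.values())
    n_all = sum(all_ctx.values())
    return {ctx: (error_ctx[ctx] / n_err) / (all_ctx[ctx] / n_all)
            for ctx in error_ctx}


# Hypothetical toy alignments: one error, preceded by G and followed by C.
toy_enrichment = context_enrichment([("AGTC", "AGAC"), ("AGCA", "AGCA")])
```

A value above 1 means the context is over-represented at error positions, which is how an excess of errors preceded by G would show up.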

Figure 8.

Frequency of substitution errors in the Helicobacter acinonychis and Beta vulgaris Solexa read data sets.
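Substitution frequencies of this kind can be tallied from ungapped read/reference pairs. In the sketch below, "A>C" is assumed to mean a reference A that was called as C; the caption does not spell out the direction, and the function names are illustrative.

```python
from collections import Counter


def substitution_counts(alignments):
    """Count substitution errors keyed as 'reference>called'."""
    counts = Counter()
    for read, ref in alignments:
        for called, true_base in zip(read, ref):
            if called != true_base:
                counts[f"{true_base}>{called}"] += 1
    return counts


def fold_range(counts):
    """Ratio of the most frequent to the least frequent observed substitution."""
    values = [n for n in counts.values() if n > 0]
    return max(values) / min(values) if values else None


# Hypothetical toy alignments, not data from the paper.
toy_counts = substitution_counts([
    ("CCGT", "ACGT"),
    ("CCGT", "ACGT"),
    ("AGGT", "ACGT"),
])
toy_fold = fold_range(toy_counts)
```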

Figure 9.

Histograms of base quality values for all correct base calls (a) and all wrong base calls (b) in the Beta and Helicobacter data sets.
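To judge whether quality values are optimistic or pessimistic, each score has to be translated into the error probability it implies and compared with the observed error rate. The sketch below uses the odds-based definition of early Solexa scores, Q = -10 log10(p / (1 - p)), alongside the Phred definition, Q = -10 log10(p), for contrast. The helper names and the toy calls are hypothetical.

```python
def solexa_to_error_probability(score):
    """Error probability implied by an odds-based Solexa score."""
    return 1.0 / (1.0 + 10.0 ** (score / 10.0))


def phred_to_error_probability(score):
    """Error probability implied by a Phred score."""
    return 10.0 ** (-score / 10.0)


def observed_error_rate_by_score(calls):
    """Observed fraction of wrong calls per quality score.

    calls: iterable of (score, is_wrong) pairs, one per base call.
    """
    totals, wrong = {}, {}
    for score, is_wrong in calls:
        totals[score] = totals.get(score, 0) + 1
        wrong[score] = wrong.get(score, 0) + (1 if is_wrong else 0)
    return {score: wrong[score] / totals[score] for score in totals}


# Hypothetical toy base calls, not data from the paper.
toy_observed = observed_error_rate_by_score([(40, False), (40, True), (10, False)])
# A score is over-optimistic when the observed rate exceeds the implied one.
over_optimistic = {score: rate > solexa_to_error_probability(score)
                   for score, rate in toy_observed.items()}
```

The two scales nearly coincide for high scores and diverge for low ones: a Solexa score of 0 implies an error probability of 0.5, and negative scores such as −5 imply that the call is more likely wrong than right.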

Cited by

CleanUpRNAseq: An R/Bioconductor Package for Detecting and Correcting DNA Contamination in RNA-Seq Data.
Liu H, Hu K, O'Connor K, Kelliher MA, Zhu LJ. BioTech (Basel). 2024 Aug 3;13(3):30. doi: 10.3390/biotech13030030. PMID: 39189209. Free PMC article.
DTDHM: detection of tandem duplications based on hybrid methods using next-generation sequencing data.
Yuan T, Dong J, Jia B, Jiang H, Zhao Z, Zhou M. PeerJ. 2024 Jul 26;12:e17748. doi: 10.7717/peerj.17748. eCollection 2024. PMID: 39076774. Free PMC article.
Improving the Accuracy of Bulk Fitness Assays by Correcting Barcode Processing Biases.
McGee RS, Kinsler G, Petrov D, Tikhonov M. Mol Biol Evol. 2024 Aug 2;41(8):msae152. doi: 10.1093/molbev/msae152. PMID: 39041198. Free PMC article.
Adaptation to seasonal reproduction and environment-associated factors drive temporal and spatial differentiation in northwest Atlantic herring despite gene flow.
Fuentes-Pardo AP, Stanley R, Bourne C, Singh R, Emond K, Pinkham L, McDermid JL, Andersson L, Ruzzante DE. Evol Appl. 2024 Mar 14;17(3):e13675. doi: 10.1111/eva.13675. eCollection 2024 Mar. PMID: 38495946. Free PMC article.
Data mining reveals tissue-specific expression and host lineage-associated forms of Apis mellifera filamentous virus.
Cornman RS. PeerJ. 2023 Nov 14;11:e16455. doi: 10.7717/peerj.16455. eCollection 2023. PMID: 38025724. Free PMC article.
