Sequence-specific error profile of Illumina sequencers

doi: 10.1093/nar/gkr344. Epub 2011 May 16.

Kensuke Nakamura, Taku Oshima, Takuya Morimoto, Shun Ikeda, Hirofumi Yoshikawa, Yuh Shiwa, Shu Ishikawa, Margaret C Linak, Aki Hirai, Hiroki Takahashi, Md Altaf-Ul-Amin, Naotake Ogasawara, Shigehiko Kanaya

Kensuke Nakamura et al. Nucleic Acids Res. 2011 Jul.

Abstract

We identified the sequence-specific starting positions of consecutive miscalls in the mapping of reads obtained from the Illumina Genome Analyser (GA). Detailed analysis of the miscall pattern indicated that the underlying mechanism involves sequence-specific interference of the base elongation process during sequencing. The two major sequence patterns that trigger this sequence-specific error (SSE) are: (i) inverted repeats and (ii) GGC sequences. We speculate that these sequences favor dephasing by inhibiting single-base elongation, by: (i) folding single-stranded DNA and (ii) altering enzyme preference. This phenomenon is a major cause of sequence coverage variability and of the unfavorable bias observed for population-targeted methods such as RNA-seq and ChIP-seq. Moreover, SSE is a potential cause of false single-nucleotide polymorphism (SNP) calls and also significantly hinders de novo assembly. This article highlights the importance of recognizing SSE and its underlying mechanisms in the hope of enhancing the potential usefulness of the Illumina sequencers.
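
The two trigger patterns named in the abstract can be located in a reference sequence with a simple scan. The sketch below is only an illustration, not the authors' detection method: the function names, the arm length and the maximum loop size are assumptions chosen for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def find_ggc(seq):
    """Return 0-based start positions of every GGC occurrence."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "GGC"]


def find_inverted_repeats(seq, arm=4, max_loop=6):
    """Return (left_start, right_start, left_arm) for each pair of arms
    where the right arm is the reverse complement of the left arm.

    arm and max_loop are hypothetical thresholds, not values from the paper.
    """
    hits = []
    n = len(seq)
    for i in range(n - 2 * arm + 1):
        left = seq[i:i + arm]
        target = reverse_complement(left)
        for loop in range(max_loop + 1):
            j = i + arm + loop  # start of the candidate right arm
            if j + arm > n:
                break
            if seq[j:j + arm] == target:
                hits.append((i, j, left))
    return hits


example = "AAGGCTTTTGCCTT"
ggc_positions = find_ggc(example)
repeats = find_inverted_repeats(example)
```

Positions flagged this way are only candidates; whether a given site actually produces an SSE depends on the mapping evidence described in the article.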

Figures

Figure 1.

(i) First segment of the mapping results obtained from Illumina sequencing runs for (a) B. subtilis, (b) M. bovis and (c) B. pertussis, generated using MPSmap and PSmap allowing 35 mismatches per read. Pale blue lines associated with the gene ID and name indicate gene areas. Magenta arrows with SSE signs indicate the positions of visually identified SSE. Green arrows indicate the positions of SNPs. SSE positions automatically detected are accompanied by numbers, which indicate the reference positions. For (b) and (c), mappings with the first 10 million reads are displayed. (ii) The average base call quality for all aligned bases at each reference position. The blue plot indicates forward reads, and the green plot, reverse reads. (iii) Ratio of the number of mismatches between reference and reads to the number of all mapped bases at each reference position. The magenta plot indicates forward reads, and the orange plot, reverse reads.
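
The mismatch ratio in panel (iii) is the number of mismatching bases divided by the number of all mapped bases at each reference position. A minimal sketch of that calculation follows, assuming ungapped alignments given as hypothetical (start, read) pairs; it does not reproduce MPSmap or PSmap output, and forward and reverse reads would simply be passed in separately.

```python
def mismatch_ratio(reference, alignments):
    """Per-position mismatch ratio for ungapped alignments.

    reference  : reference sequence as a string
    alignments : iterable of (start, read) pairs, start being the 0-based
                 reference position of the first read base (assumed format)
    Returns a list with one ratio per reference position, or None where
    no read covers the position.
    """
    depth = [0] * len(reference)
    mismatches = [0] * len(reference)
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos >= len(reference):
                break  # ignore read bases that run past the reference end
            depth[pos] += 1
            if base != reference[pos]:
                mismatches[pos] += 1
    return [m / d if d else None for m, d in zip(mismatches, depth)]


ratios = mismatch_ratio("ACGTAC", [(0, "ACGT"), (2, "GTTC"), (1, "CGAA")])
```

A run of consecutive positions with an elevated ratio in one read direction only is the kind of signal the caption associates with an SSE, whereas a true SNP would be expected to show mismatches in both directions.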

Figure 2.

Examples of SSE and SNP positions in mapping of B. subtilis. Each drawing displays areas with (a) an SSE position, (b) two overlapping SSE positions with inverted repeat, (c) an SSE resembling an SNP and (d) true SNPs.

Figure 3.

First 20 SSE positions of B. subtilis automatically detected in the (a) forward and (b) backward directions. The numbers in the left column indicate the genome coordinate of each SSE position. For each row, the base next to the vertical red line is the SSE position.

Figure 4.

(a) Base-wise view of a part of the B. subtilis mapping result and (b) the alignment of the reference and the read in the middle row indicated by an arrow. The gray dotted lines show the match, whereas the pink dotted lines show the influence of previous base calls on mismatches.

Figure 5.

Plots of (a) average base call quality and (b) mismatch ratio along the sequencing cycle. Quality values for B. subtilis are based on the Illumina/Solexa standard protocol, while the other data are Phred-type scores (30).

Figure 6.

Schematic representation of the (a) inverted repeat and (b) enzyme preference for the SSE hypothetical mechanistic models. The gray numbers at the top indicate the cycle number and the numbers below indicate the relative population of each single-stranded DNA during the cycle. The colored bases and numbers below the drawings show the relative intensity of signals during that cycle. For instance, the second cycle of model (a) emits signals for C and G with an intensity of 73 and 27%, respectively.
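
The bookkeeping behind such a schematic, strand populations per cycle and the resulting relative signal intensities, can be sketched with a toy dephasing model. Everything here is an assumption made for illustration: the per-base incorporation efficiencies, the rule that a strand either adds one base or stalls in a cycle, and the function name. It is not the authors' model and is not tuned to reproduce the 73 and 27% of the caption.

```python
def simulate_dephasing(template, efficiency, cycles):
    """Toy model of single-base elongation with incomplete incorporation.

    template   : bases to be read, in order (string)
    efficiency : efficiency[k] is the assumed probability that a strand
                 waiting at base k incorporates it in one cycle
    cycles     : number of sequencing cycles to simulate
    Returns one dict per cycle mapping base -> relative signal intensity.
    """
    # population[k] = fraction of strands that have incorporated k bases
    population = [0.0] * (len(template) + 1)
    population[0] = 1.0
    signals = []
    for _ in range(cycles):
        new_population = [0.0] * (len(template) + 1)
        emitted = {}
        for k, frac in enumerate(population):
            if frac == 0.0:
                continue
            if k == len(template):
                new_population[k] += frac  # finished strands emit nothing
                continue
            advanced = frac * efficiency[k]
            new_population[k + 1] += advanced
            new_population[k] += frac - advanced  # stalled strands lag behind
            base = template[k]
            emitted[base] = emitted.get(base, 0.0) + advanced
        total = sum(emitted.values())
        signals.append({b: v / total for b, v in emitted.items()} if total else {})
        population = new_population
    return signals


signals = simulate_dephasing("CGA", [0.5, 1.0, 1.0], cycles=2)
```

With a reduced efficiency at the first base, the second cycle mixes the signal of the lagging strands (still adding C) with that of the in-phase strands (adding G), which is the kind of mixed signal the schematic depicts.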

Figure 7.

Comparison of coverage between (i) mapping allowing 35 mismatches, (ii) mapping allowing 2 mismatches and (iii) mapping of truncated reads using the first 35 bp, allowing 2 mismatches. Each drawing shows areas of the M. bovis genome including (a) an SSE position, (b) overlapping SSE positions in opposite directions associated with inverted repeats, and (c) multiple overlapping SSE positions. Mappings were carried out with MPSmap and PSmap for the first 10 million reads.

References

    1. Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H, Turner DJ. A large genome center’s improvements to the Illumina sequencing system. Nat. Methods. 2008;5:1005–1010.
    2. Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, et al. The sequence and de novo assembly of the giant panda genome. Nature. 2010;463:311–317.
    3. Fujimoto A, Nakagawa H, Hosono N, Nakano K, Abe T, Boroevich KA, Nagasaki M, Yamaguchi R, Shibuya T, Kubo M, et al. Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nat. Genet. 2010;42:931–936.
    4. Bennett S. Solexa Ltd. Pharmacogenomics. 2004;5:433–438.
    5. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380.
