Alta-Cyclic: a self-optimizing base caller for next-generation sequencing

Yaniv Erlich et al. Nat Methods. 2008 Aug.

Abstract

Next-generation sequencing is limited to short read lengths and by high error rates. We systematically analyzed sources of noise in the Illumina Genome Analyzer that contribute to these high error rates and developed a base caller, Alta-Cyclic, that uses machine learning to compensate for noise factors. Alta-Cyclic substantially improved the number of accurate reads for sequencing runs up to 78 bases and reduced systematic biases, facilitating confident identification of sequence variants.

Figures

Figure 1

Schematic representation of main Illumina noise factors. (a–d) A DNA cluster comprises identical DNA templates (colored boxes) that are attached to the flow cell. Nascent strands (black boxes) and DNA polymerase (black ovals) are depicted. In the ideal situation, after several cycles the signal (green arrows) is strong, coherent and corresponds to the interrogated position (a). Phasing noise introduces lagging (blue arrows) and leading (red arrow) nascent strands, which transmit a mixture of signals (b). Fading is attributed to loss of material that reduces the signal intensity (c). Changes in the fluorophore cross-talk cause misinterpretation of the received signal (teal arrows; d). For simplicity, the noise factors are presented separately from each other.
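
The phasing effect in panel b can be illustrated with a toy model. The sketch below is not taken from the paper: it assumes that in every cycle each strand independently fails to extend with probability `p_lag`, extends by two bases with probability `p_lead`, and otherwise extends by exactly one base. The function name and both parameters are hypothetical.

```python
def phase_composition(cycles, p_lag, p_lead):
    """Return (lagging, in_phase, leading) strand fractions after `cycles` cycles.

    Toy model (an assumption, not the published one): per cycle a strand
    extends by 0 bases (lag), 1 base (in step), or 2 bases (lead).
    """
    p_ok = 1.0 - p_lag - p_lead
    lengths = {0: 1.0}  # nascent-strand length -> fraction of strands
    for _ in range(cycles):
        nxt = {}
        for length, frac in lengths.items():
            for step, prob in ((0, p_lag), (1, p_ok), (2, p_lead)):
                nxt[length + step] = nxt.get(length + step, 0.0) + frac * prob
        lengths = nxt
    # A strand is in phase when its length equals the number of cycles run.
    lagging = sum(f for length, f in lengths.items() if length < cycles)
    in_phase = lengths.get(cycles, 0.0)
    leading = sum(f for length, f in lengths.items() if length > cycles)
    return lagging, in_phase, leading


# Illustrative rates only; real values would have to be estimated from data.
for n in (10, 36, 78):
    lag, ok, lead = phase_composition(n, p_lag=0.01, p_lead=0.005)
    print(f"cycle {n}: lagging={lag:.3f} in_phase={ok:.3f} leading={lead:.3f}")
```

Under these assumptions the in-phase fraction shrinks with every cycle, which is one way to see why the mixture of signals grows worse toward the end of a long read.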

Figure 2

Alta-Cyclic base caller data flow. The training process (green arrows) starts with creation of the training set: intensity reads from sequences generated by the standard Illumina pipeline are linked to the corresponding genome sequence (the ‘correct’ sequence). Two grid searches then optimize the base-calling parameters. After optimization, a final array of SVMs is created, one per cycle. In the base-calling stage (blue arrows), the intensity files of the desired library are deconvolved using the optimized values to correct for phasing noise and are then sent to the SVM array for classification. The output is processed, and sequences and quality scores are reported.
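
The deconvolution step can be sketched for a single intensity channel. This is a minimal illustration under assumptions that are not stated in the source: phasing is modeled as a linear mixing matrix built from hypothetical per-cycle lag and lead probabilities, and the correction solves that linear system with plain Gaussian elimination. All function names are hypothetical, and the per-cycle SVM classification is not shown.

```python
def phasing_matrix(n_cycles, p_lag, p_lead):
    """P[t][j]: fraction of strands reporting template position j at cycle t."""
    p_ok = 1.0 - p_lag - p_lead
    lengths = {0: 1.0}  # nascent-strand length -> fraction of strands
    rows = []
    for _ in range(n_cycles):
        nxt = {}
        for length, frac in lengths.items():
            for step, prob in ((0, p_lag), (1, p_ok), (2, p_lead)):
                nxt[length + step] = nxt.get(length + step, 0.0) + frac * prob
        lengths = nxt
        # A strand of length L reports position L - 1 (0-based);
        # unextended strands and positions past the read end are dropped.
        rows.append([lengths.get(j + 1, 0.0) for j in range(n_cycles)])
    return rows


def blur(true_signal, matrix):
    """Apply phasing: observed[t] = sum_j P[t][j] * true_signal[j]."""
    return [sum(p * s for p, s in zip(row, true_signal)) for row in matrix]


def deconvolve(observed, matrix):
    """Recover the unphased signal by solving P x = observed
    (Gaussian elimination with partial pivoting)."""
    n = len(observed)
    aug = [row[:] + [b] for row, b in zip(matrix, observed)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


# One channel: 1.0 where the template carries this base, 0.0 elsewhere.
true_signal = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
P = phasing_matrix(len(true_signal), p_lag=0.05, p_lead=0.02)
observed = blur(true_signal, P)
recovered = deconvolve(observed, P)
```

In this noise-free toy case `recovered` matches `true_signal` up to rounding. With real intensities the phasing rates are unknown, which is presumably why the caption describes them as values optimized during training.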

Figure 3

Comparison between the Alta-Cyclic and Illumina base callers on the GAII platform. (a) Analysis of the HepG2 RNA library using Alta-Cyclic. The absolute number of additional fully correct reads (in addition to those generated by the Illumina base caller) is indicated by the red line; the fold change of the improvement is indicated by the blue bars. (b) A comparison of fully correct reads for the Tetrahymena micronuclear library by the Illumina base caller and Alta-Cyclic. (c) The average error rate in calls of the artificial SNP locations in the phi X library as a function of the cycle in which they were called. The dashed line represents 1% error rate (Q20). The plot on the right shows the last 18 cycles on a different scale. (d) A comparison of fully correct reads for the phi X library with 1% artificial SNPs. (e) Phi X sequences generated by Alta-Cyclic or Illumina were exhaustively aligned to the reference genome (allowing up to 53 mismatches out of 78). The distribution of alignment scores is shown beginning with an identical number of raw reads for input into each base caller.
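
The correspondence between a 1% error rate and Q20 follows from the standard Phred definition, Q = -10 log10(p). The sketch below shows that conversion, plus a rough estimate of how per-base error limits fully correct reads. The second function assumes independent errors at a constant rate across the read, which is a simplification the caption does not make (panel c shows error varying by cycle). The function names are hypothetical.

```python
import math


def phred_quality(error_probability):
    """Phred score for a per-base error probability: Q = -10 * log10(p)."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality):
    """Inverse of phred_quality: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


def fully_correct_fraction(read_length, per_base_error):
    """Expected fraction of reads with no errors, ASSUMING independent
    errors at a constant per-base rate (a simplification)."""
    return (1.0 - per_base_error) ** read_length


print(phred_quality(0.01))               # the Q20 threshold in panel c
print(fully_correct_fraction(78, 0.01))  # 78-base reads at a uniform 1% error
```

Under the uniform-error assumption, fewer than half of 78-base reads would be fully correct at 1% per-base error, which gives a sense of why late-cycle accuracy matters for long reads.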
