A Benchmark Study on Error Assessment and Quality Control of CCS Reads Derived from the PacBio RS

J Data Mining Genomics Proteomics. 2013 Jul 31;4(3):16008.

doi: 10.4172/2153-0602.1000136.

Xiaoli Jiao, Xin Zheng, Liang Ma, Geetha Kutty, Emile Gogineni, Qiang Sun, Brad T Sherman, Xiaojun Hu, Kristine Jones, Castle Raley, Bao Tran, David J Munroe, Robert Stephens, Dun Liang, Tomozumi Imamichi, Joseph A Kovacs, Richard A Lempicki, Da Wei Huang

Xiaoli Jiao et al. J Data Mining Genomics Proteomics. 2013.

Abstract

PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20 kb), in contrast to the shorter reads produced by first- and second-generation sequencing technologies. Because the platform is new, it is important to assess its sequencing error rate as well as the quality control (QC) parameters associated with PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced on the PacBio RS. After aligning the Circular Consensus Sequence (CCS) reads from this experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that, even though CCS reads are already error-corrected, it is still necessary to perform appropriate QC on them to obtain successful downstream bioinformatics results.

Keywords: CCS read; PacBio; SVM regression; assembly; pass number; quality control (QC); quality value (QV).
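As a rough illustration of the error assessment described in the abstract, the following Python sketch scores each read against a set of known references and reports the median error rate. It is not the authors' pipeline: the abstract does not say which aligner or error definition was used, so the sketch assumes a plain Levenshtein edit distance divided by the reference length, and the names `edit_distance`, `best_error_rate` and `median_error_rate` are hypothetical.

```python
from statistics import median


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of substitutions,
    insertions and deletions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character of a
                curr[j - 1] + 1,           # insert a character of b
                prev[j - 1] + (ca != cb),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def best_error_rate(read: str, references: list[str]) -> float:
    """Assumed definition: edit distance to the closest reference,
    divided by that reference's length."""
    return min(edit_distance(read, ref) / len(ref) for ref in references)


def median_error_rate(reads: list[str], references: list[str]) -> float:
    """Median of the per-read error rates."""
    return median(best_error_rate(r, references) for r in reads)


# Toy data for illustration only, not taken from the study.
references = ["ACGTACGTAC", "ACGTTCGTAC"]
reads = ["ACGTACGTAC", "ACGTTCGTAA", "ACGACGTAC"]
print(median_error_rate(reads, references))
```

Taking the minimum over references reflects the fact that the amplicons are closely related, so each read is scored against whichever reference it matches best. A real analysis would use a proper aligner instead of a quadratic-time edit distance.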

Figures

Figure 1

Read accuracy (ABIP) versus quality value (mean QV). The mean QV is correlated with CCS read accuracy, particularly at QV-40 and above, which suggests that QV can be a useful QC parameter for removing low-quality CCS reads. However, the plot also shows that the majority of CCS reads, including the low-quality ones, fall below QV-40. A QV-40 cutoff might therefore carry a high cost by removing a large number of high-quality CCS reads. A QV-30 cutoff may be more balanced.
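The tradeoff between a QV-30 and a QV-40 cutoff can be sketched in a few lines of Python. This is a minimal illustration under two assumptions not stated in the caption: that QV follows the standard Phred convention (QV = -10 log10 of the error probability), and that the mean QV of a read is the arithmetic mean of its per-base QVs. The function names and the toy values are hypothetical.

```python
def qv_to_error_prob(qv: float) -> float:
    """Phred convention: QV = -10 * log10(P_error)."""
    return 10 ** (-qv / 10)


def mean_qv(base_qvs: list[float]) -> float:
    """Arithmetic mean of per-base quality values (assumed definition)."""
    return sum(base_qvs) / len(base_qvs)


def retained_fraction(reads_qvs: list[list[float]], cutoff: float) -> float:
    """Fraction of reads whose mean QV is at or above the cutoff."""
    kept = [q for q in reads_qvs if mean_qv(q) >= cutoff]
    return len(kept) / len(reads_qvs)


# Toy per-base QVs for four reads (illustrative only, not study data).
toy_reads = [
    [20, 30, 25],   # mean 25
    [30, 34, 32],   # mean 32
    [35, 35, 35],   # mean 35
    [40, 42, 41],   # mean 41
]

for cutoff in (30, 40):
    kept = retained_fraction(toy_reads, cutoff)
    print(f"QV-{cutoff}: keep {kept:.0%} of reads, "
          f"nominal per-base error <= {qv_to_error_prob(cutoff):.4%}")
```

Under the Phred convention, QV-30 corresponds to a nominal error probability of 1 in 1,000 and QV-40 to 1 in 10,000, so the stricter cutoff buys one order of magnitude in nominal accuracy at the price of discarding every read whose mean QV falls between the two thresholds.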

Figure 2

The distribution of pass number for the 9812 CCS MSG reads.

Figure 3

Read quality value (mean QV) versus CCS pass number. The read quality value generally increases with the pass number, but the relationship is not linear: as the pass number gets higher, the increase in read quality value slows down. (Note: box plots for pass numbers greater than 15 are not shown because of insufficient data points.)

Figure 4

Box plots of CCS read accuracy (ABIP) for different pass numbers. Figure 4a shows the box plots for all 9812 reads without QC; most of the outliers, denoted by red crosses, are low-quality reads. Figure 4b shows the box plots for the top 3000 reads ranked by mean QV. The low-quality reads with pass numbers lower than 7 have largely been removed, but none of those with pass numbers greater than 7 have been. Meanwhile, no reads with a pass number of 2 were selected, and most good reads with low pass numbers were also screened out by the high mean QV threshold. Figure 4c shows the box plots for the top 3000 reads ranked by SVM-predicted accuracy. Most of the low-quality reads have been removed, and all of the selected reads have pass numbers less than 9. Figures 4b and 4c illustrate the different effects of the two QC methods, which arise from their different ranking mechanisms.
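The contrast between the two ranking mechanisms can be sketched as a top-N selection under two different sort keys. The Python example below is only an illustration: the `Read` record, the `predicted_accuracy` field (standing in for the output of a regression model such as the SVM mentioned above, which is not reproduced here) and all values are hypothetical.

```python
from collections import Counter
from typing import Callable, NamedTuple


class Read(NamedTuple):
    name: str
    passes: int                # CCS pass number
    mean_qv: float             # mean quality value of the read
    predicted_accuracy: float  # hypothetical model output, assumed precomputed


def top_n(reads: list[Read], n: int, key: Callable[[Read], float]) -> list[Read]:
    """Keep the n highest-scoring reads under the given ranking key."""
    return sorted(reads, key=key, reverse=True)[:n]


def pass_histogram(reads: list[Read]) -> Counter:
    """Count how many selected reads fall at each pass number."""
    return Counter(r.passes for r in reads)


# Made-up reads for illustration only.
toy_reads = [
    Read("r1", 2, 22.0, 0.990),
    Read("r2", 3, 28.0, 0.985),
    Read("r3", 5, 33.0, 0.980),
    Read("r4", 9, 45.0, 0.930),
    Read("r5", 10, 47.0, 0.950),
    Read("r6", 4, 30.0, 0.960),
]

by_qv = top_n(toy_reads, 3, key=lambda r: r.mean_qv)
by_model = top_n(toy_reads, 3, key=lambda r: r.predicted_accuracy)

print("ranked by mean QV:   ", sorted(pass_histogram(by_qv).items()))
print("ranked by prediction:", sorted(pass_histogram(by_model).items()))
```

In this toy data the mean QV ranking favors high-pass reads and drops the pass-2 read entirely, while the prediction-based ranking keeps accurate low-pass reads. That mirrors the qualitative difference between Figures 4b and 4c, though the numbers themselves carry no meaning.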
