Advantages of Single-Molecule Real-Time Sequencing in High-GC Content Genomes

Seung Chul Shin et al. PLoS One. 2013.

Abstract

Next-generation sequencing has become the most widely used sequencing technology in genomics research, but it has inherent drawbacks when dealing with high-GC content genomes. Recently, single-molecule real-time (SMRT) sequencing technology was introduced as a third-generation sequencing strategy to compensate for this drawback. Here, we report that the unbiased and longer reads of SMRT sequencing markedly improved the assembly of a genome with high GC content through gap filling and repeat resolution.

Conflict of interest statement

Competing Interests: JEL is an employee of DNALink, Inc. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

Figures

Figure 1. Statistics of error-corrected reads.

(a) The length distribution of CLRs and PBcRs. Error correction of CLRs with Illumina short reads (50×, 100× and 200× coverage) showed similar length distributions. Larger numbers of Illumina short reads did not improve the mean read length or the throughput of error correction, but adding CCS reads increased both. (b) CCS reads increased the throughput of error correction by joining the break positions that had no short-read coverage. (c) Base qualities of CLRs and PBcRs, where the x-axis corresponds to base position and the y-axis to the average Phred quality score.
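
As an illustration of the quantity plotted in panel (c), the following is a minimal sketch of how an average Phred quality score per base position could be computed from FASTQ-style quality strings. It is not the authors' pipeline. The function name `mean_phred_by_position` and the Phred+33 ASCII offset are assumptions made for this example.

```python
def mean_phred_by_position(quality_strings, offset=33):
    """Return the mean Phred score at each base position across reads.

    Assumes quality strings use an ASCII encoding with the given offset
    (Phred+33 by default). Reads may differ in length; each position is
    averaged only over the reads that reach it.
    """
    totals = []  # sum of Phred scores seen at each position
    counts = []  # number of reads covering each position
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    # 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
    print(mean_phred_by_position(["II", "5"]))
```

A Phred score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so Q20 means roughly one error in 100 bases and Q40 roughly one in 10,000.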

Figure 2. Results of error correction using 50× SR and 16× CCS reads.

HAWKEYE shows how the errors in a CLR were corrected with SRs (blue) and CCS reads (red). The numbers indicate the regions aligned only with CCS reads. CCS reads improved the throughput of error correction by spanning the regions that SRs left unaligned.

Figure 3. Streptomyces sp. PAMC 26508 assembly.

(a) The outermost track (pink) represents the complete genome sequence of Streptomyces sp. PAMC 26508; the middle track (red) represents the assembly with PBcRSR(50×)+CCS; the inner track (blue) represents the assembly with PBcRSR(50×); and the next track (green) represents the assembly with SR. The innermost track (red line) indicates the read coverage of the contigs assembled with PBcRSR(50×)+CCS. The numbers along the track indicate kilobase coordinates along the contig. The highlighted region H01 marks a contig mis-assembled because of a repeat (Fig. 3b), and the highlighted region H02 marks a representative region showing the differences between assemblies (Fig. 3c). (b) The red arrow indicates interspersed repeat sequences of the integrase gene. Contigs assembled from SRs(100×), with their short read length, were mis-assembled and split into three contigs by two integrase genes with identical sequences (600 bp long), but both PBcRSR(50×) and PBcRSR(50×)+CCS resolved the repeats because their reads span them. (c) The boxes indicate two types of gap: the black box indicates gaps generated by assembly with both SRs(100×) and PBcRs, and the yellow box indicates gaps generated by assembly with only SRs(100×). The black line shows GC content, and the green, blue and red lines show the respective coverages. Coverage and average GC content are shown for 25-base windows across the 1-kb regions flanking the gaps in the assemblies. Gaps generated by assembly using short reads were filled where PBcR coverage was sufficient, and PBcRSR(50×)+CCS spanned more gaps than PBcRSR(50×). The local GC content of the gaps is higher than that of the contigs.
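
The windowed GC content described in panel (c) can be sketched as follows. This is an illustrative example, not the authors' code: the function name `gc_content_windows` is hypothetical, and the choice of non-overlapping windows (rather than sliding ones) is an assumption, since the caption states only the 25-base window size.

```python
def gc_content_windows(sequence, window=25):
    """Return the GC fraction of each consecutive, non-overlapping window.

    A trailing stretch shorter than `window` is ignored. Bases other than
    G and C (including N) count as non-GC.
    """
    seq = sequence.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        fractions.append(gc / window)
    return fractions


if __name__ == "__main__":
    # Toy sequence with a small window so the result is easy to check.
    print(gc_content_windows("GGCCAATT", window=4))
```

Applied to the 1-kb sequence flanking a gap with the default 25-base window, this would yield 40 values that could be plotted alongside read coverage.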

Figure 4. Dot plot showed that the assembly PBcRSR(50×)+CCS+454 was more accurate than other assemblies.

SRs(100×)+454 to the contigs assembled with PBcRs. (a) Contigs of the assembly SRs(100×)+454 vs. contigs of PBcRSR(50×)+CCS+454. (b) Contigs of the assembly SRs(100×)+454 vs. contigs of PBcRSR(50×)+454. Horizontal and vertical dotted lines indicate the boundaries of each contig. The red contig numbers indicate the mis-assembled contigs, and the blue contig number and rectangle indicate the region of the mis-assembled contigs in Fig. 3b. (c) PCR validation of disagreements between the Illumina short-read assembly and the PBcR assembly (V1∼V7). The amplified V1∼V7 products showed that the contigs of the assembly SRs(100×)+454 were mis-assembled. (d) Contig 551 in the assembly PBcRSR(50×)+454 was confirmed to be mis-assembled in the region of ribosomal RNA operons by the amplified V8 and V9 products. (e) The region of the mis-assembled contig in Fig. 3b (indicated by the blue rectangle in a and b) was validated by PCR: integrase 1 (lane 1) and integrase 2 (lane 2).

Figure 5. PBcRs resolved the collapsed tandem repeat in the chromosome of Streptomyces sp. PAMC 26508.

(a) The region of tandem repeats was amplified by PCR and sequenced. The tandem repeat was mis-assembled in the assembly SRs(100×)+454 because of the short read length, but PBcRs resolved the tandem repeat by spanning the entire region. (b) The dot plot shows the alignment of the PCR product to the contig of PBcRSR(50×)+CCS+454. (c) The dot plot shows the alignment of the PCR product to the contig of SRs(100×)+454.

Grants and funding

This work was supported by a Functional Genomics on Polar Organisms grant (PE13020) funded by the Korea Polar Research Institute (KOPRI). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
