Unlocking short read sequencing for metagenomics

Sébastien Rodrigue et al. PLoS One. 2010.

Abstract

Background: Several high-throughput nucleic acid sequencing platforms are currently available, but a trade-off exists between the cost and number of reads that can be generated versus the read length that can be achieved.

Methodology/principal findings: We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read.

Conclusions/significance: This strategy is broadly applicable to sequencing applications that benefit from low-cost high-throughput sequencing, but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the Illumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing.
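The overlap-and-join idea described in the abstract can be illustrated with a minimal sketch. This is not the SHERA algorithm: the function names, the `min_overlap` and `max_mismatch_frac` parameters, and the rule of accepting the longest overlap within a mismatch budget are all assumptions made for illustration, and base qualities are ignored.

```python
# Illustrative sketch only; not the SHERA implementation.
# Assumes read 2 is sequenced from the opposite strand, so it is
# reverse-complemented before looking for an overlap with read 1.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_frac: float = 0.1):
    """Join two overlapping mates into one composite read.

    Tries overlaps from longest to shortest (suffix of read1 against
    prefix of the reverse-complemented read2) and accepts the first one
    whose mismatch fraction is within the hypothetical budget.
    Returns the composite sequence, or None if no overlap qualifies.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for olen in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-olen:], mate[:olen]))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read1's bases in the overlap; a real tool would build
            # a quality-weighted consensus instead.
            return read1 + mate[olen:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTACCGAT"          # 20 bp insert
    r1 = fragment[:12]                         # forward mate
    r2 = reverse_complement(fragment[-12:])    # reverse mate
    print(merge_pair(r1, r2, min_overlap=4))
```

In this toy case the two 12 bp mates share a 4 bp overlap, so the composite read recovers the full 20 bp insert. Accepting the longest qualifying overlap is a simplification; scoring every candidate overlap would be more robust on repetitive sequence.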

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Size-dependent isolation of DNA fragments from sheared genomic DNA via dSPRI.

AMPure XP SPRI beads bind DNA fragments in a size-dependent manner according to the concentration of salts and polyethylene glycol (PEG) in the reaction, which can easily be changed by using different volume ratios of DNA to SPRI bead solutions. A two-step procedure is employed to isolate targeted DNA size fractions. Panels A to H present Bioanalyzer DNA-1000 assays showing the sheared genomic DNA used as starting material (black), the larger DNA fragments discarded in separation 1 (red), and the size fraction purified and recovered after separation 2 (blue). Panel I is a table summarizing the conditions and results displayed in panels A to H. Panels J and K overlay all Bioanalyzer DNA-1000 traces after separation 1 and after separation 2, respectively, for the conditions presented in panels A to H. The conditions displayed in panel H were used to obtain the Illumina composite reads discussed in the text. The wider DNA fragment size distribution from panel H allowed better analysis of the effects of shorter versus longer overlapping regions on consensus reads.

Figure 2. Reproducibility of double-SPRI.

The panels show DNA fragment size distributions obtained by Bioanalyzer DNA-1000 assays. The curves represent the size fractions removed during the first separation step (A) or recovered after the second separation (B). The two size fractions were independently reproduced in 4 or 8 separation experiments. The curves in B represent the libraries sequenced after dSPRI-based size selection, adapter ligation and PCR enrichment. While concentrations (arbitrary fluorescence units) vary between reproduced libraries, the range of removed or enriched DNA fragment sizes was highly reproducible. Panel C shows the DNA fragment size distribution recovered after the second separation when using decreasing amounts of sheared genomic DNA. dSPRI allows reliable size selection in a DNA concentration-independent manner.

Figure 3. Quality of Illumina reads out to 143 bp.

A) The mean Phred quality of single reads is shown by the solid red lines, with error bars displaying quartiles. Read quality is highly variable toward the reads' ends. Per-base read quality is worse on the second mate of the pair. B) The mean and quartiles of Phred quality by base for the average-length composite read.

Figure 4. High-confidence alignment yield by insert lengths.

Distribution of insert lengths obtained by aligning the original reads to the reference sequence (gray), and lengths of the composite reads retained (red).

Figure 5. Quality of composite fragments after overlapping.

The mean Phred quality (and corresponding error rate) at each position of the read. We show the original Illumina reads and the composite fragments generated by overlapping. A) 143 bp, the length of the original Illumina read; B) 180 bp, the mean length of overlapped fragments in this library; and C) 250 bp, roughly the mean read length generated by 454-FLX technology.
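The correspondence between Phred quality and error rate mentioned in this caption follows the standard definition Q = -10 log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion (the function names are illustrative and not taken from the paper):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-base error probability (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

Under this definition, Q20 corresponds to one expected error per 100 bases and Q30 to one per 1,000. Because the scale is logarithmic, the mean of several Phred scores is generally not the Phred score of the mean error rate.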

Figure 6. Composite Illumina reads constitute a legitimate alternative to pyrosequencing for metagenomics studies.

DNA from a marine metagenomics sample was sequenced with our overlapping mate-pair approach. The composite reads were directly compared to 454-FLX sequences from the exact same sample. A) Fraction of reads that could be assigned to a taxon using the composite reads (mean read length 180 bp), the entire 454-FLX dataset (mean read length 207 bp), or longer 454-FLX reads (mean read length 254 bp). B) Comparison of taxon assignments using composite reads and 454-FLX pyrosequencing reads. The top 25 represented taxa, with colored symbols next to the bracket, are listed in Table S1.

Figure 7. Short insertions and deletions in low-confidence composite Illumina reads.

Mate-reads from the control lane (PhiX174 bacteriophage genome) were used to assess false alignments introduced by the SHERA pipeline. After constructing the composite sequences and filtering, we plotted the difference in length (if any) between a composite fragment and its insert length as predicted by MAQ from mapping the original mate-pairs to the reference. This histogram is a sum of two distributions: the overlapper software's misalignments (a broad Gaussian) and a sharp peak of small (1–2 bp) indels. We used a simple and conservative linear model to remove the indels and infer the number of misalignments.
