Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps

Isheng J Tsai et al. Genome Biol. 2010.

Abstract

Advances in sequencing technology allow genomes to be sequenced at vastly decreased costs. However, the resulting assemblies are frequently highly fragmented, with many gaps. We present a practical approach that uses Illumina sequences to improve draft genome assemblies by aligning sequences against contig ends and performing local assemblies to produce gap-spanning contigs. The continuity of a draft genome can thus be substantially improved, often without the need to generate new data.

Figures

Figure 1

Overview of the IMAGE process. Step one: Illumina reads are aligned against the initial assembly. Step two: reads that align to contig ends adjacent to gaps, along with their non-aligning mates, are assembled into new contigs, which are then mapped back to the initial assembly. Step three: Illumina reads are aligned against the updated assembly, and the whole process is repeated iteratively until the gap is closed.
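The read-selection part of step two can be illustrated with a minimal Python sketch. It is not the IMAGE implementation: the `Alignment` record, the function names, and the `window` parameter (how close to a contig end an alignment must lie) are assumptions made for illustration, and the alignment and local assembly themselves would be done by external tools that are not shown.

```python
from typing import NamedTuple, Optional

class Alignment(NamedTuple):
    """Hypothetical record for one aligned read (not an IMAGE data structure)."""
    contig: str
    start: int  # 0-based start of the aligned read on the contig
    end: int    # exclusive end of the aligned read on the contig

def near_contig_end(aln, contig_lengths, window):
    """True if the alignment lies within `window` bases of either contig end."""
    length = contig_lengths[aln.contig]
    return aln.start < window or aln.end > length - window

def select_gap_pairs(pairs, contig_lengths, window):
    """Return names of read pairs with at least one mate aligned near a contig end.

    Each pair is (name, mate1, mate2); a mate is None if it did not align.
    The unaligned mate is kept with its partner because it may fall in the gap.
    """
    selected = []
    for name, mate1, mate2 in pairs:
        aligned = [m for m in (mate1, mate2) if m is not None]
        if any(near_contig_end(m, contig_lengths, window) for m in aligned):
            selected.append(name)
    return selected

# Toy data, invented for illustration only.
contig_lengths = {"c1": 1000}
pairs = [
    ("p1", Alignment("c1", 950, 1000), None),                    # end-aligned, mate unaligned
    ("p2", Alignment("c1", 400, 450), Alignment("c1", 600, 650)),  # both internal
    ("p3", None, None),                                          # neither mate aligned
    ("p4", Alignment("c1", 0, 50), Alignment("c1", 200, 250)),   # one mate at the left end
]
gap_pairs = select_gap_pairs(pairs, contig_lengths, window=100)
```

In an iterative scheme like the one in the caption, the selected pairs would be passed to a local assembler, the resulting contigs merged into the assembly, and the selection repeated on the updated contig ends.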

Figure 2

Statistics of sequences at closed gaps in the Echinococcus multilocularis assembly. (a) Frequency distribution of the lengths of newly inserted sequences at gaps. (b) The closed gap length is positively correlated with the gap length estimated by the Arachne assembler (Pearson's r = 0.44, P < 0.001).
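For readers unfamiliar with the statistic in panel (b), the following short Python sketch shows how a Pearson correlation coefficient between estimated and closed gap lengths could be computed. The function name and the example values are invented for illustration and are not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)

# Invented example values (base pairs), not taken from the paper.
estimated_gap_lengths = [100, 250, 400, 800]
closed_gap_lengths = [120, 200, 500, 650]
r = pearson_r(estimated_gap_lengths, closed_gap_lengths)
```

A value of r near 1 would mean the assembler's gap estimates track the closed lengths closely; the reported r = 0.44 indicates a positive but moderate relationship.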

Figure 3

An example of a gap closed with two iterations of IMAGE in Plasmodium berghei. In the first iteration, IMAGE extended the contig consensus sequence from the right side of the gap, indicated by the green bar. In the second iteration, reads were aligned to the updated contig end. Local assembly of these reads along with their unaligned mates produced a new contig that completely closed the gap, indicated by the red bar. The horizontal lines above the bars denote the Illumina reads realigned to the updated consensus sequence after each iteration. Below, a zoomed-in plot shows the Illumina reads realigned against the closed gap.

Figure 4

Closing gaps in a de novo assembly comprising only Illumina reads. Schematic diagram comparing the original Velvet assembly (three contigs: a, b and c) with the improved assembly in Salmonella enterica. The improved assembly was aligned to the reference sequence with 99.8% identity. The two closed gaps shown were 100% identical to the reference sequence. Contigs are indicated by grey bars; gene annotations are indicated by yellow boxes. Vertical lines highlight the gaps that are filled by IMAGE in the improved contigs. Below, a coverage plot shows the relatively even depth of coverage of realigned Illumina reads across the improved assembly, indicating no signature of misassembly.
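A coverage plot like the one described here is built from per-base read depth. The sketch below shows one simple way to compute it in Python from realigned read intervals. The function names, the interval representation (0-based, end-exclusive, lying within the contig), and the zero-depth check are assumptions for illustration, not the authors' procedure.

```python
def depth_per_base(read_intervals, contig_length):
    """Per-base depth from (start, end) read intervals, 0-based and end-exclusive.

    Assumes 0 <= start < end <= contig_length for every interval.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (contig_length + 1)
    for start, end in read_intervals:
        diff[start] += 1
        diff[end] -= 1
    depth = []
    running = 0
    for delta in diff[:contig_length]:
        running += delta
        depth.append(running)
    return depth

def zero_depth_positions(depth):
    """Positions with no read support, which would warrant a closer look."""
    return [i for i, d in enumerate(depth) if d == 0]

# Toy example, invented for illustration: two overlapping reads on a 6 bp contig.
depth = depth_per_base([(0, 3), (2, 5)], contig_length=6)
uncovered = zero_depth_positions(depth)
```

Sharp drops or spikes in such a depth profile around a filled gap are the kind of signal a coverage plot is inspected for; relatively even depth, as in this figure, suggests the inserted sequence is supported by the reads.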
