Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers

Paul Medvedev et al. J Comput Biol. 2011 Nov.

Abstract

The recent proliferation of next-generation sequencing with short reads has enabled many new experimental opportunities but, at the same time, has raised formidable computational challenges in genome assembly. One of the key advances that has led to an improvement in contig lengths has been mate pairs, which facilitate the assembly of repeating regions. Mate pairs have been algorithmically incorporated into most next-generation assemblers as various heuristic post-processing steps to correct the assembly graph or to link contigs into scaffolds. Such methods have allowed the identification of longer contigs than would be possible with single reads; however, they can still fail to resolve complex repeats. Thus, improved methods for incorporating mate pairs will have a strong effect on contig length in the future. Here, we introduce the paired de Bruijn graph, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step. This graph has the potential to be used in place of the de Bruijn graph in any de Bruijn graph-based assembler, maintaining all other assembly steps such as error correction and repeat resolution. Through assembly results on simulated perfect data, we argue that this can effectively improve the contig sizes in assembly.

Figures

FIG. 1.

Mate pairs and the de Bruijn graph. (a) A mate pair is a pair of reads with a distance of d between their start positions. (b) A circular genome S and two mate pairs, with d = 4 and d = 5. (c, d) The de Bruijn graph construction for k = 2. (c) The outside circle shows a separate black edge for each 3-mer (equivalently, each element of the 3-spectrum). The dotted red lines indicate vertices that will be glued. The inner circle shows the result of applying some of the glues. Note that this is an intermediate step of the construction in which we only show the gluings of vertices arising from the same position of S. (d) The final de Bruijn graph, resulting from all the glues.
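
The gluing construction in panels (c) and (d) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`circular_kmers`, `de_bruijn`, `branching_vertices`) and the 7-letter circular string are assumptions chosen for the example, since the genome S of the figure is not reproduced in the caption. Using a dictionary keyed by the k-mer label performs the gluing implicitly, because all vertices with the same label collapse into one key.

```python
from collections import defaultdict


def circular_kmers(s, k):
    """Return the k-mer starting at every position of the circular string s."""
    doubled = s + s[:k - 1]  # wrap around so windows at the end stay full length
    return [doubled[i:i + k] for i in range(len(s))]


def de_bruijn(s, k):
    """Build a de Bruijn graph whose vertices are k-mers and edges are (k+1)-mers.

    Each distinct (k+1)-mer contributes one edge from its k-length prefix to
    its k-length suffix; equal labels share a dictionary key, which is the
    "gluing" step.
    """
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for edge in set(circular_kmers(s, k + 1)):
        prefix, suffix = edge[:-1], edge[1:]
        out_edges[prefix].add(suffix)
        in_edges[suffix].add(prefix)
    return out_edges, in_edges


def branching_vertices(out_edges, in_edges):
    """Vertices with more than one incoming or more than one outgoing edge."""
    vertices = set(out_edges) | set(in_edges)
    return sorted(
        v for v in vertices
        if len(out_edges.get(v, ())) > 1 or len(in_edges.get(v, ())) > 1
    )


# Hypothetical circular string (not the S of Fig. 1), with k = 2 as in the caption.
out_e, in_e = de_bruijn("ACGTCGA", 2)
print(branching_vertices(out_e, in_e))  # ['CG']: the 2-mer CG occurs twice
```

In this toy string the 2-mer CG occurs at two positions, so its two copies are glued into a single vertex with two incoming and two outgoing edges.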

FIG. 2.

The effect of increasing k and d. (a) The number of repeated k-mers in the E. coli genome, for various values of k. (b) The number of repeated (k, d)-mers, for various values of d with k = 50.
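
The two quantities plotted here can be computed as in the sketch below. It rests on two assumptions that the caption does not spell out: a "repeated" k-mer is counted once per distinct k-mer that occurs at more than one position, and a (k, d)-mer is a pair of k-mers whose start positions are d apart, following the mate pair definition in Fig. 1a. The function names and the toy circular string are hypothetical.

```python
from collections import Counter


def kmer_at(s, i, k):
    """The k-mer starting at position i of the circular string s."""
    n = len(s)
    return "".join(s[(i + j) % n] for j in range(k))


def count_repeated_kmers(s, k):
    """Number of distinct k-mers that occur at more than one position."""
    counts = Counter(kmer_at(s, i, k) for i in range(len(s)))
    return sum(1 for c in counts.values() if c > 1)


def count_repeated_kd_mers(s, k, d):
    """Number of distinct (k, d)-mers that occur at more than one position.

    A (k, d)-mer is taken to be the pair (k-mer at i, k-mer at i + d).
    """
    counts = Counter(
        (kmer_at(s, i, k), kmer_at(s, i + d, k)) for i in range(len(s))
    )
    return sum(1 for c in counts.values() if c > 1)


genome = "ACGTCGA"  # hypothetical circular string, not E. coli
for k in (1, 2, 3):
    print(k, count_repeated_kmers(genome, k))  # 1 3, then 2 1, then 3 0
print(count_repeated_kd_mers(genome, 2, 3))    # 0
```

On the toy string, the single repeated 2-mer (CG) disappears once each copy is paired with the 2-mer three positions downstream, which is the effect panel (b) measures at genome scale.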

FIG. 3.

The (approximate) paired de Bruijn graph: (a, b) The paired de Bruijn construction for k = 2, d = 4 from the same string S as in Fig. 1. In (a), the outer circle has an edge from every element of the (3, 4)-spectrum. (b) The paired de Bruijn graph after all the gluings; notice that it has only one branching vertex, versus four in the de Bruijn graph (Fig. 1d). (c–e) The construction of the approximate paired de Bruijn graph for k = 2, d = 5, Δ = 1. In (c), one possible covering spectrum is shown in the outside circle, with black edges for elements with mate pair distance 6 and blue edges for distance 5. Since Δ = 1, we glue vertices if they have equal left labels and their right labels are a distance at most 2 apart from each other in the de Bruijn graph (Fig. 1d). The final multigraph after all vertex gluings is shown in (d), and the resulting simple graph, used to spell the contigs, is shown in (e). Notice that this graph now has three branching vertices.
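
The exact construction of panels (a) and (b) can be sketched as follows; the approximate variant with Δ > 0 in panels (c) to (e) is not covered. This is an illustration under stated assumptions rather than the authors' code: vertices are labeled by a (left k-mer, right k-mer) pair, each element of the (k+1, d)-spectrum contributes one edge, and vertices are glued when both labels match. The function names and the circular string are hypothetical, since the S of the figure is not given in the caption.

```python
from collections import defaultdict


def kmer_at(s, i, k):
    """The k-mer starting at position i of the circular string s."""
    n = len(s)
    return "".join(s[(i + j) % n] for j in range(k))


def paired_de_bruijn(s, k, d):
    """Build a paired de Bruijn graph from the (k+1, d)-spectrum of s.

    Each edge is a pair of (k+1)-mers whose start positions are d apart.
    It runs from the pair of k-length prefixes to the pair of k-length
    suffixes; equal pairs share a dictionary key, which glues them.
    """
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for i in range(len(s)):
        left = kmer_at(s, i, k + 1)
        right = kmer_at(s, i + d, k + 1)
        u = (left[:-1], right[:-1])
        v = (left[1:], right[1:])
        out_edges[u].add(v)
        in_edges[v].add(u)
    return out_edges, in_edges


def count_branching(out_edges, in_edges):
    """Number of vertices with in-degree or out-degree greater than one."""
    vertices = set(out_edges) | set(in_edges)
    return sum(
        1 for v in vertices
        if len(out_edges.get(v, ())) > 1 or len(in_edges.get(v, ())) > 1
    )


genome = "ACGTCGA"  # hypothetical circular string
# With d = 0 both labels coincide, so this matches the unpaired graph.
print(count_branching(*paired_de_bruijn(genome, 2, 0)))  # 1
print(count_branching(*paired_de_bruijn(genome, 2, 3)))  # 0
```

Setting d = 0 makes the left and right labels identical, which gives a convenient unpaired baseline for comparing branching vertex counts in the same code path.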

FIG. 4.

Example of the standard and paired de Bruijn graphs: The reads are the (5,12)-spectrum generated from the cyclic sequence ATCGGGATGACTATGTCGCTCCTAATCGGGAAGACTATGCCGCTCCTT. (a) The de Bruijn graph with edges constructed from the set of 5-mers in the (5,12)-spectrum. Each node is a rectangle labeled by a 4-mer with the node ID shown as a large red number on the left of the node. The mate pair information is also presented in the graph: for each node, the node IDs of its corresponding right 4-mers are shown as small numbers on the right of the rectangle. For instance, the right 4-mers (blue dotted lines) of CGGG (node 3) are GTCG (node 21) and GCCG (node 22) and we write 21 and 22 on the right side of node 3. Note that there is not a single mate pair with a unique path between the mates, making mate pair transformations impossible. (b) The paired de Bruijn graph from the (5,12)-spectrum is a cycle, representing a single contig. In this example, the paired approach allows for longer contigs than would mate pair transformations (though there are also examples when the opposite is true).
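
Two statements in this caption can be checked directly on the cyclic sequence it quotes: the right 4-mers attached to node CGGG, and the claim that the paired graph is a single cycle. The sketch below is an illustration only; the function names are hypothetical, and the paired edges are derived from consecutive positions of the known sequence, which yields the same edge set as the (5,12)-spectrum for this error-free example.

```python
from collections import defaultdict

GENOME = "ATCGGGATGACTATGTCGCTCCTAATCGGGAAGACTATGCCGCTCCTT"
K, D = 4, 12  # node label length and mate distance from the caption


def kmer_at(s, i, k):
    """The k-mer starting at position i of the circular string s."""
    n = len(s)
    return "".join(s[(i + j) % n] for j in range(k))


def right_mates(s, k, d):
    """Map each k-mer to the set of k-mers found d positions downstream."""
    mates = defaultdict(set)
    for i in range(len(s)):
        mates[kmer_at(s, i, k)].add(kmer_at(s, i + d, k))
    return mates


def paired_successors(s, k, d):
    """Successor sets of the paired graph; vertices are (left, right) pairs."""
    succ = defaultdict(set)
    for i in range(len(s)):
        u = (kmer_at(s, i, k), kmer_at(s, i + d, k))
        v = (kmer_at(s, i + 1, k), kmer_at(s, i + 1 + d, k))
        succ[u].add(v)
    return succ


def spell_cycle(succ, start):
    """Walk unique successors from start and spell the first letter of each
    left label. Returns None if a branch is met or the walk does not return
    to start within len(succ) steps."""
    letters, node = [], start
    for _ in range(len(succ)):
        if len(succ.get(node, ())) != 1:
            return None
        letters.append(node[0][0])
        (node,) = succ[node]
        if node == start:
            return "".join(letters)
    return None


print(sorted(right_mates(GENOME, K, D)["CGGG"]))  # ['GCCG', 'GTCG']
succ = paired_successors(GENOME, K, D)
start = (kmer_at(GENOME, 0, K), kmer_at(GENOME, D, K))
print(spell_cycle(succ, start) == GENOME)  # True
```

The unpaired graph on this sequence has 36 distinct 4-mer nodes for 48 positions, whereas all 48 paired labels are distinct, which is why the walk returns to its start after spelling the whole sequence.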

FIG. 5.

Contig lengths. Cumulative contig lengths (for standard and paired de Bruijn graphs) on simulated data with perfect coverage. Contigs are sorted in order from largest to smallest. Point (x, y) means the largest x contigs have cumulative length y. (a) To analyze the effect of the insert size (IS) on the assembly, we kept the read length fixed at 50, but varied the insert size. We also generated non-paired reads of length 50. For E. coli, the curve for insert size 6000 is not shown because there was only one contig, representing the whole genome. (b) To analyze the effect of read length on contig lengths, we fixed the insert size to 1000 but varied the read length. We also generated non-paired reads of length 1000, giving an upper bound on how good the assembly can be in this case. (c) To analyze the effect of variations in the insert size (Δ), we fixed the mean insert size (1000) and read length (50). We also show the baseline contig lengths in a non-paired dataset, with read length 50 and perfect coverage.
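
The curves described here follow from a sort and a running sum. The sketch below shows how the points (x, y) could be derived from a list of contig lengths; the function name and the example lengths are hypothetical and are not data from the paper.

```python
from itertools import accumulate


def cumulative_contig_lengths(contig_lengths):
    """Return points (x, y): the x largest contigs have cumulative length y."""
    ordered = sorted(contig_lengths, reverse=True)  # largest contig first
    return list(enumerate(accumulate(ordered), start=1))


# Hypothetical contig lengths, for illustration only.
print(cumulative_contig_lengths([300, 1200, 500]))
# [(1, 1200), (2, 1700), (3, 2000)]
```

A curve that rises to the genome length within a few points indicates an assembly dominated by long contigs; a single point corresponds to the one-contig case mentioned for E. coli at insert size 6000.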
