Semi-automatic in silico gap closure enabled de novo assembly of two Dehalobacter genomes from metagenomic data

Shuiquan Tang et al. PLoS One. 2012.

Abstract

Typically, the assembly and closure of a complete bacterial genome requires substantial additional effort spent in a wet lab for gap resolution and genome polishing. Assembly is further confounded by subspecies polymorphism when starting from metagenome sequence data. In this paper, we describe an in silico gap-resolution strategy that can substantially improve assembly. This strategy resolves assembly gaps in scaffolds using pre-assembled contigs, followed by verification with read mapping. It is capable of resolving assembly gaps caused by repetitive elements and subspecies polymorphisms. Using this strategy, we realized the de novo assembly of the first two Dehalobacter genomes from the metagenomes of two anaerobic mixed microbial cultures capable of reductive dechlorination of chlorinated ethanes and chloroform. Only four additional PCR reactions were required even though the initial assembly with Newbler v. 2.5 produced 101 contigs within 9 scaffolds belonging to two Dehalobacter strains. By applying this strategy to the re-assembly of a recently published genome of Bacteroides, we demonstrate its potential utility for other sequencing projects, both metagenomic and genomic.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1

Figure 1. Microbial composition determined by 16S rRNA pyrotag sequences from the ACT-3 parent culture and two subcultures.

The ACT-3 culture contains two Dehalobacter strains, strain CF50 and strain DCA. Strain CF50 was inherited by the CF subculture, expressing reductive dehalogenase CfrA, which dechlorinates CF and 1,1,1-TCA. Strain DCA was inherited by the DCA subculture, expressing reductive dehalogenase DcrA, which dechlorinates 1,1-DCA. The microbial composition was determined by pyrotag sequencing of the 16S rRNA gene.

Figure 2

Figure 2. Overview of the in silico gap-resolution process.

(a) The principle of the Perl program that automates the search for overlapping contigs that close an assembly gap. (b) A typical output of the Perl program, shown for gap 00973-G-00974. (c) The solutions to gap 00973-G-00974 represented as a multiple sequence alignment created and visualized with Geneious Pro.
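The search for contigs that close a gap can be illustrated with a minimal Python sketch. It is not the authors' Perl program: it assumes exact suffix/prefix matching (a real tool would tolerate mismatches), and the names `suffix_prefix_overlap`, `find_bridging_contigs`, and the `min_len` parameter are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def suffix_prefix_overlap(left: str, right: str, min_len: int = 20) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def find_bridging_contigs(west: str, east: str, contigs: dict, min_len: int = 20):
    """List contigs that overlap both the west flank and the east flank of a gap.

    Each hit is (contig_id, orientation, west_overlap, east_overlap), where
    orientation is "F" (forward) or "R" (reverse complement).
    """
    hits = []
    for contig_id, seq in contigs.items():
        for orientation, s in (("F", seq), ("R", reverse_complement(seq))):
            w = suffix_prefix_overlap(west, s, min_len)
            e = suffix_prefix_overlap(s, east, min_len)
            if w and e:
                hits.append((contig_id, orientation, w, e))
    return hits


# Toy example with short sequences (min_len lowered accordingly).
west = "GATTACAGGCT"
east = "CCGTAAGTTGA"
bridge = "AGGCTTTTTCCGTA"
contigs = {"c1": bridge, "c2": reverse_complement(bridge), "c3": "GGGGGGGG"}
hits = find_bridging_contigs(west, east, contigs, min_len=4)
```

Candidate bridges found this way would still need to be verified, for example by read mapping as the abstract describes.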

Figure 3

Figure 3. Separation of the genome of strain CF50 by progressive read-mapping.

(a) The result of the first read mapping against the draft reference genome. (b) The result of the last read mapping against the refined reference genome. Illumina read pairs from the CF metagenome, which only has the genome of strain CF50, were mapped against a reference genome derived from a chimeric Dehalobacter genome from the ACT-3 metagenome, which has both strain CF50 and strain DCA. The progressive read-mapping process as described resulted in the refined genome (Figure 3b), representing the genome of strain CF50. Regions that have coverage lower than 5x are highlighted in red. The read depth is highlighted in green when both DNA strands were covered and in yellow when only one strand was covered.
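Flagging regions with coverage below 5x can be sketched as follows. This is an illustration only, assuming per-base read depth is already available as a list of integers; the function name `low_coverage_regions` is hypothetical and is not taken from the paper.

```python
def low_coverage_regions(depth, threshold=5):
    """Return half-open (start, end) intervals where depth < threshold.

    `depth` is a sequence of per-base read depths, indexed from 0.
    """
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d < threshold:
            if start is None:
                start = pos  # a low-coverage run begins here
        elif start is not None:
            regions.append((start, pos))  # the run ended at the previous base
            start = None
    if start is not None:
        regions.append((start, len(depth)))  # run extends to the end
    return regions


example_depth = [10, 10, 3, 2, 8, 0, 0, 0, 9, 4]
flagged = low_coverage_regions(example_depth)
```

In a progressive read-mapping loop, intervals like these would mark where the reference likely still carries sequence from the other strain.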

Figure 4

Figure 4. Contig distribution in the ACT-3 metagenome.

Based on average read depth, the contigs were grouped into 4 regions. Region A: multi-copy contigs in the Dehalobacter genomes (read depth>90); Region B: contigs shared by both Dehalobacter strains (read depth ∼70); Region C: contigs specific to each Dehalobacter strain (read depth ∼35); Region D: contigs that belong to other organisms of lower abundance (read depth<20).
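The grouping by read depth can be expressed as a small classifier. The cutoffs of 90 and 20 come from the caption; the boundary between Regions B and C is not stated, so the midpoint of 52.5 used here is purely an assumption, as is the function name `classify_contig`.

```python
def classify_contig(read_depth: float, b_c_cutoff: float = 52.5) -> str:
    """Assign a contig to Region A, B, C, or D from its average read depth.

    The 90 and 20 thresholds follow the figure caption. `b_c_cutoff` is an
    assumed boundary halfway between the ~70 and ~35 depth clusters.
    """
    if read_depth > 90:
        return "A"  # multi-copy contigs
    if read_depth < 20:
        return "D"  # lower-abundance organisms
    if read_depth >= b_c_cutoff:
        return "B"  # shared by both strains
    return "C"      # specific to one strain


labels = [classify_contig(d) for d in (120, 70, 35, 10)]
```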

Figure 5

Figure 5. Typical gaps in Group A.

(a) The resolution of gap 00237-G-00238. (b) The resolution of gap 00240-G-00241. (c) The sequence alignment of the consensus sequences of gap 00237-G-00238 and gap 00240-G-00241. All DNA sequence alignments (including those in other figures) were generated with Geneious Pro, having the same format. As shown in Figure 5a, most sequence identifiers consist of three regions. Region 1 shows the ID of the sequence. Region 2 indicates some specific tags: “W” means the sequence is the last 1000 bp nucleotides adapted from the 3′ end of the contig, and it is on the west side of the gap; “E” means the sequence is the first 1000 bp adapted from the 5′ end of the contig, and it is on the east of the gap; “F” means the sequence is a whole contig and in its forward orientation; “R” means the sequence is a whole contig but in its reverse orientation. Region 3 shows the average read depth of the contig from which the sequence is derived. The sequence alignment is shown on the right hand side. Marks on the top show the scale; the alignment mismatches are highlighted in black and the matches in grey; gaps in sequences are indicated in dashes. In some Figures (e.g., Figures 7, 10, 14) the identity of the overlapping sequences is shown on top of the alignment as a coloured bar; positions with 100% identity are in green and positions with lower identity are in yellow.
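The four tags can be mirrored in a short sketch: W and E take 1000 bp flanks from the contigs on either side of the gap, while F and R keep a whole contig in forward or reverse-complement orientation. The helper names `gap_flanks` and `oriented` are hypothetical, and the exact identifier layout used in the figures is not reproduced here.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gap_flanks(west_contig: str, east_contig: str, flank: int = 1000) -> dict:
    """Return the sequences tagged W and E for one gap.

    W: last `flank` bases (3' end) of the contig on the west side of the gap.
    E: first `flank` bases (5' end) of the contig on the east side of the gap.
    Contigs shorter than `flank` are returned whole.
    """
    return {"W": west_contig[-flank:], "E": east_contig[:flank]}


def oriented(contig: str, tag: str) -> str:
    """Return a whole contig in forward ("F") or reverse ("R") orientation."""
    if tag == "F":
        return contig
    if tag == "R":
        return reverse_complement(contig)
    raise ValueError(f"unknown orientation tag: {tag!r}")


flanks = gap_flanks("AAAACCCC", "GGGGTTTT", flank=3)
reversed_contig = oriented("AACG", "R")
```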

Figure 6

Figure 6. Typical gaps in Group B.

Five gaps caused by the presence of a multi-copy contig, contig01468, are shown. Notably, although part of contig01468 is shared by all five gaps, the terminal part on the 5′ edge of contig01468F (highlighted with rectangles) only belongs in the last gap. It would be more reasonable to assemble the raw reads in this region to contig03616, but Newbler did not do so. The consequence is that this kind of poor overlap (as shown in the first four gaps) prevailed in the resolution of gaps caused by multi-copy contigs. Accordingly, these poorly overlapping edges of the multi-copy contigs were trimmed in the construction of consensus solutions.

Figure 7

Figure 7. Typical gaps in Group C.

(a) The resolution of gap 00289-G-00290. (b) The resolution of gap 00290-G-00291. In Figure 7a and 7b, “pairs of alternative contigs” are highlighted in single brackets; contig01244 and contig01245 are highlighted with an asterisk. (c) The schematic graph showing the relationship between scaffold003 and scaffold129. Contigs are represented by straight lines with contig ID on the top and average read depth at the bottom; curved arrows indicate scaffolding relationships.

Figure 8

Figure 8. Typical gaps in Group D.

(a) The insertion or deletion of contig00271. (b) The insertion or deletion of contig01388. The sequences highlighted with an asterisk are raw reads that are suppressed at the edges of different contigs. Sequence edges that are highlighted in rectangles should be trimmed in generating consensus sequences.

Figure 9

Figure 9. Assessment of the assembly using gap-distance comparisons.

When the preceding contig and the succeeding contig overlapped directly with each other, the gap distance was negative with the value equal to the length of the overlapped region. However, all gap distances calculated from Newbler were positive and the minimum value was 20 (the details of Newbler’s calculation are unknown). This explains why some gaps lie below the horizontal axis. Most gaps from Group D have insertion or deletion of a multi-copy sequence: insertion in one strain and deletion in the other strain. The gap distance based on insertion is longer than the one based on deletion. For simplicity, we calculated gap distance assuming insertion, while Newbler’s estimations should be average values between the gap distance in the case of deletion and the one in the case of insertion, depending on the mate pairs used for calculation. This likely explains why most gaps from Group D lie above the diagonal line. The gap distances for gaps 00285-G-00286 and 00239-G-00240 (highlighted by arrows) are consistent with mate-pair predictions if one assumes the existence of the tandem copies of the multi-copy sequences involved.
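A gap-distance calculation along these lines can be sketched as below. The negative value for directly overlapping contigs follows the caption; the formula for a gap closed by a bridging sequence (bridge length minus its overlaps with the two flanks) is an assumption made for illustration, and both function names are hypothetical.

```python
def gap_distance_direct(overlap_len: int) -> int:
    """Gap distance when the two flanking contigs overlap each other directly.

    The distance is negative, with magnitude equal to the overlap length.
    """
    return -overlap_len


def gap_distance_bridged(bridge_len: int, west_overlap: int, east_overlap: int) -> int:
    """Assumed gap distance when a bridging sequence closes the gap.

    Counts only the bridge bases that lie between the two flanking contigs,
    that is, the bridge length minus the parts overlapping each flank.
    """
    return bridge_len - west_overlap - east_overlap


direct = gap_distance_direct(35)
bridged = gap_distance_bridged(bridge_len=14, west_overlap=5, east_overlap=5)
```

Values computed this way could then be compared with the scaffolder's mate-pair estimate for the same gap.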

Figure 10

Figure 10. Combinations of alternative scaffolds.

(a) The combination of scaffold095 and scaffold054. (b) Traces of homology between contig00974 and contig01154R. (c) Traces of homology between contig00975 and contig01153R. (d) The combination of scaffold041 and scaffold003. Contigs are represented by straight lines with contig ID on the top and average read depth at the bottom; curved arrows indicate scaffolding relationships.

Figure 11

Figure 11. Schematic of the draft chimeric Dehalobacter genome from the ACT-3 metagenome.

The major scaffolds and contigs are represented as straight lines with contig and scaffold IDs labeled; contigs shared by both strains are in blue; contigs specific to strain CF50 are in red; contigs specific to strain DCA are in green.

Figure 12

Figure 12. Gap 00229-G-00230.

(a) The sequence alignment of related contigs. (b) The dot plot of the consensus sequence from (a) against itself.

Figure 13

Figure 13. The incomplete resolution of three gaps in which the 5S, 16S, and 23S rRNA genes are located.

Straight lines represent contigs with contig IDs (top) and average read depth (bottom) indicated. The three lines between contig01315 and contig05122 indicate three potential connections between them; and the two lines between contig01504 and contig01997 indicate two potential connections.

Figure 14

Figure 14. The alignment of the two Dehalobacter genomes: strain CF50 and strain DCA.

Figure 15

Figure 15. Alignment of the published assembly versus the new (this study) assembly of the B. salanitronis genome.

The positions of assembly gaps caused by the 6 copies of the rRNA operons are indicated as Regions 1–6. Regions 7 and 8 indicate the two large regions of disagreement.

Grants and funding

Metagenome sequencing of the ACT-3 culture was provided by the U.S. Department of Energy Joint Genome Institute through the Community Sequencing Program (CSP 2010). Support was provided by the Government of Canada through Genome Canada and the Ontario Genomics Institute (2009-OGI-ABC-1405). Support was also provided by the Government of Ontario through the ORF-GL2 program and the United States Department of Defense through the Strategic Environmental Research and Development Program (SERDP) under contract W912HQ-07-C-0036 (project ER-1586). S.T. received awards from the Government of Ontario through the Ontario Graduate Scholarships in Science and Technology (OGSST) and the Natural Sciences and Engineering Research Council of Canada (NSERC PGS B). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
