NOVOPlasty: de novo assembly of organelle genomes from whole genome data

Nicolas Dierckxsens et al. Nucleic Acids Res. 2017.

Abstract

Advances in next-generation sequencing (NGS) technology have led to the development of many different assembly algorithms, but few of them focus on assembling organelle genomes. These genomes are used in phylogenetic studies and food identification, and they are the most deposited eukaryotic genomes in GenBank. Producing an organelle genome assembly from whole genome sequencing (WGS) data would be the most accurate and least laborious approach, but a tool specifically designed for this task is lacking. We developed a seed-and-extend algorithm that assembles organelle genomes from WGS data, starting from a single seed sequence that may come from a related or a distant species. The algorithm has been tested on several new (Gonioctena intermedia and Avicennia marina) and public (Arabidopsis thaliana and Oryza sativa) whole genome Illumina data sets, where it outperforms known assemblers in assembly accuracy and coverage. In our benchmark, NOVOPlasty assembled all tested circular genomes in less than 30 min with a maximum memory requirement of 16 GB and an accuracy over 99.99%. In conclusion, NOVOPlasty is the sole de novo assembler that provides a fast and straightforward extraction of the extranuclear genomes from WGS data as a single circular, high-quality contig. The software is open source and can be downloaded at https://github.com/ndierckx/NOVOPlasty.

Figures

Figure 1.

Coverage depth for a 12 000 bp long region of the mitochondrial genome of Gonioctena intermedia. There are several regions with a low GC content, resulting in a reduced read coverage.

Figure 2.

Workflow of NOVOPlasty. For simplicity, the workflow is limited to unidirectional extension. (A) All reads are stored in a hash table under a unique id. A second hash table stores, for each ‘read start’ (the leading bases of a read, with the length set by the k-mer parameter, default = 38), the ids of the corresponding reads. (B) Scope of search 1 is the region where a match to a ‘read start’ indicates an extension of the sequence. All matching reads are stored separately. (C) The positions of the paired reads are verified by aligning each paired read to a previously assembled region, which is determined by the library insert size (scope of search 2). (D) A consensus sequence of the different extensions is determined.
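Steps (A), (B) and (D) of this caption can be illustrated with a small Python sketch. It is not NOVOPlasty's code: the function names (`build_tables`, `extend_once`, `assemble`) and the `scope`, `min_support` and `max_len` parameters are hypothetical. It also assumes error-free single-end reads and a seed taken verbatim from the target genome, and it omits the paired-read verification of step (C).

```python
from collections import Counter, defaultdict


def build_tables(reads, k):
    """(A) Store reads under an id and index the ids by each read's first k bases."""
    read_by_id = dict(enumerate(reads))
    ids_by_start = defaultdict(list)
    for rid, seq in read_by_id.items():
        if len(seq) >= k:
            ids_by_start[seq[:k]].append(rid)
    return read_by_id, ids_by_start


def extend_once(contig, read_by_id, ids_by_start, k, scope, min_support=1):
    """(B, D) Collect reads whose start matches inside the last `scope` bases
    of the contig, then return the majority-vote consensus of their overhangs."""
    columns = defaultdict(Counter)  # offset past the contig end -> base counts
    first = max(0, len(contig) - scope)
    for pos in range(first, len(contig) - k + 1):
        for rid in ids_by_start.get(contig[pos:pos + k], ()):
            read = read_by_id[rid]
            # Assumption: require the whole overlap to match exactly.
            if not read.startswith(contig[pos:]):
                continue
            for offset, base in enumerate(read[len(contig) - pos:]):
                columns[offset][base] += 1

    extension = []
    offset = 0
    while offset in columns and sum(columns[offset].values()) >= min_support:
        extension.append(columns[offset].most_common(1)[0][0])
        offset += 1
    return "".join(extension)


def assemble(seed, reads, k=38, scope=None, max_len=1_000_000):
    """Extend the seed in one direction until no read adds new bases."""
    read_by_id, ids_by_start = build_tables(reads, k)
    if scope is None:
        # Assumption: search as far back as the longest read can reach.
        scope = max((len(r) for r in reads), default=0)
    contig = seed
    while len(contig) < max_len:
        extension = extend_once(contig, read_by_id, ids_by_start, k, scope)
        if not extension:
            break
        contig += extension
    return contig
```

The `max_len` guard is there because a circular or repetitive sequence would otherwise keep extending indefinitely. This sketch makes no attempt to detect circularity.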

Figure 3.

Comparison between the NOVOPlasty and CLC alignments of three different chloroplast assemblies against their respective references. (A) CLC and NOVOPlasty assemblies of SRR1174256 (A. thaliana) against GenBank entry AP000423.1. (B) CLC and NOVOPlasty assemblies of ERR477442 (O. sativa) against GenBank entry KM088022.1. (C) CLC assembly of A. marina against the manually inspected NOVOPlasty assembly.

Figure 4.

Score graph derived from the benchmark study. Each property of each assembler was given a score proportional to the scores of the other assemblers. Each score was based on the average results of seven assemblies and is expressed as a percentage. A score of 100% is always the most favorable; a more detailed explanation can be found in the ‘Quality assessment’ section of Materials and Methods. (*) Highest score for the corresponding property.
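One plausible reading of this proportional scoring is sketched below in Python. The exact formula is given in the ‘Quality assessment’ section and may differ. The function name `relative_scores`, the `higher_is_better` flag and the scaling rule are assumptions made for illustration, and all input values are assumed to be positive.

```python
from statistics import mean


def relative_scores(results, higher_is_better=True):
    """Average each assembler's results, then scale so the best one scores 100%.

    `results` maps an assembler name to its per-assembly values for one
    property (hypothetical layout; values are assumed to be positive).
    """
    averages = {name: mean(values) for name, values in results.items()}
    if higher_is_better:
        best = max(averages.values())
        return {name: 100.0 * avg / best for name, avg in averages.items()}
    # For properties such as run time, a smaller value is more favorable.
    best = min(averages.values())
    return {name: 100.0 * best / avg for name, avg in averages.items()}
```

Under this rule, the assembler with the best average for a property always receives 100%, and every other assembler receives its fraction of that best value.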

Figure 5.

Seed compatibility test for the de novo assembly of the human mitochondrial genome with 12 different mitochondrial genomes as seed sequence. A green dot means that the mitochondrial genome of that species can be used as a seed for the mitochondrial assembly of H. sapiens. A red X means that the assembly was unsuccessful. Phylogenetic tree based on information extracted from the NCBI taxonomy database (20), using phyloT (http://phylot.biobyte.de/).

Figure 6.

Seed compatibility test for the de novo assembly of the chloroplast genome of Arabidopsis thaliana with 12 different chloroplast genomes and 12 subunits (RuBP) as seed sequences. A green dot means that the chloroplast genome of that species can be used as a seed for the chloroplast assembly of A. thaliana. A red M indicates that NOVOPlasty assembled the mitochondrial genome instead of the chloroplast genome. The same color codes apply to the RuBP subunit seeds. Phylogenetic tree based on information extracted from the NCBI taxonomy database (20), using phyloT (http://phylot.biobyte.de/).

References

    1. Brozynska M., Furtado A., Henry R.J. Direct chloroplast sequencing: Comparison of sequencing platforms and analysis tools for whole chloroplast barcoding. PLoS One. 2014; 9:e110387.
    2. Bignell G.R., Miller A.R., Evans I.H. Isolation of mitochondrial DNA. Methods Mol. Biol. 1996; 53:109–106.
    3. Jansen R.K., Raubeson L.A., Boore J.L., dePamphilis C.W., Chumley T.W., Haberle R.C., Wyman S.K., Alverson A.J., Peery R., Herman S.J. Elizabeth AZ, Eric HR. Methods for obtaining and analyzing whole chloroplast genome sequences. Methods in Enzymology. 2005; Academic Press; 348–384.
    4. Khan A., Khan I.A., Asif H., Azim M.K. Current trends in chloroplast genome research. Afr. J. Biotechnol. 2010; 9:3494–3500.
    5. Wu J., Liu B., Cheng F., Ramchiary N., Choi S.R., Lim Y.P., Wang X-W. Sequencing of chloroplast genome using whole cellular DNA and Solexa sequencing technology. Front. Plant Sci. 2012; 3:243.
