NASP: an accurate, rapid method for the identification of SNPs in WGS datasets that supports flexible input and output formats

Microb Genom. 2016 Aug 25;2(8):e000074.

doi: 10.1099/mgen.0.000074. eCollection 2016 Aug.

Darrin Lemmer 1, Jason Travis 1, James M Schupp 1, John D Gillece 1, Maliha Aziz 3, Elizabeth M Driebe 1, Kevin P Drees 4, Nathan D Hicks 5, Charles Hall Davis Williamson 2, Crystal M Hepp 2, David Earl Smith 1, Chandler Roe 1, David M Engelthaler 1, David M Wagner 2, Paul Keim 2

Jason W Sahl et al. Microb Genom. 2016.

Abstract

Whole-genome sequencing (WGS) of bacterial isolates has become standard practice in many laboratories. Applications for WGS analysis include phylogeography and molecular epidemiology, using single nucleotide polymorphisms (SNPs) as the unit of evolution. NASP was developed as a reproducible method that scales well with the hundreds to thousands of WGS datasets typically used in comparative genomics applications. In this study, we demonstrate how NASP compares with other tools in the analysis of two real bacterial genomics datasets and one simulated dataset. Our results demonstrate that NASP produces similar, and often better, results in comparison with other pipelines, but is much more flexible in terms of data input types, job management systems, diversity of supported tools and output formats. We also demonstrate differences in results based on the choice of reference genome and on whether phylogenies are inferred from concatenated SNPs or from alignments that include monomorphic positions. NASP represents a source-available, version-controlled, unit-tested method and can be obtained from tgennorth.github.io/NASP.

Keywords: Phylogeography; SNPs; bioinformatics.
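
The distinction the abstract draws between concatenated SNPs and alignments that include monomorphic positions can be illustrated with a minimal Python sketch. This is not NASP's implementation: the function names `split_columns` and `concatenate_snps`, the toy sequences, and the rule of skipping any column that contains a non-ACGT call are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def split_columns(alignment: Dict[str, str]) -> Tuple[List[int], List[int]]:
    """Return (polymorphic, monomorphic) column indices of an alignment.

    Hypothetical helper: columns containing anything other than A, C, G or T
    are skipped entirely, which is an assumed filtering rule.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    polymorphic, monomorphic = [], []
    for i in range(length):
        column = {s[i].upper() for s in seqs}
        if not column <= set("ACGT"):
            continue  # gap or ambiguous call somewhere in this column
        if len(column) > 1:
            polymorphic.append(i)
        else:
            monomorphic.append(i)
    return polymorphic, monomorphic


def concatenate_snps(alignment: Dict[str, str], positions: List[int]) -> Dict[str, str]:
    """Keep only the given columns, yielding a concatenated-SNP alignment."""
    return {name: "".join(seq[i] for i in positions) for name, seq in alignment.items()}


# Toy alignment (made-up sequences, not data from the study).
toy = {
    "ref":     "ACGTACGT",
    "sample1": "ACGTACGA",
    "sample2": "ATGTACGA",
    "sample3": "ACGTNCGT",
}

snp_positions, mono_positions = split_columns(toy)
snp_alignment = concatenate_snps(toy, snp_positions)
```

A tree inferred from `snp_alignment` sees only variable sites, whereas one inferred from the SNP and monomorphic columns together also sees how many sites did not change. That difference can affect branch-length estimates, and it is the kind of choice the abstract reports as influencing results.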

Figures

Fig. 1.

Workflow of the NASP pipeline.

Fig. 2.

NASP benchmark comparisons of walltime (a) and RAM (b) on a set of Escherichia coli genomes. For the walltime comparisons, 3520 E. coli genomes were randomly sampled ten times at different depths and run on a server with 856 cores. Only the matrix-building step is shown; it scales linearly as additional genomes are processed.

Fig. 3.

Dendrogram of tree-building methods applied to a simulated set of mutations in the genome of Yersinia pestis Colorado 92. The topological score was generated with compare2trees (Nye et al., 2006) by comparison with a maximum likelihood phylogeny inferred from a set of 3501 SNPs inserted by Tree2Reads. The dendrogram was generated with the neighbor-joining method in the Phylip software package (Felsenstein, 2005).

References

    1. Aberer A. J., Kobert K., Stamatakis A. (2014). ExaBayes: massively parallel Bayesian tree inference for the whole-genome era. Mol Biol Evol 31, 2553–2556. doi:10.1093/molbev/msu236
    2. Angiuoli S. V., Salzberg S. L. (2011). Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics 27, 334–342. doi:10.1093/bioinformatics/btq665
    3. Bertels F., Silander O. K., Pachkov M., Rainey P. B., van Nimwegen E. (2014). Automated reconstruction of whole-genome phylogenies from short-sequence reads. Mol Biol Evol 31, 1077–1088. doi:10.1093/molbev/msu088
    4. Blattner F. R., Plunkett G., Bloch C. A., Perna N. T., Burland V., Riley M., Collado-Vides J., Rode C. K., et al. (1997). The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462. doi:10.1126/science.277.5331.1453
    5. Bolger A. M., Lohse M., Usadel B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120. doi:10.1093/bioinformatics/btu170

Data Bibliography

    1. Cui, Y. Sequence Read Archive. SRA010790 (2013).
