QUAST: quality assessment tool for genome assemblies

Journal Article

Alexey Gurevich*, Vladislav Saveliev, Nikolay Vyahhi and Glenn Tesler

1 Algorithmic Biology Laboratory, St. Petersburg Academic University, Russian Academy of Sciences, St. Petersburg 194021, Russia
2 Department of Mathematics, University of California, San Diego, La Jolla, CA 92093-0112, USA

*To whom correspondence should be addressed.

Received: 07 October 2012; Revision received: 11 February 2013; Accepted: 14 February 2013; Published: 19 February 2013

Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, Glenn Tesler, QUAST: quality assessment tool for genome assemblies, Bioinformatics, Volume 29, Issue 8, April 2013, Pages 1072–1075, https://doi.org/10.1093/bioinformatics/btt086

Abstract

Summary: Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. Here, we present QUAST, a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with and without a reference genome. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. In this study, we used QUAST to compare several genome assemblers on three datasets. QUAST tables and plots for all of them are available in the Supplementary Material, and interactive versions of these reports are on the QUAST website.

Availability: http://bioinf.spbau.ru/quast

Contact: gurevich@bioinf.spbau.ru

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Modern DNA sequencing technologies cannot produce the complete sequence of a chromosome. Instead, they generate large numbers of reads, ranging from dozens to thousands of consecutive bases, sampled from different parts of the genome. Genome assembly software combines the reads into larger regions called contigs. However, current sequencing technologies and software face many complications that impede reconstruction of full chromosomes, including errors in reads and large repeats in the genome.

Different assembly programs use different heuristic approaches to tackle these challenges, resulting in many differences in the contigs they output. This leads to the questions of how to assess the quality of an assembly and how to compare different assemblies.

Recently, there has been a lot of work on developing comprehensive ways to compare different assemblers.

Plantagora (Barthelson et al., 2011) is a web-based platform that helps scientists view characteristics of the most popular sequencing strategies (including sequencing platforms and assembly software) for plant genomes. Plantagora has a well-designed interface for browsing its database of evaluation results. Researchers may run the Plantagora assessment tool on their own assembly, but the results cannot be viewed through this friendly user interface; instead, the user has to parse a large log file.

The Assemblathon competition (Earl et al., 2011) compared 41 de novo assemblies on >100 evaluation metrics. The Assemblathon assessment scripts are freely available, but they are highly focused on the genomes used in the competition, and normal users cannot easily apply them to other genomes.

Another freely available genome assembly assessment tool is GAGE (Salzberg et al., 2011). In Salzberg et al. (2011), it was used to evaluate several leading genome assemblers on four datasets. GAGE evaluates a set of metrics, including different types of misassembly errors (inversions, relocations and translocations).

Plantagora and GAGE can only be used to evaluate assemblies of datasets with a known reference genome; thus, they are not suitable for evaluating assemblies of previously unsequenced genomes. Additionally, GAGE can only be run on one dataset at a time; therefore, to compare multiple assemblers on the same dataset, one has to manually combine output from separate GAGE reports into a table.

We introduce QUAST, a new assembly quality assessment tool. QUAST evaluates a broad range of metrics needed by various users, while keeping their number small enough that all of them remain easy to interpret. The interface and visualizations are easy to use, representative and informative. QUAST can evaluate assembly quality even without a reference genome, so researchers can assess assemblies of new species that do not yet have a finished reference genome. In addition, QUAST is fast, and its most time-consuming steps are parallelized, so it runs effectively on multi-core processors. See Supplementary Table S1 for QUAST’s performance on different genomes.

2 METHODS

2.1 Metrics

QUAST aggregates methods and quality metrics from existing software, such as Plantagora, GAGE, GeneMark.hmm (Lukashin and Borodovsky, 1998) and GlimmerHMM (Majoros et al., 2004), and extends these with new metrics. For example, the well-known N50 statistic can be artificially increased by concatenating contigs, at the expense of increasing the number of misassemblies; QUAST introduces a new statistic, NA50, to counter this.
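
To make the motivation for NA50 concrete, the following minimal Python sketch (illustrative only and with made-up contig lengths; it is not QUAST's implementation) computes the standard N50 and shows how simply concatenating two contigs inflates it without resolving any additional sequence.

```python
# A minimal, illustrative sketch (not QUAST's implementation): how
# concatenating contigs can inflate N50. All contig lengths are made up.

def n50(lengths):
    """Smallest length L such that contigs of length >= L together cover
    at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

honest = [90_000, 80_000, 50_000, 40_000, 30_000, 10_000]   # hypothetical assembly
print(n50(honest))          # 80000

# Gluing the two largest contigs end to end (a likely misassembly) raises
# N50 even though no new sequence has actually been resolved.
inflated = [90_000 + 80_000, 50_000, 40_000, 30_000, 10_000]
print(n50(inflated))        # 170000
```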

QUAST uses the Nucmer aligner from MUMmer v3.23 (Kurtz et al., 2004) to align assemblies to a reference genome and to evaluate the metrics that depend on these alignments. QUAST also computes metrics that are useful for assessing assemblies of previously unsequenced species, whereas most other assembly assessment software requires a reference genome.
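
For readers who wish to reproduce this alignment step outside QUAST, a rough sketch follows. The paper does not specify the exact Nucmer options QUAST passes, so the '-p' output prefix and the show-coords post-processing shown here are only standard MUMmer 3 usage, not QUAST's actual invocation.

```python
# Rough sketch of the alignment step that QUAST automates: align an assembly
# to a reference with Nucmer (MUMmer 3.23) and dump alignment coordinates.
# The exact options QUAST passes are not stated in the paper; '-p' (output
# prefix) and show-coords' -r/-l/-c flags are just standard MUMmer 3 usage.
import subprocess

def align_to_reference(reference_fasta, contigs_fasta, prefix="aln"):
    subprocess.run(["nucmer", "-p", prefix, reference_fasta, contigs_fasta],
                   check=True)
    result = subprocess.run(["show-coords", "-r", "-l", "-c", prefix + ".delta"],
                            check=True, capture_output=True, text=True)
    return result.stdout   # per-contig alignment coordinates for later metrics

# Example (file names are placeholders):
# print(align_to_reference("reference.fasta", "contigs.fasta"))
```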

We will split the metrics evaluated by QUAST into several groups. Most have been used in previous studies, but some are new to QUAST.

2.1.1 Contig sizes

The metrics in this group (except for NGx, which needs the reference genome length) can be evaluated with or without a reference genome. We also provide filtered versions of them, restricted to contigs longer than a specified minimum size, so that short contigs of little practical use are excluded.
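
To illustrate the difference between Nx and NGx, and the minimum-length filter just mentioned, here is a hypothetical sketch (not QUAST code): Nx uses the total assembly length as its denominator, whereas NGx uses the reference genome length.

```python
# Illustrative sketch (not QUAST code) of Nx vs. NGx with an optional minimum
# contig length filter. Nx uses the assembly length as its denominator; NGx
# uses the reference genome length, so it needs a reference (or its size).

def nx(lengths, x, denominator=None, min_contig=0):
    lengths = [l for l in lengths if l >= min_contig]
    if denominator is None:
        denominator = sum(lengths)       # Nx: fraction of the assembly itself
    target = denominator * x / 100.0     # NGx: pass the reference length here
    running = 0
    for l in sorted(lengths, reverse=True):
        running += l
        if running >= target:
            return l
    return 0                             # assembly too short to reach x% of target

contigs = [50_000, 40_000, 30_000, 20_000] + [1_000] * 100   # hypothetical
print(nx(contigs, 50))                          # N50  = 30000
print(nx(contigs, 50, denominator=280_000))     # NG50 = 20000 (280 kb reference)
print(nx(contigs, 50, min_contig=2_000))        # N50  = 40000 with short contigs excluded
```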

2.1.2 Misassemblies and structural variations

The metrics in this group describe structural errors in the contigs. QUAST can evaluate them only with respect to a known reference genome. If the reference genome exactly matches the dataset being assembled, differences may be attributed to misassemblies introduced by the assembly software or to sequencing errors, such as chimeric reads. Sometimes one uses a reference genome that is related to, but different from, the dataset being sequenced. In this case, the differences may still be misassemblies, but they may also be true structural variations, such as rearrangements, large indels, different repeat copy numbers and so forth.
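
To illustrate the kind of check involved, the sketch below classifies the junction between two adjacent aligned blocks of a single contig into the common misassembly types (relocation, inversion, translocation). It follows the Plantagora-style definition only loosely, and the 1 kbp distance threshold is an assumed default rather than a value taken from this paper.

```python
# Sketch of classifying the junction between two adjacent aligned blocks of
# one contig into the common misassembly types (relocation, inversion,
# translocation). This only loosely follows the Plantagora-style definition
# referred to in the text; the 1 kbp threshold is an assumed default.
from collections import namedtuple

Block = namedtuple("Block", "ref_chrom ref_start ref_end strand")

def classify_junction(left, right, max_gap=1000):
    if left.ref_chrom != right.ref_chrom:
        return "translocation"      # flanks map to different chromosomes
    if left.strand != right.strand:
        return "inversion"          # flanks map to opposite strands
    gap = right.ref_start - left.ref_end
    if abs(gap) > max_gap:
        return "relocation"         # flanks lie too far apart (or overlap too much)
    return "consistent"             # may still be a true structural variant

print(classify_junction(Block("chr1", 1_000, 5_000, "+"),
                        Block("chr1", 150_000, 160_000, "+")))   # relocation
```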

2.1.3 Genome representation and its functional elements

The metrics in this group evaluate genome representation in the contigs and the number of assembled functional elements, such as genes and operons. Most of them require a reference genome.
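
As a concrete example of one metric in this group, genome fraction can be sketched as the percentage of reference positions covered by at least one aligned contig block (a simplified illustration, not QUAST's implementation).

```python
# Simplified illustration (not QUAST's implementation) of the genome fraction
# metric: the percentage of reference bases covered by at least one aligned
# contig block. Intervals are 0-based, half-open (start, end) positions.

def genome_fraction(aligned_blocks, reference_length):
    covered = 0
    last_end = 0
    for start, end in sorted(aligned_blocks):   # merge overlapping intervals
        start = max(start, last_end)
        if end > start:
            covered += end - start
            last_end = end
    return 100.0 * covered / reference_length

# Hypothetical alignments on a 10 kb reference:
print(genome_fraction([(0, 4_000), (3_500, 6_000), (8_000, 9_000)], 10_000))  # 70.0
```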

2.1.4 Variations of N50 based on aligned blocks

The following metrics in QUAST are new, but they have similarities with GAGE’s ‘corrected Nx’ (Salzberg et al., 2011), Assemblathon’s ‘contig path Nx over alignment graph’ (Earl et al., 2011) and the ‘normalized N50’ metric (Makinen et al., 2012). We give short descriptions of these metrics here; see the Supplementary Methods for more detailed information.

NAx (A stands for aligned; x ranges from 0 to 100): This is a combination of the well-known Nx metric and Plantagora’s number-of-misassemblies metric. It is computed in two steps. First, we break the contigs into aligned blocks: if a contig has misassembly breakpoints (following the Plantagora definition above), it is broken into multiple blocks at these breakpoints; additionally, if there are unaligned regions within a contig, these regions are removed and the contig is split into blocks. Second, we compute the ordinary Nx statistic on these blocks instead of on the original contigs.

NGAx: We break contigs into aligned blocks as described for NAx, and then we compute the NGx statistic (instead of Nx) on these blocks.

Both the NAx and NGAx metrics require a reference genome. If the reference genome is different from the sample being assembled, some breakpoints and indels may represent true structural differences rather than misassemblies.
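
Putting these definitions together, a simplified sketch of NAx and NGAx (not QUAST's code; the block coordinates and reference length below are hypothetical) reduces to splitting each contig into aligned block lengths and applying the ordinary Nx/NGx computation to those blocks.

```python
# Sketch of NAx/NGAx (not QUAST's code): split each contig into aligned blocks
# at misassembly breakpoints, drop unaligned stretches, then run the ordinary
# Nx/NGx computation on the block lengths instead of whole-contig lengths.

def nx(lengths, x, denominator=None):
    denominator = sum(lengths) if denominator is None else denominator
    running = 0
    for l in sorted(lengths, reverse=True):
        running += l
        if running >= denominator * x / 100.0:
            return l
    return 0

def aligned_block_lengths(per_contig_blocks):
    """per_contig_blocks: for each contig, the (start, end) contig coordinates
    of its aligned blocks, already split at breakpoints/unaligned regions."""
    return [end - start for blocks in per_contig_blocks for start, end in blocks]

# Hypothetical example: a clean contig, a contig broken in two by a
# misassembly, and a contig whose unaligned middle section was removed.
blocks = aligned_block_lengths([
    [(0, 120_000)],
    [(0, 70_000), (70_000, 110_000)],
    [(0, 30_000), (45_000, 60_000)],
])
print(nx(blocks, 50))                        # NA50  = 70000
print(nx(blocks, 50, denominator=400_000))   # NGA50 = 40000 (400 kb reference)
```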

2.2 Visualization

QUAST presents a number of statistics in graphical form and supports SVG, PNG and PDF output formats. The plots fall into several groups; sample plots from each group are presented in the Supplementary Material.

Fig. 1.

Alignment of single-cell E. coli assemblies to the reference genome. On all tracks, the x-axis is genome position. Top track: Read coverage on a logarithmic scale. The red curve shows coverage binned in 1000 bp windows. Blue positions on the x-axis have zero coverage, even if their bin has some coverage. Coverage is highly non-uniform, ranging from 0 to near 10 000. All other tracks: Comparison of positions of aligned contigs. Contigs that align correctly are coloured blue if the boundaries agree (within 2000 bp on each side) in at least half of the assemblies, and green otherwise. Contigs with misassemblies are broken into blocks and coloured orange if the boundaries agree in at least half of the assemblies, and red otherwise. Contigs are staggered vertically and are shown in different shades of their colour to distinguish the separate contigs, including small ones.

2.3 Comparing assemblers

In this study, we evaluated several of the leading genome assemblers on three datasets: Escherichia coli (a single-cell sample), Homo sapiens chromosome 14 and Bombus impatiens (the bumble bee, which did not have a finished assembly at the time of publication). The E. coli dataset and some of its assemblies are taken from Chitsaz et al. (2011); the SPAdes and IDBA-UD assemblies are new. All assemblies of H. sapiens and B. impatiens, as well as both datasets, are taken from Salzberg et al. (2011). In this article, we present some of QUAST’s comparison statistics and a sample plot comparing E. coli assemblies. See Supplementary Figures S3–S29 and Supplementary Tables S2–S8 for more plots and extended tables for E. coli and for comparisons of assemblers on the other two datasets.

2.3.1 Comparison of E. coli assemblies

The reference genome is E. coli str. K-12 substr. MG1655 (Blattner et al., 1997), available at the NCBI website. Gene annotations were taken from http://www.ecogene.org/.

We include several well-known assemblers designed for cultured bacterial datasets: EULER-SR (Pevzner et al., 2001), Velvet (Zerbino and Birney, 2008) and SOAPdenovo (Li et al., 2010). We also include several recently introduced assemblers that have been adapted or designed from scratch to handle single-cell datasets: Velvet-SC and EULER+Velvet-SC (Chitsaz et al., 2011), our assembler SPAdes (Bankevich et al., 2012), and IDBA-UD (Peng et al., 2012).

Table 1 shows that SPAdes and IDBA-UD have the best results in almost all metrics. IDBA-UD assembled the largest contig (224 018 bp) and has the smallest number of contigs (283), but SPAdes has a larger NGA50 than IDBA-UD (99 913 versus 90 607 bp) and assembled a higher percentage of the genome (96.99 versus 95.90%). SPAdes also assembled the highest number of complete genes (4071 of 4324), with IDBA-UD a close second (4030). However, both SPAdes and IDBA-UD have more misassemblies than the three Velvet-based assemblers.

Table 1. Comparison of assemblies of a single-cell sample of E. coli (for contigs ≥ 500 bp)

Assembler     No. of contigs   NGA50 (bp)   Largest (bp)   Total (bp)   Genome fraction (%)   No. of misassemblies   No. of complete genes
EULER-SR      610              26 580       140 518        4 306 898    86.54                 19                     3442
E+V-SC        396              32 051       132 865        4 555 721    93.58                 2                      3816
IDBA-UD       283              90 607       224 018        4 734 432    95.90                 9                      4030
SOAPdenovo    817              16 606       87 533         4 183 037    81.36                 6                      3060
SPAdes        532              99 913       211 020        4 975 641    96.99                 11                     4071
Velvet        310              22 648       132 865        3 517 182    75.53                 2                      3121
Velvet-SC     617              19 791       121 367        4 556 809    93.31                 2                      3662

The best value for each column is indicated in bold.

Figure 1 shows how the contigs align to the reference genome and reveals high similarity between some of the assemblies. E+V-SC, Velvet and Velvet-SC generated assemblies with dozens of similar contigs; this is natural because all of these assemblers are modifications of Velvet. The top track shows the read coverage along the genome. Velvet was not able to assemble low-coverage regions of the genome, whereas the assemblers designed for single-cell datasets (Velvet-SC, E+V-SC, SPAdes and IDBA-UD) did much better, although, of course, none of them can assemble the regions that literally have zero coverage.

3 CONCLUSION

Many assembly algorithms have been developed for the challenging problem of genome assembly from short reads. Our new open-access quality assessment tool, QUAST, will help scientists evaluate different assembly software and choose the best pipeline for their research, and it will help developers of genome assemblers improve their software and algorithms.

ACKNOWLEDGEMENTS

The authors would like to thank the SPAdes team (Bankevich et al., 2012) for productive collaboration, helpful comments and feedback on using our software. The authors are especially grateful to Andrey Prjibelski for his help in developing the plots in QUAST and to Dmitry Antipov for his help in testing QUAST.

Funding: Government of the Russian Federation (11.G34.31.0018); NIH (3P41RR024851-02S1).

Conflict of Interest: none declared.

REFERENCES

Bankevich,A. et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 455–477.

Barthelson,R. et al. (2011) Plantagora: modeling whole genome sequencing and assembly of plant genomes. PLoS One, 6, e28436.

Blattner,F.R. et al. (1997) The complete genome sequence of Escherichia coli K-12. Science, 277, 1453–1462.

Bohlin,J. et al. (2010) Analysis of intra-genomic GC content homogeneity within prokaryotes. BMC Genomics, 11, 464.

Chitsaz,H. et al. (2011) Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat. Biotechnol., 29, 915–921.

Earl,D. et al. (2011) Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res., 21, 2224–2241.

Kurtz,S. et al. (2004) Versatile and open software for comparing large genomes. Genome Biol., 5, R12.

Li,R. et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res., 20, 265–272.

Lukashin,A.V. and Borodovsky,M. (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res., 26, 1107–1115.

Majoros,W.H. et al. (2004) TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics, 20, 2878–2879.

Mäkinen,V. et al. (2012) Normalized N50 assembly metric using gap-restricted co-linear chaining. BMC Bioinformatics, 13, 255.

Peng,Y. et al. (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28, 1420–1428.

Pevzner,P.A. et al. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA, 98, 9748–9753.

Salzberg,S.L. et al. (2011) GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res., 22, 557–567.

Zerbino,D.R. and Birney,E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829.

Associate Editor: Michael Brudno

© The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
