MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph (original) (raw)

Journal Article

,

1HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong, 2L3 Bioinformatics Limited, Hong Kong and 3National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

Search for other works by this author on:

,

1HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong, 2L3 Bioinformatics Limited, Hong Kong and 3National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

Search for other works by this author on:

,

1HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong, 2L3 Bioinformatics Limited, Hong Kong and 3National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

Search for other works by this author on:

,

1HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong, 2L3 Bioinformatics Limited, Hong Kong and 3National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

Search for other works by this author on:

1HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong, 2L3 Bioinformatics Limited, Hong Kong and 3National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

1HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong, 2L3 Bioinformatics Limited, Hong Kong and 3National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

*To whom correspondence should be addressed.

Search for other works by this author on:

†The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.

Associate Editor: Inanc Birol

Author Notes

Received:

25 September 2014

Revision received:

17 December 2014

Accepted:

14 January 2015

Published:

20 January 2015

Cite

Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, Tak-Wah Lam, MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph, Bioinformatics, Volume 31, Issue 10, May 2015, Pages 1674–1676, https://doi.org/10.1093/bioinformatics/btv033
Close

Navbar Search Filter Mobile Enter search term Search

Abstract

Summary: MEGAHIT is a NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset with 252 Gbps in 44.1 and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MEGAHIT assembles the data as a whole, i.e. no pre-processing like partitioning and normalization was needed. When compared with previous methods on assembling the soil data, MEGAHIT generated a three-time larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, giving a fourfold improvement.

Availability and implementation: The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under GPLv3 license.

Contact: rb@l3-bioinfo.com or twlam@cs.hku.hk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Next generation sequencing technologies have offered new opportunities to study metagenomics and understand various microbial communities such as human guts, rumen and soil. Due to the lack of reference genomes, de novo assembly of metagenomics data (short reads) is a beneficial and almost inevitable step for metagenomics analysis (Qin et al., 2010). This step is, however, constrained by the heavy requirement of computational resources, especially for large and complex datasets encountered in environmental metagenomics (Howe et al., 2014). The soil metagenomics dataset recently published by Howe et al. comprises 252 Gbp even after trimming low quality bases. The dataset was successfully assembled with pre-processing steps including partitioning and digital normalization. At present no de novo assembler can assemble the data as a whole using a feasible amount of computer memory. Estimated memory requirement for SOAPdenovo2 (Luo et al., 2012) and IDBA-UD (Peng et al., 2012) to assemble the soil data is at least 4 TB. As the volume of metagenomics data keeps growing, we are motivated to develop MEGAHIT, an assembler that can assemble large and complex metagenomics data in a time- and cost-efficient manner, especially on a single-node server (current maximum memory capacity 768 GB for a 2-socket server).

2 Methods

MEGAHIT makes use of succinct de Bruijn graphs (SdBG; Bowe et al., 2012), which are compressed representation of de Bruijn graphs. A SdBG encodes a graph with m edges in O(m) bits, and supports O(1) time traversal from a vertex to its neighbors. Our implementation has added a bit-vector of length m to mark the validity of each edge (so as to support dynamic removal of edges efficiently), and an auxiliary vector of 2kt bits (where k is the k-mer size and t is the number of zero-indegree vertices) to store the sequence of zero-indegree vertices to ensure the graph being lossless.

Despite its advantages, constructing a SdBG efficiently is non-trivial. MEGAHIT is rooted in a fast parallel algorithm for SdBG construction; the bottleneck is sorting a set of (k+1)-mers that are the edges of an SdBG in reverse lexicographical order of their length-k prefixes (_k_-mers). MEGAHIT exploits the parallelism of a graphics processing unit (GPU, CUDA-enabled) by adapting the recent BWT-construction algorithm CX1 (Liu et al., 2014), which takes advantage of a GPU to sort the suffices of a set of reads very efficiently. Limited by the relatively small size of GPU’s on-board memory, we adopt a block-wise strategy that partitions the _k_-mers according to their length-l prefix (l = 8 in our implementation). The _k_-mers in consecutive partitions that fit within the GPU memory are sorted together. Leveraging the parallelism of GPU, MEGAHIT speeds up the construction by 3–5 times over its CPU-only counterpart.

Notably, sequencing error is problematic, because a single base of sequencing error leads to k erroneous _k_-mer singletons, which increases the memory consumption of MEGAHIT significantly. To cope with the problem, before graph construction, all (k + 1)-mers from the input reads are sorted and counted, and only (k + 1)-mers that appear at least d (2 by default) times are kept as solid-kmer. This method removes many spurious edges, but may be risky for metagenomics assembly since many low-abundance species may have been sequenced at very low depth. Thus we introduce a mercy-kmer strategy to recover these low-depth edges. Given two solid (k + 1)-mers x and y from the same read, where x has no outdegree and y has no indegree. If all (k + 1)-mers between x and y in that read are not solid, they will be added to the de Bruijn graph as mercy-kmers. Mercy-kmers strengthen the contiguity of low-depth regions. Without this approach, many authentic low-depth edges would be incorrectly identified as tips and removed.

Based on SdBG, we implemented a multiple _k_-mer size strategy in MEGAHIT (Peng et al., 2012). The method iteratively builds multiple SdBGs from a small k to a large k. While a small _k_-mer size is favourable for filtering erroneous edges and filling gaps in low-coverage regions, a large _k_-mer size is useful for resolving repeats. In each iteration, MEGAHIT cleans potentially erroneous edges by removing tips, merging bubbles and removing low local coverage edges. The last approach is especially useful for metagenomics, which suffers from non-uniform sequencing depths. The overall workflow of MEGAHIT is shown in Figure 1.

The workflow of MEGAHIT

Fig. 1.

The workflow of MEGAHIT

3 Results

Table 1 compares the performance of MEGAHIT with SPAdes (Bankevich et al., 2012) on three subsets (100-fold, 20-fold and 10-fold) of an E. coli MG1655 dataset. QUAST (Gurevich et al., 2013) was used to evaluate the assembled contigs (Table 1). MEGAHIT (CPU version) is six times faster than SPAdes, and performs well even on the low-coverage subset.

Table 1.

Performance of MEGAHIT and SPAdes on the E.coli dataset

MEGAHIT 100× MEGAHIT 20× MEGAHIT 10× SPAdes 10×
N50 (bp) 73 736 52 352 9067 18 264
Largest alignment (bp) 221k 178k 31k 62k
bp in contigs > = 1 kbp 4.55 M 4.55 M 4.52 M 4.55 M
Genome fraction 98.0% 98.1% 97.4% 97.9%
Misassemblies (bp) 2k 41k 81k 64k
Wall time (s) 185 82 47 318
MEGAHIT 100× MEGAHIT 20× MEGAHIT 10× SPAdes 10×
N50 (bp) 73 736 52 352 9067 18 264
Largest alignment (bp) 221k 178k 31k 62k
bp in contigs > = 1 kbp 4.55 M 4.55 M 4.52 M 4.55 M
Genome fraction 98.0% 98.1% 97.4% 97.9%
Misassemblies (bp) 2k 41k 81k 64k
Wall time (s) 185 82 47 318

MEGAHIT: CPU version, options ‘--k-min 21 --k-max 81 -m 1 000 000 000’; SPAdes and QUAST was run with default parameters.

Table 1.

Performance of MEGAHIT and SPAdes on the E.coli dataset

MEGAHIT 100× MEGAHIT 20× MEGAHIT 10× SPAdes 10×
N50 (bp) 73 736 52 352 9067 18 264
Largest alignment (bp) 221k 178k 31k 62k
bp in contigs > = 1 kbp 4.55 M 4.55 M 4.52 M 4.55 M
Genome fraction 98.0% 98.1% 97.4% 97.9%
Misassemblies (bp) 2k 41k 81k 64k
Wall time (s) 185 82 47 318
MEGAHIT 100× MEGAHIT 20× MEGAHIT 10× SPAdes 10×
N50 (bp) 73 736 52 352 9067 18 264
Largest alignment (bp) 221k 178k 31k 62k
bp in contigs > = 1 kbp 4.55 M 4.55 M 4.52 M 4.55 M
Genome fraction 98.0% 98.1% 97.4% 97.9%
Misassemblies (bp) 2k 41k 81k 64k
Wall time (s) 185 82 47 318

MEGAHIT: CPU version, options ‘--k-min 21 --k-max 81 -m 1 000 000 000’; SPAdes and QUAST was run with default parameters.

To evaluate the performance on large scale metagenomics data, we assembled an Iowa prairie soil metagenomics dataset that comprises 3.3 billion reads totaling 252 billion base-pairs (Howe et al., 2014) using MEGAHIT and Minia, another memory-efficient assembler (Chikhi and Rizk, 2012). The assembly conducted by Howe et al. was included for comparison (Table 2). On a server with 384 GB memory, MEGAHIT took 44.1 h, ∼7 times faster than Minia. It reached peak memory consumption at 345 GB during _k_-mer counting and SdBG construction; this matches the expectation since MEGAHIT’s sorting module automatically adjusts to fully utilize all available memory in a server. Notably, MEGAHIT can assemble this dataset with as little as 260 GB memory, using 55.3 h (Supplementary Section 4).

Table 2.

Summary statistics for MEGAHIT, Howe et al. and Minia

MEGAHIT Howe et al. Minia
Wall time (h) 44.1 >488 331.4
Peak memory (GB) 345 287 29
Total size (Mbp) 4902 1503 1490
Average length (bp) 633 485 505
N50 (bp) 657 471 488
Longest (bp) 184 210 9397 32 679
# of contigs 7 749 211 3 096 464 2 951 575
# of contigs ≥ 1kbp 841 257 129 513 158 402
MEGAHIT Howe et al. Minia
Wall time (h) 44.1 >488 331.4
Peak memory (GB) 345 287 29
Total size (Mbp) 4902 1503 1490
Average length (bp) 633 485 505
N50 (bp) 657 471 488
Longest (bp) 184 210 9397 32 679
# of contigs 7 749 211 3 096 464 2 951 575
# of contigs ≥ 1kbp 841 257 129 513 158 402

MEGAHIT utilizes all 24 CPU threads with options ‘--k-min 27 --k-max 87 --k-step 10 -m 370 000 000 000’. The wall time for CPU version of MEGAHIT is 99.4 h. Minia does not support multi-threads; it was run with k = 31 and min_abundance = 2. The time and memory of Howe et al. were excerpted from the paper; the time accounts for digital normalization and partitioning only.

Table 2.

Summary statistics for MEGAHIT, Howe et al. and Minia

MEGAHIT Howe et al. Minia
Wall time (h) 44.1 >488 331.4
Peak memory (GB) 345 287 29
Total size (Mbp) 4902 1503 1490
Average length (bp) 633 485 505
N50 (bp) 657 471 488
Longest (bp) 184 210 9397 32 679
# of contigs 7 749 211 3 096 464 2 951 575
# of contigs ≥ 1kbp 841 257 129 513 158 402
MEGAHIT Howe et al. Minia
Wall time (h) 44.1 >488 331.4
Peak memory (GB) 345 287 29
Total size (Mbp) 4902 1503 1490
Average length (bp) 633 485 505
N50 (bp) 657 471 488
Longest (bp) 184 210 9397 32 679
# of contigs 7 749 211 3 096 464 2 951 575
# of contigs ≥ 1kbp 841 257 129 513 158 402

MEGAHIT utilizes all 24 CPU threads with options ‘--k-min 27 --k-max 87 --k-step 10 -m 370 000 000 000’. The wall time for CPU version of MEGAHIT is 99.4 h. Minia does not support multi-threads; it was run with k = 31 and min_abundance = 2. The time and memory of Howe et al. were excerpted from the paper; the time accounts for digital normalization and partitioning only.

To be consistent with Howe’s analysis, we only considered contigs ≥ 300 bp for further analysis. The contigs produced by MEGAHIT had a total size at least three times larger than by other methods, and achieved better statistics on N50, average length, and the number of long contigs (length ≥ 1000 bp). Thus MEGAHIT gives better assembly contiguity. Raw reads were aligned back to the assembled contigs using Bowtie2 (Langmead and Salzberg, 2012). As shown in Table 3, MEGAHIT gets > 4 times more reads mapped and 5–6 times more read pairs properly aligned. 37% of distinct 17-mers appeared ≥ 2 in the assembly, which might imply that MEGAHIT did a better job in recovering low-abundance subspecies in ultra-diversified metagenomics (Supplementary Fig. S3).

Table 3.

Alignment statistics of MEGAHIT, Howe et al. and Minia

MEGAHIT Howe et al. Minia
Total # of reads 3 252 369 195
Reads overall aligned (%) 55.81 10.72 13.03
Total # of SE reads 356 742 333
SE aligned 1 time (%) 37.00 8.72 12.38
SE aligned > 1 time (%) 14.68 0.32 0.02
Total # of PE reads 1 447 813 431
PE p. aligned 1 time (%) 36.78 7.41 9.48
PE p. aligned > 1 time (%) 8.90 0.20 0.01
PE improperly aligned (%) 2.67 0.54 0.82
MEGAHIT Howe et al. Minia
Total # of reads 3 252 369 195
Reads overall aligned (%) 55.81 10.72 13.03
Total # of SE reads 356 742 333
SE aligned 1 time (%) 37.00 8.72 12.38
SE aligned > 1 time (%) 14.68 0.32 0.02
Total # of PE reads 1 447 813 431
PE p. aligned 1 time (%) 36.78 7.41 9.48
PE p. aligned > 1 time (%) 8.90 0.20 0.01
PE improperly aligned (%) 2.67 0.54 0.82

SE, single-end; PE, paired-end; p., properly; Bowtie2 were run with ‘-L 27’.

Table 3.

Alignment statistics of MEGAHIT, Howe et al. and Minia

MEGAHIT Howe et al. Minia
Total # of reads 3 252 369 195
Reads overall aligned (%) 55.81 10.72 13.03
Total # of SE reads 356 742 333
SE aligned 1 time (%) 37.00 8.72 12.38
SE aligned > 1 time (%) 14.68 0.32 0.02
Total # of PE reads 1 447 813 431
PE p. aligned 1 time (%) 36.78 7.41 9.48
PE p. aligned > 1 time (%) 8.90 0.20 0.01
PE improperly aligned (%) 2.67 0.54 0.82
MEGAHIT Howe et al. Minia
Total # of reads 3 252 369 195
Reads overall aligned (%) 55.81 10.72 13.03
Total # of SE reads 356 742 333
SE aligned 1 time (%) 37.00 8.72 12.38
SE aligned > 1 time (%) 14.68 0.32 0.02
Total # of PE reads 1 447 813 431
PE p. aligned 1 time (%) 36.78 7.41 9.48
PE p. aligned > 1 time (%) 8.90 0.20 0.01
PE improperly aligned (%) 2.67 0.54 0.82

SE, single-end; PE, paired-end; p., properly; Bowtie2 were run with ‘-L 27’.

4 Conclusions

MEGAHIT enables an efficient assembly of large and complex metagenomics data on a single server, while giving better completeness and contiguity. MEGAHIT is available in both CPU-only and GPU-accelerated versions. With GPU, the assembly time of the soil dataset is shortened from 4 days to less than 2 days.

Acknowledgements

The authors thank S.M. Yiu, C.M. Leung and Y. Peng for the detailed explanation about IDBA-UD. The authors also thank C. Titus Brown for providing the open evaluation with the E.coli data (Table 1).

Funding

This work was funded by Hong Kong GRF (General Research Fund) HKU-713512E and ITF (Innovation and Technology Fund) GHP/011/12. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of Interest: None declared.

References

et al. . (

2012

)

SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing

.

J. Comput. Biol.

,

19

,

455

477

.

et al. . (

2012

)

Succinct de Bruijn Graphs

. In: (eds.)

Algorithms in Bioinformatics

.

Springer

,

Berlin

, pp.

225

235

.

(

2012

)

Space-efficient and exact de Bruijn graph representation based on a bloom filter

. In: (eds.),

Algorithms in Bioinformatics

.

Springer

,

Berlin

, pp.

236

248

.

et al. . (

2013

)

QUAST: quality assessment tool for genome assemblies

.

Bioinformatics

,

29

,

1072

1075

.

et al. . (

2014

)

Tackling soil diversity with the assembly of large, complex metagenomes

.

Proc. Natl Acad. Sci. USA

,

111

,

4904

4909

.

(

2012

)

Fast gapped-read alignment with Bowtie 2

.

Nat. Methods

,

9

,

357

359

.

et al. . (

2014

)

GPU-accelerated BWT construction for large collection of short reads

.

arXiv

:

1401.7457

.

et al. . (

2012

)

SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

.

GigaScience

,

1

,

18

.

et al. . (

2012

)

IDBA-UD: a de novo assembler for single-cell and meta-genomic sequencing data with highly uneven depth

.

Bioinformatics

,

28

,

1420

1428

.

et al. . (

2010

)

A human gut microbial gene catalogue established by metagenomic sequencing

.

Nature

,

464

,

59

65

.

Author notes

†The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.

Associate Editor: Inanc Birol

© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

Supplementary data

Citations

Views

Altmetric

Metrics

Total Views 78,689

61,601 Pageviews

17,088 PDF Downloads

Since 11/1/2016

Month: Total Views:
November 2016 38
December 2016 30
January 2017 113
February 2017 170
March 2017 236
April 2017 161
May 2017 239
June 2017 190
July 2017 219
August 2017 194
September 2017 227
October 2017 216
November 2017 239
December 2017 366
January 2018 412
February 2018 362
March 2018 501
April 2018 424
May 2018 350
June 2018 469
July 2018 456
August 2018 425
September 2018 423
October 2018 456
November 2018 463
December 2018 355
January 2019 441
February 2019 428
March 2019 598
April 2019 601
May 2019 620
June 2019 569
July 2019 876
August 2019 763
September 2019 982
October 2019 929
November 2019 810
December 2019 685
January 2020 783
February 2020 1,022
March 2020 893
April 2020 750
May 2020 612
June 2020 954
July 2020 1,024
August 2020 678
September 2020 759
October 2020 791
November 2020 687
December 2020 687
January 2021 524
February 2021 618
March 2021 820
April 2021 802
May 2021 753
June 2021 786
July 2021 891
August 2021 742
September 2021 1,084
October 2021 1,162
November 2021 1,176
December 2021 1,032
January 2022 1,107
February 2022 1,247
March 2022 1,320
April 2022 1,418
May 2022 1,379
June 2022 1,036
July 2022 1,061
August 2022 1,056
September 2022 1,020
October 2022 1,200
November 2022 1,159
December 2022 944
January 2023 906
February 2023 1,185
March 2023 1,300
April 2023 1,277
May 2023 1,283
June 2023 998
July 2023 1,164
August 2023 1,138
September 2023 1,195
October 2023 1,380
November 2023 1,241
December 2023 1,242
January 2024 1,566
February 2024 1,693
March 2024 2,009
April 2024 1,517
May 2024 1,511
June 2024 1,248
July 2024 1,358
August 2024 1,295
September 2024 1,319
October 2024 851

×

Email alerts

Citing articles via

More from Oxford Academic