MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Journal Article

(1) HKU-BGI Bioinformatics Algorithms Research Laboratory & Department of Computer Science, University of Hong Kong, Hong Kong, (2) L3 Bioinformatics Limited, Hong Kong and (3) National Institute of Informatics, Chiyoda-ku, Tokyo, Japan


*To whom correspondence should be addressed.


†The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.

Associate Editor: Inanc Birol

Received: 25 September 2014

Revision received: 17 December 2014

Accepted: 14 January 2015

Published: 20 January 2015


Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, Tak-Wah Lam, MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph, Bioinformatics, Volume 31, Issue 10, May 2015, Pages 1674–1676, https://doi.org/10.1093/bioinformatics/btv033

Abstract

Summary: MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset of 252 Gbp in 44.1 and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MEGAHIT assembles the data as a whole, i.e. no pre-processing such as partitioning or normalization was needed. Compared with previous methods for assembling the soil data, MEGAHIT generated a threefold larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, a fourfold improvement.

Availability and implementation: The source code of MEGAHIT is freely available at https://github.com/voutcn/megahit under the GPLv3 license.

Contact: rb@l3-bioinfo.com or twlam@cs.hku.hk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Next-generation sequencing technologies have offered new opportunities to study metagenomics and understand various microbial communities such as those of the human gut, rumen and soil. Due to the lack of reference genomes, de novo assembly of metagenomics data (short reads) is a beneficial and almost inevitable step for metagenomics analysis (Qin et al., 2010). This step is, however, constrained by its heavy demand for computational resources, especially for the large and complex datasets encountered in environmental metagenomics (Howe et al., 2014). The soil metagenomics dataset recently published by Howe et al. comprises 252 Gbp even after trimming low-quality bases. The dataset was successfully assembled only with pre-processing steps including partitioning and digital normalization. At present, no de novo assembler can assemble the data as a whole using a feasible amount of computer memory. The estimated memory requirement for SOAPdenovo2 (Luo et al., 2012) and IDBA-UD (Peng et al., 2012) to assemble the soil data is at least 4 TB. As the volume of metagenomics data keeps growing, we were motivated to develop MEGAHIT, an assembler that can assemble large and complex metagenomics data in a time- and cost-efficient manner, especially on a single-node server (current maximum memory capacity of 768 GB for a 2-socket server).

2 Methods

MEGAHIT makes use of succinct de Bruijn graphs (SdBG; Bowe et al., 2012), a compressed representation of de Bruijn graphs. An SdBG encodes a graph with m edges in O(m) bits, and supports O(1) time traversal from a vertex to its neighbors. Our implementation adds a bit-vector of length m to mark the validity of each edge (so as to support dynamic removal of edges efficiently), and an auxiliary vector of 2kt bits (where k is the k-mer size and t is the number of zero-indegree vertices) that stores the sequences of the zero-indegree vertices so that the graph is lossless.
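The two additions described above can be illustrated with a minimal sketch. It is not MEGAHIT's implementation: the class name `EdgeValidity` and the function `auxiliary_bits` are hypothetical, and the 2-bits-per-base reading of the 2kt figure is an assumption based on the four-letter DNA alphabet.

```python
class EdgeValidity:
    """Bit-vector with one bit per edge; a set bit means the edge is still present."""

    def __init__(self, num_edges: int) -> None:
        self.num_edges = num_edges
        # All edges start out valid: fill every byte with 1-bits.
        self.bits = bytearray(b"\xff" * ((num_edges + 7) // 8))

    def remove(self, edge_id: int) -> None:
        # Clear the bit for this edge; no other part of the graph is touched.
        self.bits[edge_id >> 3] &= ~(1 << (edge_id & 7)) & 0xFF

    def is_valid(self, edge_id: int) -> bool:
        return bool((self.bits[edge_id >> 3] >> (edge_id & 7)) & 1)


def auxiliary_bits(k: int, t: int) -> int:
    """Size of the auxiliary vector, assuming 2 bits per base, k bases, t vertices."""
    return 2 * k * t


validity = EdgeValidity(10)
validity.remove(3)                 # e.g. an edge pruned during graph cleaning
aux = auxiliary_bits(21, 1000)     # bits needed for 1000 zero-indegree vertices at k = 21
```

Marking an edge as removed costs a single bit flip, which is why a separate validity vector suits a static succinct structure that cannot be edited cheaply in place.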

Despite its advantages, constructing an SdBG efficiently is non-trivial. MEGAHIT is rooted in a fast parallel algorithm for SdBG construction; the bottleneck is sorting a set of (k+1)-mers that are the edges of an SdBG in reverse lexicographical order of their length-k prefixes (k-mers). MEGAHIT exploits the parallelism of a graphics processing unit (GPU, CUDA-enabled) by adapting the recent BWT-construction algorithm CX1 (Liu et al., 2014), which takes advantage of a GPU to sort the suffixes of a set of reads very efficiently. Limited by the relatively small size of the GPU's on-board memory, we adopt a block-wise strategy that partitions the k-mers according to their length-l prefix (l = 8 in our implementation). The k-mers in consecutive partitions that fit within the GPU memory are sorted together. Leveraging the parallelism of the GPU, MEGAHIT speeds up the construction by 3–5 times over its CPU-only counterpart.
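The block-wise strategy can be sketched on the CPU as follows. This is only an illustration under stated assumptions: `blockwise_sort` and its `budget` parameter (a stand-in for the GPU memory limit, counted in k-mers) are hypothetical names, plain lexicographical order is used instead of the reverse order needed for SdBG construction, and Python's built-in sort replaces the GPU sort.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def blockwise_sort(kmers: Iterable[str], prefix_len: int, budget: int) -> List[str]:
    """Sort k-mers by partitioning on a fixed-length prefix and sorting in batches."""
    partitions: Dict[str, List[str]] = defaultdict(list)
    for kmer in kmers:
        partitions[kmer[:prefix_len]].append(kmer)

    result: List[str] = []
    batch: List[str] = []
    # Visit partitions in prefix order, so that consecutive batches are
    # already ordered relative to each other.
    for prefix in sorted(partitions):
        part = partitions[prefix]
        # Flush the current batch if this partition would not fit with it.
        # (A single partition larger than the budget is still sorted alone.)
        if batch and len(batch) + len(part) > budget:
            result.extend(sorted(batch))
            batch = []
        batch.extend(part)
    result.extend(sorted(batch))
    return result


example = ["ACGT", "TTGA", "ACCA", "GGTA", "ACGA", "TTAA", "GGCC"]
ordered = blockwise_sort(example, prefix_len=2, budget=3)
```

Because every k-mer in an earlier partition precedes every k-mer in a later one, sorting each batch independently and concatenating the results yields the same order as one global sort, while each batch stays within the memory budget.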

Notably, sequencing errors are problematic, because a single erroneous base leads to k erroneous k-mer singletons, which increases the memory consumption of MEGAHIT significantly. To cope with this problem, before graph construction, all (k + 1)-mers from the input reads are sorted and counted, and only (k + 1)-mers that appear at least d (2 by default) times are kept as solid-kmers. This method removes many spurious edges, but may be risky for metagenomics assembly, since many low-abundance species may have been sequenced at very low depth. We therefore introduce a mercy-kmer strategy to recover these low-depth edges. Consider two solid (k + 1)-mers x and y from the same read, where x has no outdegree and y has no indegree. If none of the (k + 1)-mers between x and y in that read is solid, they are all added to the de Bruijn graph as mercy-kmers. Mercy-kmers strengthen the contiguity of low-depth regions. Without this approach, many authentic low-depth edges would be incorrectly identified as tips and removed.
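A small sketch of the solid-kmer filter and the mercy-kmer rule follows. It rests on several assumptions that are not stated in the text: an edge (k+1)-mer is taken to run from its length-k prefix to its length-k suffix, "x has no outdegree" is read as "the end vertex of x has no outgoing solid edge" (and symmetrically for y), reverse complements are ignored, and all function names are hypothetical.

```python
from collections import Counter
from typing import List, Set


def edges_of(read: str, k: int) -> List[str]:
    """All (k+1)-mers of a read, in read order."""
    return [read[i:i + k + 1] for i in range(len(read) - k)]


def solid_edges(reads: List[str], k: int, d: int = 2) -> Set[str]:
    """(k+1)-mers that occur at least d times across all reads."""
    counts = Counter(e for read in reads for e in edges_of(read, k))
    return {e for e, c in counts.items() if c >= d}


def mercy_edges(reads: List[str], k: int, solid: Set[str]) -> Set[str]:
    """Non-solid (k+1)-mers bridging a dead-end solid edge x and a source solid edge y."""
    has_out = {e[:k] for e in solid}   # vertices with an outgoing solid edge
    has_in = {e[1:] for e in solid}    # vertices with an incoming solid edge
    mercy: Set[str] = set()
    for read in reads:
        es = edges_of(read, k)
        last_solid = None              # index of the most recent solid edge
        for j, e in enumerate(es):
            if e not in solid:
                continue
            # A gap of non-solid edges lies between es[last_solid] and es[j].
            if last_solid is not None and j - last_solid > 1:
                x, y = es[last_solid], e
                if x[1:] not in has_out and y[:k] not in has_in:
                    mercy.update(es[last_solid + 1:j])
            last_solid = j
    return mercy


reads = ["AACCGGTTAC", "AACCG", "GTTAC"]
solid = solid_edges(reads, k=3)
mercy = mercy_edges(reads, k=3, solid=solid)
```

In this toy input the middle of the first read is covered only once, so its three (k+1)-mers fail the d = 2 filter; because the flanking solid edges are a dead end and a source respectively, the rule restores them and the read's path through the graph becomes contiguous again.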

Based on SdBG, we implemented a multiple k-mer size strategy in MEGAHIT (Peng et al., 2012). The method iteratively builds multiple SdBGs from a small k to a large k. While a small k-mer size is favourable for filtering erroneous edges and filling gaps in low-coverage regions, a large k-mer size is useful for resolving repeats. In each iteration, MEGAHIT cleans potentially erroneous edges by removing tips, merging bubbles and removing edges with low local coverage. The last approach is especially useful for metagenomics, which suffers from non-uniform sequencing depths. The overall workflow of MEGAHIT is shown in Figure 1.
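As a small worked example of the iteration from a small k to a large k, the sketch below lists the k-mer sizes visited, assuming they advance by a fixed step between a minimum and a maximum, as the option names `--k-min`, `--k-max` and `--k-step` reported with Table 2 suggest. The function name `k_schedule` is hypothetical.

```python
from typing import List


def k_schedule(k_min: int, k_max: int, k_step: int) -> List[int]:
    """k-mer sizes from k_min up to k_max (inclusive) in increments of k_step."""
    return list(range(k_min, k_max + 1, k_step))


# Values taken from the options reported for the soil dataset.
soil_ks = k_schedule(27, 87, 10)
```

With these values the sketch yields seven sizes, 27, 37, 47, 57, 67, 77 and 87, each of which would correspond to one round of graph construction and cleaning.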

Fig. 1. The workflow of MEGAHIT

3 Results

Table 1 compares the performance of MEGAHIT with SPAdes (Bankevich et al., 2012) on three subsets (100-fold, 20-fold and 10-fold) of an E. coli MG1655 dataset. QUAST (Gurevich et al., 2013) was used to evaluate the assembled contigs. MEGAHIT (CPU version) is six times faster than SPAdes, and performs well even on the low-coverage subset.

Table 1. Performance of MEGAHIT and SPAdes on the E. coli dataset

Metric	MEGAHIT 100×	MEGAHIT 20×	MEGAHIT 10×	SPAdes 10×
N50 (bp)	73 736	52 352	9067	18 264
Largest alignment (bp)	221k	178k	31k	62k
bp in contigs ≥ 1 kbp	4.55 M	4.55 M	4.52 M	4.55 M
Genome fraction	98.0%	98.1%	97.4%	97.9%
Misassemblies (bp)	2k	41k	81k	64k
Wall time (s)	185	82	47	318

MEGAHIT: CPU version, options ‘--k-min 21 --k-max 81 -m 1 000 000 000’; SPAdes and QUAST were run with default parameters.


To evaluate the performance on large-scale metagenomics data, we assembled an Iowa prairie soil metagenomics dataset that comprises 3.3 billion reads totaling 252 billion base-pairs (Howe et al., 2014) using MEGAHIT and Minia, another memory-efficient assembler (Chikhi and Rizk, 2012). The assembly conducted by Howe et al. was included for comparison (Table 2). On a server with 384 GB memory, MEGAHIT took 44.1 h, about 7 times faster than Minia. It reached a peak memory consumption of 345 GB during k-mer counting and SdBG construction; this matches expectation, since MEGAHIT’s sorting module automatically adjusts to fully utilize all available memory in a server. Notably, MEGAHIT can assemble this dataset with as little as 260 GB memory, using 55.3 h (Supplementary Section 4).

Table 2. Summary statistics for MEGAHIT, Howe et al. and Minia

Metric	MEGAHIT	Howe et al.	Minia
Wall time (h)	44.1	>488	331.4
Peak memory (GB)	345	287	29
Total size (Mbp)	4902	1503	1490
Average length (bp)	633	485	505
N50 (bp)	657	471	488
Longest (bp)	184 210	9397	32 679
# of contigs	7 749 211	3 096 464	2 951 575
# of contigs ≥ 1 kbp	841 257	129 513	158 402

MEGAHIT utilizes all 24 CPU threads with options ‘--k-min 27 --k-max 87 --k-step 10 -m 370 000 000 000’. The wall time for the CPU version of MEGAHIT is 99.4 h. Minia does not support multithreading; it was run with k = 31 and min_abundance = 2. The time and memory of Howe et al. were taken from their paper; the time accounts for digital normalization and partitioning only.


To be consistent with the analysis of Howe et al., we only considered contigs ≥ 300 bp for further analysis. The contigs produced by MEGAHIT had a total size at least three times larger than those of the other methods, and achieved better statistics on N50, average length and the number of long contigs (length ≥ 1000 bp). Thus MEGAHIT gives better assembly contiguity. Raw reads were aligned back to the assembled contigs using Bowtie2 (Langmead and Salzberg, 2012). As shown in Table 3, MEGAHIT gets > 4 times more reads mapped and 5–6 times more read pairs properly aligned. In addition, 37% of distinct 17-mers appeared at least twice in the assembly, which might imply that MEGAHIT did a better job of recovering low-abundance subspecies in ultra-diversified metagenomics (Supplementary Fig. S3).
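For readers unfamiliar with the contiguity statistics used here, the following sketch computes them from a list of contig lengths. It uses the common definition of N50 (the length L such that contigs of length L or longer account for at least half of the total assembly size); the function names and the toy lengths are illustrative, not data from the study.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Largest L such that contigs of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def contig_stats(lengths: List[int], min_len: int = 300, long_len: int = 1000) -> Dict[str, float]:
    """Summary statistics over contigs of at least min_len bp."""
    kept = [n for n in lengths if n >= min_len]
    return {
        "total_size": sum(kept),
        "average_length": sum(kept) / len(kept) if kept else 0.0,
        "n50": n50(kept),
        "num_contigs": len(kept),
        "num_long_contigs": sum(1 for n in kept if n >= long_len),
    }


stats = contig_stats([100, 300, 500, 1200])
```

In the toy input the 100 bp contig is dropped by the 300 bp cutoff; the remaining contigs total 2000 bp, and the single 1200 bp contig already covers half of that, so the N50 is 1200.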

Table 3. Alignment statistics of MEGAHIT, Howe et al. and Minia

Metric	MEGAHIT	Howe et al.	Minia
Total # of reads	3 252 369 195
Reads overall aligned (%)	55.81	10.72	13.03
Total # of SE reads	356 742 333
SE aligned 1 time (%)	37.00	8.72	12.38
SE aligned > 1 time (%)	14.68	0.32	0.02
Total # of PE reads	1 447 813 431
PE p. aligned 1 time (%)	36.78	7.41	9.48
PE p. aligned > 1 time (%)	8.90	0.20	0.01
PE improperly aligned (%)	2.67	0.54	0.82

SE, single-end; PE, paired-end; p., properly; Bowtie2 was run with ‘-L 27’.


4 Conclusions

MEGAHIT enables an efficient assembly of large and complex metagenomics data on a single server, while giving better completeness and contiguity. MEGAHIT is available in both CPU-only and GPU-accelerated versions. With GPU, the assembly time of the soil dataset is shortened from 4 days to less than 2 days.

Acknowledgements

The authors thank S.M. Yiu, C.M. Leung and Y. Peng for their detailed explanation of IDBA-UD. The authors also thank C. Titus Brown for providing the open evaluation with the E. coli data (Table 1).

Funding

This work was funded by Hong Kong GRF (General Research Fund) HKU-713512E and ITF (Innovation and Technology Fund) GHP/011/12. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of Interest: None declared.

References

Bankevich et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 455–477.

Bowe et al. (2012) Succinct de Bruijn Graphs. In: Algorithms in Bioinformatics. Springer, Berlin, pp. 225–235.

Chikhi and Rizk (2012) Space-efficient and exact de Bruijn graph representation based on a bloom filter. In: Algorithms in Bioinformatics. Springer, Berlin, pp. 236–248.

Gurevich et al. (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics, 1072–1075.

Howe et al. (2014) Tackling soil diversity with the assembly of large, complex metagenomes. Proc. Natl Acad. Sci. USA, 111, 4904–4909.

Langmead and Salzberg (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 357–359.

Liu et al. (2014) GPU-accelerated BWT construction for large collection of short reads. arXiv:1401.7457.

Luo et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience.

Peng et al. (2012) IDBA-UD: a de novo assembler for single-cell and meta-genomic sequencing data with highly uneven depth. Bioinformatics, 1420–1428.

Qin et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464.
