GigaDB Dataset - DOI 10.5524/100344

In just over a decade, metagenomics has developed into a powerful and productive method in microbiology and microbial ecology. The ability to retrieve and organize bits and pieces of genomic DNA from any natural context has opened a window into the vast universe of uncultivated microbes. Tremendous progress has been made in computational approaches to interpret this sequence data but none can completely recover the complex information encoded in metagenomes. A number of challenges stand in the way. Simplifying assumptions are needed and lead to strong limitations and potential inaccuracies in practice. Critically, methodological improvements are difficult to gauge due to the lack of a general standard for comparison. Developers also face a substantial burden to individually evaluate existing approaches, which consumes time and computational resources, and may introduce unintended biases.

The Critical Assessment of Metagenome Interpretation (CAMI) is a community-led initiative that tackles these problems by aiming for an independent, comprehensive and bias-free evaluation of methods. In the first CAMI challenge, which ran from March to July 2015, it provided three simulated benchmark metagenome datasets of different organismal complexities and sizes. These were generated from ~700 newly sequenced genomes and ~600 circular elements (plasmids, viruses and other circular elements) that were not included in public databases during the challenge. The datasets are now available here, together with gold standards for assembly, genome and taxonomic binning, and taxonomic profiling; the underlying genome sequences; snapshots of the NCBI and ARB reference sequences from before the challenge; and the reference NCBI taxonomy used. In addition, three test (toy) data sets are provided that were simulated from public genomes before the challenge. For the most realistic evaluation of reference-based methods (usually taxonomic binners and profilers) on the challenge data sets, the provided reference sequences, or other sequence collections from before the challenge, should be used as references, because all underlying genomes have since been deposited at NCBI or EBI.

Sample ID	Common Name	Scientific Name	Sample Attributes	Taxonomic ID
CAMI_low	synthetic metagenome	synthetic metagenome	Description: a 15 Gb single-sample dataset from a low-complexity community with a log-normal abundance distribution (40 genomes and 20 circular elements; not included in the reference sequence collections also provided in this archive)	1235509
CAMI_medium	synthetic metagenome	synthetic metagenome	Description: a 40 Gb differential log-normal abundance dataset with two samples of a medium-complexity community (132 genomes and 100 circular elements; not included in the reference sequence collections also provided in this archive), with long and short insert sizes	1235509
CAMI_high	synthetic metagenome	synthetic metagenome	Description: a 75 Gb time-series dataset with five samples from a high-complexity community with correlated log-normal abundance distributions (596 genomes and 478 circular elements; not included in the reference sequence collections also provided in this archive)	1235509
CAMI_TOY_low	synthetic metagenome	synthetic metagenome	Description: a toy data set simulated from public genomes. Can be used for testing tools (gold standards provided). THIS IS NOT A CHALLENGE DATA SET. Genomes: 30 (included in the reference sequence collections also provided in this archive), Total size: 15 Gbp, Read length: 2x100 bp, Insert size mean: 180 bp, Insert size stddev: 10%.	1235509
CAMI_TOY_medium	synthetic metagenome	synthetic metagenome	Description: a toy data set simulated from public genomes. Can be used for testing tools (gold standards provided). THIS IS NOT A CHALLENGE DATA SET. Two samples with differential abundance: 2 HiSeq (small insert size) 15 Gbp samples from 225 genomes (included in the reference sequence collections also provided in this archive). From the same two differential abundance community profiles, there are also 2 HiSeq (5 kb insert size) 0.75 Gbp samples.	1235509
CAMI_TOY_high	synthetic metagenome	synthetic metagenome	Description: a toy data set simulated from public genomes. Can be used for testing tools (gold standards provided). THIS IS NOT A CHALLENGE DATA SET. 5 HiSeq (small insert size) 15 Gbp samples (time series) from 450 genomes (included in the reference sequence collections also provided in this archive). Size: 15 Gbp per sample, Insert size mean: 180 bp, Insert size stddev: 18 bp, Read length: 2x100 bp.	1235509
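
The simulation parameters listed for the toy data sets can be turned into rough expectations before downloading. The following is a minimal sketch, assuming that the stated total size counts sequenced read bases and that "2x100 bp" means paired-end reads of 100 bp each; the function names are hypothetical and not part of any CAMI tool.

```python
def estimate_read_pairs(total_bases, read_length, reads_per_pair=2):
    """Approximate number of read pairs in a sample.

    Assumes total_bases counts all sequenced bases and every read
    has exactly read_length bases (an idealisation).
    """
    return total_bases // (read_length * reads_per_pair)


def insert_size_stddev_bp(mean_bp, stddev_percent):
    """Convert a relative insert-size stddev (in percent) to base pairs."""
    return mean_bp * stddev_percent / 100


# CAMI_TOY_low as described in the table: 15 Gbp, 2x100 bp, 180 bp +/- 10%
pairs = estimate_read_pairs(15_000_000_000, 100)
stddev = insert_size_stddev_bp(180, 10)
print(pairs, stddev)
```

Under these assumptions, a 10% standard deviation on a 180 bp mean insert size corresponds to the 18 bp stated for CAMI_TOY_high.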

File Name	Description	Data Type	File Format	Size	Release Date	MD5 Checksum
readme_100344.txt		Readme	TEXT	5.72 kB	2019-02-26	MD5 checksum: af5136c9a5e26b4a11b10233da72de6a
CAMI_high.tar	This tar-ball includes the CAMI_high challenge data set and associated information, such as the FASTQ file of the reads for the individual samples, the gold standard assembly, binning and profiling and the genomes from which the samples were simulated.	Mixed archive	TAR	53.34 GB	2017-08-11	MD5 checksum: d7a3a64d9d461ddde15615d1cc71aab7
CAMI_low.tar	This tar-ball includes the CAMI_low challenge data set and associated information, such as the FASTQ file of the reads for the individual samples, the gold standard assembly, binning and profiling and the genomes from which the samples were simulated.	Mixed archive	TAR	11.25 GB	2017-08-11	MD5 checksum: 0c2259e190d308a3f3006910f5086632
CAMI_medium.tar	This tar-ball includes the CAMI_medium challenge data set and associated information, such as the FASTQ file of the reads for the individual samples, the gold standard assembly, binning and profiling and the genomes from which the samples were simulated.	Mixed archive	TAR	28.41 GB	2017-08-11	MD5 checksum: f15dfeadbe7536ad410ba8f3129ae746
taxonomy.tar.gz	Taxonomy database as of 2015/06/22 to be used for CAMI challenge datasets	Mixed archive	TAR	26.47 MB	2017-08-11	MD5 checksum: 7752be09d97662b48e10b501901d418d
camiClient_taxdb.tar.gz	Taxonomy database to be used for cami upload client	Mixed archive	TAR	94.65 MB	2017-08-11	MD5 checksum: 238b36b6e8febd068b35715e655f9182
225_genomes.tar	This tar-ball includes all CAMI_TOY_medium data sets, such as the FASTQ file of the reads, the gold standard assembly, binning and profiling.	Mixed archive	TAR	33.56 GB	2017-08-11	MD5 checksum: 4cc0db87294a8b09b7615ae4b591881a
30_genomes.tar	This tar-ball includes all CAMI_TOY_low data sets, such as the FASTQ file of the reads, the gold standard assembly, binning and profiling.	Mixed archive	TAR	15.41 GB	2017-08-11	MD5 checksum: eb14193df45fe4fa9901202269ad573d
450_genomes.tar	This tar-ball includes all CAMI_TOY_high data sets, such as the FASTQ file of the reads, the gold standard assembly, binning and profiling.	Mixed archive	TAR	80.53 GB	2017-08-11	MD5 checksum: 6815d7c5f74942484a2cc95bb878f1a7
PROCESSED_NCBI.tar	This tar-ball is a copy of the NCBI Refseq and Taxonomy Database as of 2015/06/22. This database should be used as a basis for reference based binning and profiling tools for the CAMI challenge datasets.	Mixed archive	TAR	158.03 GB	2017-08-11	MD5 checksum: d7f20a92c76458cc885344f52a28b1d4
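
Because the archives are large (up to 158.03 GB), it can be worth checking a download against the MD5 checksum listed in the table before unpacking it. The following is a minimal sketch using only the Python standard library; the function names and the local file path are assumptions for illustration, not part of the GigaDB or CAMI tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that very large archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file matches the expected checksum.

    Accepts either a bare hex digest or the table's
    'MD5 checksum: <digest>' form.
    """
    expected = expected_md5.split(":")[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, assuming CAMI_low.tar has been saved to the current directory, `verify_download("CAMI_low.tar", "0c2259e190d308a3f3006910f5086632")` should return `True` for an intact download.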

Date	Action
August 11, 2017	Dataset publish
February 26, 2019	File program_results.tar.gz updated
February 26, 2019	File readme_100344.txt updated
February 26, 2019	readme_100344.txt: file attribute updated
February 26, 2019	File readme_100344.txt updated
February 26, 2019	program_results.tar.gz: additional file attribute added
February 26, 2019	File program_results.tar.gz updated
June 14, 2023	Relationship added: DOI 102408