The sequence read archive: explosive growth of sequencing data

Journal Article

Yuichi Kodama*, Martin Shumway, Rasko Leinonen, on behalf of the International Nucleotide Sequence Database Collaboration

1Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems, Yata, Mishima 411-8540, Japan, 2National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and 3European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

*To whom correspondence should be addressed. Tel: +81 55 981 6853; Fax: +81 55 981 6849; Email: ykodama@genes.nig.ac.jp

Received: 15 September 2011

Accepted: 23 September 2011

Published: 18 October 2011

Yuichi Kodama, Martin Shumway, Rasko Leinonen, on behalf of the International Nucleotide Sequence Database Collaboration, The sequence read archive: explosive growth of sequencing data, Nucleic Acids Research, Volume 40, Issue D1, 1 January 2012, Pages D54–D56, https://doi.org/10.1093/nar/gkr854

Abstract

New generation sequencing platforms are producing data with significantly higher throughput and lower cost. A portion of this capacity is devoted to individual and community scientific projects. As these projects reach publication, raw sequencing datasets are submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). Archiving experimental data is the key to the progress of reproducible science. The SRA was established as a public repository for next-generation sequence data as a part of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC is composed of the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). The SRA is accessible at www.ncbi.nlm.nih.gov/sra from NCBI, at www.ebi.ac.uk/ena from EBI and at trace.ddbj.nig.ac.jp from DDBJ. In this article, we present the content and structure of the SRA and report on updated metadata structures, submission file formats and supported sequencing platforms. We also briefly outline our various responses to the challenge of explosive data growth.

THE SEQUENCE READ ARCHIVE

Massively parallel next-generation sequencing platforms are revolutionizing the life sciences. These instruments produce vastly more sequence data than was ever possible with capillary technology. The National Center for Biotechnology Information (NCBI) started archiving raw sequencing data from next-generation platforms in 2007, followed by the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ) in 2008. In 2009, an international public archival resource for next-generation sequencing data, the Sequence Read Archive (SRA), was established as a part of the International Nucleotide Sequence Database Collaboration (INSDC) (1–3). The mission of the SRA is to help the wider research community gain access to the next-generation sequencing data emanating from scientific research. The SRA serves as core infrastructure for sharing pre-publication sequence data, as required by several large-scale international projects including the Human Microbiome project (https://commonfund.nih.gov/hmp) and the 1000 Genomes project (http://www.1000genomes.org). Note that data requiring authorized access, such as human genome data sequenced under ethical consent agreements, should be submitted to the database of Genotypes and Phenotypes at NCBI (dbGaP, http://www.ncbi.nlm.nih.gov/gap) or to the European Genome-phenome Archive at EBI (EGA, http://www.ebi.ac.uk/ega). Data submitted to dbGaP or EGA is not part of the public SRA; however, summary-level metadata is made available through the SRA.

CONTENT

In 2011 the SRA surpassed 100 Terabases of open-access genetic sequence reads from next-generation sequencing technologies. The Illumina™ platform accounts for 84% of sequenced bases, with the SOLiD™ and Roche/454™ platforms accounting for 12% and 2%, respectively. The most active SRA submitters in terms of submitted bases are the Broad Institute, the Wellcome Trust Sanger Institute and Baylor College of Medicine with 31, 13 and 11%, respectively. The largest individual global project generating next-generation sequence data is the 1000 Genomes project, which has contributed nearly one third of all bases. The most sequenced organisms are Homo sapiens with 61%, human metagenome with 6% and Mus musculus with 5% of all bases. The most common study types in terms of sequenced bases are Whole Genome Sequencing and Re-sequencing, Population Genomics, Metagenomics and Epigenetics with 57, 12, 11 and 8% of all bases, respectively.
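The platform shares above can be turned into approximate base counts with simple arithmetic. The sketch below assumes a total of exactly 100 Terabases, which the text gives only as a lower bound, and treats the quoted percentages as exact even though they are rounded. The function and variable names are illustrative.

```python
# Figures quoted in the text. The 100 Tb total is a lower bound ("surpassed"),
# and the percentages are rounded, so the results are rough estimates only.
TOTAL_TERABASES = 100
platform_share_pct = {"Illumina": 84, "SOLiD": 12, "Roche/454": 2}


def bases_for_share(total_terabases, share_pct):
    """Return the number of terabases corresponding to a percentage share."""
    return total_terabases * share_pct / 100


# Whatever the three named platforms do not cover belongs to all the others.
other_pct = 100 - sum(platform_share_pct.values())

for platform, pct in platform_share_pct.items():
    print(f"{platform}: roughly {bases_for_share(TOTAL_TERABASES, pct):.0f} Tb")
print(f"All other platforms combined: roughly {other_pct}% of bases")
```

Under these assumptions, the three named platforms cover 98% of bases, leaving about 2% for all remaining platforms.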

ACCEPTED DATA

The SRA is a repository of raw sequence data that aims to balance the cost of long-term archival against the requirement to store sufficient information to support re-use of the submitted data. At minimum, data submitted to the SRA must include base calls (or SOLiD color calls) and their qualities. To limit the archival cost, and guided by community consultation, the SRA also sets maximum levels for accepted raw data. For example, since the end of 2010 signal data from the Illumina and SOLiD platforms are no longer archived by the SRA. In addition to base calls (or SOLiD color calls) and quality scores, the SRA also accepts alignment submissions in BAM (4) format. Other data may be accepted as well; full details are available for submitters from NCBI, DDBJ or EBI. Interactive and pipeline submission routes to the SRA archives are available. Functional genomics studies using next-generation sequencing (e.g. ChIP-seq and RNA-seq) can be submitted via the Gene Expression Omnibus at NCBI (http://www.ncbi.nlm.nih.gov/geo) (5), ArrayExpress at EBI (http://www.ebi.ac.uk/arrayexpress) (6) and the DDBJ Omics Archive (http://trace.ddbj.nig.ac.jp/dor) (7).

SUPPORTED PLATFORMS AND FILE FORMATS

The SRA aims to support all established and emerging sequencing platforms and the most commonly used data file formats. Supported platforms include Roche/454 (Roche Diagnostics Corp.), Illumina (Illumina Inc.), SOLiD (Life Technologies Corp.), HeliScope™ Single Molecule Sequencer (Helicos Biosciences Corp.), Complete Genomics™ (Complete Genomics Inc.), SMRT™ (Pacific Biosciences Inc.) and Ion Torrent PGM™ (Life Technologies Corp.). Depending on the data file format, submissions for emerging platforms may at first be supported only provisionally, in which case the submitted data is made available only in the original submitted format. This procedure guarantees early access to data generated by new platforms. Data submitted in any of the widely used data formats is rigorously validated and made available to the public in a variety of formats. For example, NCBI makes data available in the NCBI SRA toolkit format, which can be converted into many other file formats, while EBI and DDBJ make data available in the FASTQ format. Recommended data submission formats may vary slightly between DDBJ, EBI and NCBI, but all widely used formats, such as BAM and Standard Flowgram Format (SFF), are universally accepted.
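FASTQ, one of the formats mentioned above, carries exactly the minimum content the SRA requires: base calls and a quality value per base. The following is a minimal sketch of a reader that checks this pairing. It assumes the common four-line FASTQ layout with no line wrapping and a Phred offset of 33; it is not the validation procedure used by any of the archives, and the names `Read` and `parse_fastq` are illustrative.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    bases: str
    qualities: List[int]


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[Read]:
    """Yield reads from four-line FASTQ text, checking bases against qualities."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a trailing blank line)
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Every base call must have exactly one quality value.
        if len(bases) != len(quality_line):
            raise ValueError(f"{header[1:]}: bases and qualities differ in length")
        qualities = [ord(char) - phred_offset for char in quality_line]
        yield Read(header[1:], bases, qualities)


example = io.StringIO("@read1\nACGT\n+\nIIII\n")
for read in parse_fastq(example):
    print(read.name, read.bases, read.qualities)
```

A record whose quality string is shorter or longer than its base string is rejected, since such a record would not meet the minimum requirement of a quality for each call.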

METADATA MODEL

Data submitted to SRA is organized using a metadata model consisting of six objects: study, sample, experiment, run, analysis and submission. The SRA study contains high-level information including goals of the study and literature references, and may be linked to the INSDC BioProject database. Similarly, the SRA sample object contains detailed sample information, and may be linked to the BioSample databases of NCBI (http://www.ncbi.nlm.nih.gov/biosample) and EBI (http://www.ebi.ac.uk/biosamples). The SRA experiment and run objects contain library and instrument information and are directly associated with the sequence data. The SRA analysis object is used for the deposition of a variety of analysis results including alignments and assemblies. The SRA submission object groups the other objects for submission into the SRA. These metadata objects are all accessioned with unique permanent identifiers that are shared by INSDC partners.
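The six object types can be told apart by their accessions. The sketch below assumes the accession conventions commonly seen at the three archives, which are not described in this article: a first letter indicating the issuing archive (S for NCBI, E for EBI, D for DDBJ), the letter R, a letter for the object type, and a run of digits. The function name and the minimum digit count are illustrative choices.

```python
import re
from typing import Optional, Tuple

# Assumed conventions, not taken from this article.
ARCHIVE_BY_LETTER = {"S": "NCBI", "E": "EBI", "D": "DDBJ"}
OBJECT_BY_LETTER = {
    "A": "submission",
    "P": "study",
    "S": "sample",
    "X": "experiment",
    "R": "run",
    "Z": "analysis",
}
ACCESSION_PATTERN = re.compile(r"^([SED])R([APSXRZ])(\d{6,})$")


def classify_accession(accession: str) -> Optional[Tuple[str, str]]:
    """Return (issuing archive, object type), or None if the pattern does not match."""
    match = ACCESSION_PATTERN.match(accession)
    if match is None:
        return None
    return ARCHIVE_BY_LETTER[match.group(1)], OBJECT_BY_LETTER[match.group(2)]


for accession in ["SRR000001", "ERP000001", "DRX000001", "XYZ123"]:
    print(accession, classify_accession(accession))
```

Because identifiers are shared by the INSDC partners, the first letter only indicates where an object was accessioned, not where it can be retrieved.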

The SRA has updated the metadata model to better represent new sequencing technologies and applications. Schema version 1.3, introduced in 2011, added a new structure called GapDescriptor that encodes the placement of spot subsequences (tags) against a reference or assembly substrate. This structure encodes mate pair gaps and tandem read gaps. Introduction of the GapDescriptor element was motivated by the need to describe Complete Genomics platform sequencing. The next planned metadata version, 2.0, will largely simplify the model by removing redundant and deprecated fields. While this new model will be incompatible with the previous version, the SRA archives will transform all existing metadata documents to conform to the new model. The SRA metadata model is largely shared by all three archives; however, small differences have been introduced to support archive-specific local requirements.

SEQUENCE DATA EXCHANGE

The public sequence data are exchanged between the INSDC partners, allowing all public data to be accessed at each site regardless of the point of original submission (‘submit locally, share globally’ model). The data is currently exchanged in the NCBI SRA toolkit format (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software). The SRA toolkit provides a configurable storage and compression architecture, and its format can be converted to other formats, such as the widely used FASTQ format, through its standard API. The SRA data exchange model follows the long-established INSDC policy of exchanging GenBank, EMBL-Bank and DDBJ entries.

CHALLENGE OF DATA GROWTH

The explosive growth of next-generation sequencing data submitted to the SRA exceeds the growth rate of storage capacity. This trend poses the greatest challenge in handling raw sequence data, for the SRA archives and for users of the data alike. The SRA partners actively discuss and pursue approaches together with user communities to maximize the benefit gained from archiving next-generation sequencing data while minimizing the infrastructure costs. Possible approaches discussed include reference-based compression of sequencing data, quantization of base quality values, selective storage of base quality values, reducing the metadata stored for individual reads (e.g. read names), federation of data in place of data submission and exchange, and consolidation of catastrophe back-up storage across SRA archives. Among these possibilities, the SRA is exploring approaches based on reference alignment and compression of reads, and on the preservation of only the most valuable base quality information (8), and is also actively participating in experiments assessing the effect of quality score quantization. The SRA partners continue to work actively with the research community to explore appropriate data reduction approaches.
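Two of the approaches listed above, quantization of base quality values and reference-based compression, can be illustrated with a toy sketch. The quality bins below are hypothetical and are not values used by the SRA. The reference-based encoding handles substitutions only and ignores insertions, deletions and unaligned reads, so it is a simplified illustration of the idea rather than the method of reference (8).

```python
# Hypothetical bins: (lowest quality, highest quality, representative value).
QUALITY_BINS = [(0, 9, 5), (10, 19, 15), (20, 29, 25), (30, 93, 35)]


def quantize_qualities(qualities):
    """Replace each quality with its bin's representative value (lossy)."""
    quantized = []
    for quality in qualities:
        for low, high, representative in QUALITY_BINS:
            if low <= quality <= high:
                quantized.append(representative)
                break
        else:
            raise ValueError(f"quality {quality} is outside all bins")
    return quantized


def encode_against_reference(read, reference, position):
    """Store a read as (position, length, mismatches) instead of its bases."""
    segment = reference[position:position + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (base, ref_base) in enumerate(zip(read, segment))
        if base != ref_base
    ]
    return position, len(read), mismatches


def decode_against_reference(reference, position, length, mismatches):
    """Rebuild the read from the reference and the stored mismatches."""
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTAC"
encoded = encode_against_reference("GTTC", reference, 2)
print(encoded)
print(decode_against_reference(reference, *encoded))
print(quantize_qualities([2, 12, 27, 40]))
```

In this sketch a read that matches the reference closely is stored as a position, a length and a short mismatch list, and the quality values collapse onto four distinct numbers, which makes them far more compressible at the cost of some precision.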

FUNDING

DNA Data Bank of Japan, Ministry of Education, Culture, Sports, Science and Technology of Japan; European Molecular Biology Laboratory, European Commission and the Wellcome Trust; National Library of Medicine; Intramural Research Program of the NIH. Funding for open access charge: Ministry of Education, Culture, Sports, Science and Technology of Japan (management expense grant).

Conflict of interest statement. None declared.

REFERENCES

1. Archiving next generation sequencing data. Nucleic Acids Res., 2010, vol. 38 (pg. D870-D871)

2. The sequence read archive. Nucleic Acids Res., 2011, vol. 39 (pg. D19-D21)

3. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res., 2012, vol. 40 (pg. D33-D37)

4. 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009, vol. 25 (pg. 2078-2079)

5. NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res., 2011, vol. 39 (pg. D1005-D1010)

6. ArrayExpress update–an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res., 2011, vol. 39 (pg. D1002-D1004)

7. The DNA Data Bank of Japan launches a new resource DDBJ Omics Archive of functional genomics experiments. Nucleic Acids Res., in press

8. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res., 2011, vol. 21 (pg. 734-740)

© The Author(s) 2011. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
