The sequence read archive: explosive growth of sequencing data
Journal Article
1Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems, Yata, Mishima 411-8540, Japan, 2National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA and 3European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
*To whom correspondence should be addressed. Tel: +81 55 981 6853; Fax: +81 55 981 6849; Email: ykodama@genes.nig.ac.jp
on behalf of the International Nucleotide Sequence Database Collaboration
Received: 15 September 2011; Accepted: 23 September 2011; Published: 18 October 2011
Yuichi Kodama, Martin Shumway, Rasko Leinonen, on behalf of the International Nucleotide Sequence Database Collaboration, The sequence read archive: explosive growth of sequencing data, Nucleic Acids Research, Volume 40, Issue D1, 1 January 2012, Pages D54–D56, https://doi.org/10.1093/nar/gkr854
Abstract
New generation sequencing platforms are producing data with significantly higher throughput and lower cost. A portion of this capacity is devoted to individual and community scientific projects. As these projects reach publication, raw sequencing datasets are submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). Archiving experimental data is the key to the progress of reproducible science. The SRA was established as a public repository for next-generation sequence data as a part of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC is composed of the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). The SRA is accessible at www.ncbi.nlm.nih.gov/sra from NCBI, at www.ebi.ac.uk/ena from EBI and at trace.ddbj.nig.ac.jp from DDBJ. In this article, we present the content and structure of the SRA and report on updated metadata structures, submission file formats and supported sequencing platforms. We also briefly outline our various responses to the challenge of explosive data growth.
THE SEQUENCE READ ARCHIVE
Massively parallel next-generation sequencing platforms are revolutionizing life sciences. These instruments are producing vastly more sequence data than was ever possible with capillary technology. The National Center for Biotechnology Information (NCBI) started archiving raw sequencing data from next-generation platforms in 2007, followed by the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ) in 2008. In 2009, an international public archival resource for next-generation sequencing data, the ‘Sequence Read Archive (SRA)’, was established as a part of the International Nucleotide Sequence Database Collaboration (INSDC) (1–3). The mission of the SRA is to help the wider research community gain access to the next-generation sequencing data emanating from scientific research. The SRA works as a core infrastructure for sharing pre-publication sequence data as required by several large-scale international projects, including the Human Microbiome project (https://commonfund.nih.gov/hmp) and the 1000 Genomes project (http://www.1000genomes.org). Note that data requiring authorized access, such as human genome sequences obtained under ethical consent agreements, should be submitted to the database of Genotypes and Phenotypes at NCBI (dbGaP, http://www.ncbi.nlm.nih.gov/gap) or to the European Genome-phenome Archive at EBI (EGA, http://www.ebi.ac.uk/ega). Data submitted to dbGaP or EGA is not part of the public SRA. However, summary-level metadata is made available through SRA.
CONTENT
In 2011 the SRA surpassed 100 Terabases of open-access genetic sequence reads from next-generation sequencing technologies. The Illumina™ platform accounts for 84% of sequenced bases, with the SOLiD™ and Roche/454™ platforms accounting for 12% and 2%, respectively. The most active SRA submitters in terms of submitted bases are the Broad Institute, the Wellcome Trust Sanger Institute and Baylor College of Medicine with 31, 13 and 11%, respectively. The largest individual global project generating next-generation sequence data is the 1000 Genomes project, which has contributed nearly one third of all bases. The most sequenced organisms are Homo sapiens with 61%, human metagenome with 6% and Mus musculus with 5% share of all bases. The most common study types in terms of sequenced bases are Whole Genome Sequencing and Re-sequencing, Population Genomics, Metagenomics and Epigenetics with 57, 12, 11 and 8% share of all bases, respectively.
ACCEPTED DATA
The SRA is a repository of raw sequence data that aims to balance the cost of long-term archival against the requirement to store sufficient information to support re-use of the submitted data. At minimum, data submitted to SRA must include base or SOLiD color calls and their qualities. To limit the archival cost, and guided by community consultation, the SRA also sets maximum levels for accepted raw data. For example, since the end of 2010 signal data from the Illumina and SOLiD platforms are no longer archived by the SRA. In addition to base calls (or SOLiD color calls) and quality scores, SRA also accepts alignment submissions in BAM (4) format. Other data may be accepted as well; full details are available for submitters from NCBI, DDBJ or EBI. Interactive and pipeline submission routes to the SRA archives are available. Functional genomics studies using next-generation sequencing (e.g. ChIP-seq and RNA-seq) can be submitted via the Gene Expression Omnibus at NCBI (http://www.ncbi.nlm.nih.gov/geo) (5), ArrayExpress at EBI (http://www.ebi.ac.uk/arrayexpress) (6) and DDBJ Omics Archive (http://trace.ddbj.nig.ac.jp/dor) (7).
SUPPORTED PLATFORMS AND FILE FORMATS
The SRA aims to support all established and emerging sequencing platforms and the most commonly used data file formats. Supported platforms include Roche/454 (Roche Diagnostics Corp.), Illumina (Illumina Inc.), SOLiD (Life Technologies Corp.), HeliScope™ Single Molecule Sequencer (Helicos Biosciences Corp.), Complete Genomics™ (Complete Genomics Inc.), SMRT™ (Pacific Biosciences Inc.) and Ion Torrent PGM™ (Life Technologies Corp.). Depending on the data file format, submissions for emerging platforms may at first be supported only provisionally, in which case submitted data is made available only in the original submitted format. This procedure guarantees early access to data generated by new platforms. Data submitted in any of the widely used data formats is rigorously validated and made available to the public in a variety of formats. For example, NCBI makes data available in the NCBI SRA toolkit format, which can be converted into many other file formats, while EBI and DDBJ make data available in the FASTQ format. Recommended data submission formats may vary slightly between DDBJ, EBI and NCBI, but all widely used formats, such as BAM and Standard Flowgram Format (SFF), are universally accepted.
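As an illustration of the base calls and quality scores carried by FASTQ files, the following Python sketch reads four-line FASTQ records and decodes their quality strings. It is not part of any SRA tool: the names `FastqRecord` and `parse_fastq` are hypothetical, the sketch assumes unwrapped four-line records, and it assumes the Phred+33 quality encoding (some older Illumina files use an offset of 64 instead).

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    bases: str
    qualities: List[int]  # Phred scores, one per base


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes no line wrapping)."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, plus, quals = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(bases) != len(quals):
            raise ValueError(f"base/quality length mismatch in {header}")
        # Each quality character encodes one Phred score as ord(char) - offset.
        yield FastqRecord(header[1:], bases, [ord(c) - offset for c in quals])


# Made-up example record, not real SRA data.
example = ["@read1", "ACGT", "+", "II5!"]
records = list(parse_fastq(example))
```

The length check mirrors the minimum requirement stated earlier that every base call must be accompanied by a quality value.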
METADATA MODEL
Data submitted to SRA is organized using a metadata model consisting of six objects: study, sample, experiment, run, analysis and submission. The SRA study contains high-level information including goals of the study and literature references, and may be linked to the INSDC BioProject database. Similarly, the SRA sample object contains detailed sample information, and may be linked to the BioSample databases of NCBI (http://www.ncbi.nlm.nih.gov/biosample) and EBI (http://www.ebi.ac.uk/biosamples). The SRA experiment and run objects contain library and instrument information and are directly associated with the sequence data. The SRA analysis object is used for the deposition of a variety of analysis results including alignments and assemblies. The SRA submission object groups the other objects for submission into the SRA. These metadata objects are all accessioned with unique permanent identifiers that are shared by INSDC partners.
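The six objects described above can be sketched as plain Python data classes. This is only an illustration of how the objects relate, not the SRA XML schema: the field names, the links between experiment, study, sample and run, and the accession strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Study:
    accession: str
    title: str


@dataclass
class Sample:
    accession: str
    organism: str


@dataclass
class Experiment:
    accession: str
    study: Study      # assumed link: an experiment belongs to one study
    sample: Sample    # assumed link: an experiment sequences one sample
    platform: str
    library_name: str


@dataclass
class Run:
    accession: str
    experiment: Experiment  # assumed link: a run belongs to one experiment
    data_files: List[str] = field(default_factory=list)


@dataclass
class Analysis:
    accession: str
    study: Study
    description: str


@dataclass
class Submission:
    """Groups the other objects for submission, as described in the text."""
    accession: str
    objects: List[object] = field(default_factory=list)

    def count_by_type(self) -> dict:
        counts: dict = {}
        for obj in self.objects:
            name = type(obj).__name__
            counts[name] = counts.get(name, 0) + 1
        return counts


# Placeholder identifiers; real SRA accessions follow a different scheme.
study = Study("STUDY-0001", "Example resequencing study")
sample = Sample("SAMPLE-0001", "Homo sapiens")
experiment = Experiment("EXP-0001", study, sample, "Illumina", "lib-1")
run = Run("RUN-0001", experiment, ["reads_1.fastq"])
submission = Submission("SUB-0001", [study, sample, experiment, run])
```

Keeping study and sample as separate objects lets several experiments refer to the same study or sample without repeating its description.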
The SRA has updated the metadata model to better represent new sequencing technologies and applications. Schema version 1.3, introduced in 2011, added a new structure called GapDescriptor that encodes the placement of spot subsequences (tags) against a reference or assembly substrate. This structure encodes mate pair gaps and tandem read gaps. Introduction of the GapDescriptor element was motivated by the need to describe Complete Genomics platform sequencing. The next planned metadata version, 2.0, will substantially simplify the model by removing redundant and deprecated fields. While this new model will be incompatible with the previous version, the SRA archives will transform all existing metadata documents to conform to the new model. The SRA metadata model is largely shared by all three archives; however, small differences have been introduced to support archive-specific local requirements.
SEQUENCE DATA EXCHANGE
The public sequence data are exchanged between the INSDC partners, allowing all public data to be accessed at each site regardless of the point of original submission (the ‘submit locally, share globally’ model). The data are currently exchanged in the NCBI SRA toolkit format (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software). The SRA toolkit provides a configurable storage and compression architecture, and its format can be converted to other formats, such as the widely used FASTQ, through its standard API. The SRA data exchange model follows the long-established INSDC policy of exchanging GenBank, EMBL-Bank and DDBJ entries.
CHALLENGE OF DATA GROWTH
The explosive growth of next-generation sequencing data submitted into the SRA exceeds the growth rate of storage capacity. This trend poses the greatest challenge in handling raw sequence data, both for the SRA archives and for users of the data. The SRA partners actively discuss and pursue approaches together with user communities to maximize the benefit gained from archiving next-generation sequencing data while minimizing the infrastructure costs. Possible approaches discussed include reference-based compression of sequencing data, quantization of base quality values, selective storage of base quality values, reducing the metadata stored for individual reads (e.g. read names), federation of data in place of data submission and exchange, and consolidation of catastrophe back-up storage across SRA archives. Among these possibilities, SRA is exploring approaches based on reference alignment and compression of reads, and on the preservation of only the most valuable base quality information (8), and is also actively participating in experiments assessing the effect of quality score quantization. The SRA partners continue to work actively with the research community to explore appropriate data reduction approaches.
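Two of the approaches listed above, quality score quantization and reference-based compression, can be illustrated with a toy Python sketch. It is a simplified illustration under stated assumptions, not the method of reference (8) or any SRA implementation: the bin boundaries in `BINS` are hypothetical, the function names are invented for the example, and the read encoding handles substitutions only (no insertions, deletions or unaligned reads).

```python
from typing import List, Tuple

# Hypothetical bins: (lowest score in bin, representative score stored).
BINS: List[Tuple[int, int]] = [(0, 2), (10, 15), (20, 25), (30, 35)]

EncodedRead = Tuple[int, int, List[Tuple[int, str]]]


def quantize(scores: List[int]) -> List[int]:
    """Map each quality score to the representative value of its bin."""
    out = []
    for q in scores:
        rep = BINS[0][1]
        for low, value in BINS:
            if q >= low:
                rep = value
        out.append(rep)
    return out


def encode_read(reference: str, position: int, read: str) -> EncodedRead:
    """Store a read as alignment position, length and mismatches only."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return position, len(read), diffs


def decode_read(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the original read from the reference and stored mismatches."""
    position, length, diffs = encoded
    bases = list(reference[position:position + length])
    for i, base in diffs:
        bases[i] = base
    return "".join(bases)


# Made-up reference and read with one substitution at read offset 4.
reference = "ACGTACGTAC"
encoded = encode_read(reference, 2, "GTACCT")
restored = decode_read(reference, encoded)
binned = quantize([0, 9, 10, 29, 40])
```

A read that matches the reference closely is reduced to a position, a length and a short mismatch list, while quantization shrinks the number of distinct quality values so that they compress better. Quantization is lossy, whereas the read encoding here is lossless.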
FUNDING
DNA Data Bank of Japan, Ministry of Education, Culture, Sports, Science and Technology of Japan; European Molecular Biology Laboratory, European Commission and the Wellcome Trust; National Library of Medicine; Intramural Research Program of the NIH. Funding for open access charge: Ministry of Education, Culture, Sports, Science and Technology of Japan (management expense grant).
Conflict of interest statement. None declared.
REFERENCES
1. Archiving next generation sequencing data. Nucleic Acids Res., 2010, vol. 38 (pg. D870-D871).
2. The sequence read archive. Nucleic Acids Res., 2011, vol. 39 (pg. D19-D21).
3. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res., 2012, vol. 40 (pg. D33-D37).
4. 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009, vol. 25 (pg. 2078-2079).
5. NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res., 2011, vol. 39 (pg. D1005-D1010).
6. ArrayExpress update–an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res., 2011, vol. 39 (pg. D1002-D1004).
7. The DNA Data Bank of Japan launches a new resource DDBJ Omics Archive of functional genomics experiments. Nucleic Acids Res., in press.
8. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res., 2011, vol. 21 (pg. 734-740).
© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.