ArrayExpress update—trends in database growth and links to data analysis tools (original) (raw)
Journal Article
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
*To whom correspondence should be addressed. Tel: +44 1223 492539; Fax:
+44 1223 494468
; Email: gabry@ebi.ac.uk
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
,
1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Search for other works by this author on:
Received:
19 October 2012
Revision received:
26 October 2012
Accepted:
28 October 2012
Published:
26 November 2012
Cite
Gabriella Rustici, Nikolay Kolesnikov, Marco Brandizi, Tony Burdett, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Emma Hastings, Jon Ison, Maria Keays, Natalja Kurbatova, James Malone, Roby Mani, Annalisa Mupo, Rui Pedro Pereira, Ekaterina Pilicheva, Johan Rung, Anjan Sharma, Y. Amy Tang, Tobias Ternent, Andrew Tikhonov, Danielle Welter, Eleanor Williams, Alvis Brazma, Helen Parkinson, Ugis Sarkans, ArrayExpress update—trends in database growth and links to data analysis tools, Nucleic Acids Research, Volume 41, Issue D1, 1 January 2013, Pages D987–D990, https://doi.org/10.1093/nar/gks1174
Close
Navbar Search Filter Mobile Enter search term Search
Abstract
The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications. It accepts data generated by sequencing or array-based technologies and currently contains data from almost a million assays, from over 30 000 experiments. The proportion of sequencing-based submissions has grown significantly over the last 2 years and has reached, in 2012, 15% of all new data. All data are available from ArrayExpress in MAGE-TAB format, which allows robust linking to data analysis and visualization tools, including Bioconductor and GenomeSpace. Additionally, R objects, for microarray data, and binary alignment format files, for sequencing data, have been generated for a significant proportion of ArrayExpress data.
INTRODUCTION
The ArrayExpress Archive of Functional Genomics Data (1) is one of the major international repositories for functional genomics high throughput data, supporting publications as well as various data generating consortia. It stores functional genomics data derived from high throughput sequencing (HTS) and microarray-based experiments. Users come to ArrayExpress to (i) find functional genomics experiments that might be relevant to their research; (ii) retrieve information describing these experiments and the data associated with them; (iii) retrieve data for including in their own local data warehouses or added value databases; and (iv) submit their own data supporting a peer-reviewed publication.
Once submitted, data may be kept in ArrayExpress as private for a limited period of time, typically during the peer-review process of the related publication. Upon submission, an accession number is assigned to it and access to the data is restricted to providers/reviewers via a login system. The submitter specifies the release date and the data becomes public either when the accession number associated with the data is cited in a publication or at the set release date, whichever comes first.
All submissions are automatically checked for compliance to the Minimum Information About a Microarray Experiments (MIAME) (2) or Minimum Information about Sequencing Experiments (MINSEQE – http://www.fged.org/projects/minseqe/) guidelines, for microarray and sequencing-based experiments, respectively. The MIAME/MINSEQE scores associated with an experiment are displayed in the ArrayExpress interface and provided to submitters.
In addition to the data submitted directly to ArrayExpress, data from the Gene Expression Omnibus (GEO) (3) are imported to provide users with a single access to most of the functional genomics data available in the public domain. All data are organized, and available for download, in a structured and standardized format, MAGE-TAB (4), which also facilitates linking to open source analysis environments such as Bioconductor (5) and GenomeSpace (http://www.genomespace.org). A format conversion tool, from GEO SOFT to MAGE-TAB (6), is run on all GEO HTS and microarray data. The conversion is successful in 83% of cases; there are various reasons why this conversion may fail, including failure to parse SOFT files correctly or failure to retrieve the associated data files and we are constantly working with GEO to increase the success rate. All HTS data are exchanged with GEO and a data sharing agreement with the DDBJ Omics Archive is also in place (7).
For all experiments, the column labels describing the sample (e.g. disease) and its characteristics (e.g. type II diabetes) are mapped to the EBI's Experimental Factor Ontology (EFO) (8) and the data loaded into ArrayExpress. This allows consistent query results to be returned from direct submissions as well as imported data. As data are curated for Gene Expression Atlas use (9), they are reloaded into ArrayExpress with enriched annotation.
The ArrayExpress user interface allows users to search for experiments of interest by keywords and ontology terms, which enable semantically driven searches of the experimental metadata; for instance searching with the EFO term ‘cancer’ will also find experiments investigating ‘leukemia’ even if ‘cancer’ is not mentioned explicitly. Both US and UK spelling is supported.
DATA GROWTH TO A MILLION ASSAYS
Over the last 2 years, the database content has grown from 13 000 experiments and 370 000 assays, to over 30 000 experiments and almost a million assays. Approximately 20% of the data were submitted directly to ArrayExpress; the rest are imported from GEO weekly.
Although HTS-based experiments account for only 6% of the entire database content, the proportion of new HTS submissions has been growing exponentially over the last few years, from 2% in 2009 to 6% in 2010, 7% in 2011 and 15% in 2012. Nevertheless, the total number of assays associated with HTS-based experiments is still only 3%, reflecting the fact that HTS experiments are typically smaller than microarray-based experiments. If we look at a breakdown of the HTS data by application, 50% of the experiments used RNA-seq only, 32% ChIP-seq only and the remaining experiments either utilized more than one application or used DNA-seq for genotyping, copy number variation detection or methylation profiling.
For HTS data, ArrayExpress stores processed data and metadata describing the sample properties and the experimental design, including experimental variables and protocols, whereas raw sequence data are stored in the European nucleotide archive (ENA) (10) and linked from ArrayExpress. For datasets that require controlled access, the raw sequence data are stored in, and should be submitted directly to, the European Genome-phenome Archive (EGA – www.ebi.ac.uk/ega).
LINKS TO DATA ANALYSIS TOOLS
Approximately 50 GB of data are downloaded every day from ArrayExpress, by an average of 1000 different users. To simplify the interface between ArrayExpress and analytical platforms, we are now providing links to popular analytical tools such as Bioconductor and GenePattern (11), as well as developing robust internal pipelines for HTS data processing.
To facilitate loading microarray data from ArrayExpress into Bioconductor, we have pre-generated R objects for 16 250 out of 25 000 gene expression microarray experiments with raw data files available. A revised version of the Bioconductor package ArrayExpress (12) is used with default parameters. The package has been updated to support popular data formats including Affymetrix and Agilent. More than 85% of Affymetrix data in the repository have downloadable R objects. Older submissions, other technologies and experiments with only processed data available can still be loaded in R, but require user-specified settings for the package to recognize the data format, so loading must be supervised by a user. All pregenerated R objects are now available through the ArrayExpress interface and can be easily loaded into Bioconductor for downstream analysis. More R objects will be created for experiments in ArrayExpress as more data arrive, and the R package will be maintained and extended for this purpose.
Direct links are now provided to GenomeSpace (http://www.genomespace.org), a data analysis environment that makes it possible for users to move data smoothly between popular bioinformatics tools. From ArrayExpress, the user can, with a single click, load a dataset into GenomeSpace, provided that he/she has a registered account with GenomeSpace. Once logged in, the user will be able to utilize the data analysis tools available through GenomeSpace, including GenePattern, Galaxy (13) and Cytoscape (14), to perform data analysis.
For HTS data, the Bioconductor package ArrayExpressHTS (15) and the R-workbench (http://www.ebi.ac.uk/Tools/rcloud/) are used to generate binary alignment (BAM) format files (16). BAM files contain sequence alignment data and can be displayed using the Ensembl genome browser (17), through a direct link from ArrayExpress. So far approximately 1200 BAM files are available for 125 RNA-seq experiments, for 14 different species, with over half of these data studying human and a quarter mouse. The BAM file generation has been done for experiments for which: (i) the sample–data relationship information is available and contains details such as the library strategy and the experiment type (i.e. RNA-seq); (ii) the raw sequence reads (FASTQ files) are deposited in ENA and a valid link to the ENA entry is present; and (iii) the annotation for the reference genome is available in Ensembl.
In addition, 3000 datasets from ArrayExpress have been analysed and the results of this analysis are presented through the Gene Expression Atlas (9), a separate EBI database, which helps users to (i) find out whether the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) change(s) across all the experiments or (ii) discover which genes are differentially expressed in a particular biological condition of interest.
CONTINUOUS USER INTERFACE IMPROVEMENTS
The ArrayExpress user interface has been continuously improved since the repository was established in 2003 (18). Recent additions include the sample–data relationship viewer (Figure 1), which provides an overview of all samples used in an experiment and their characteristics, the experimental variables (factors) investigated and the data files associated with each sample.
Figure 1.
Sample–data relationship viewer for Experiment E-MTAB-513. This view provides information on sample characteristics and experimental variables that are fundamental to understand the results obtained in the experiment. Generally, each row corresponds to a sample. Columns include sample characteristics and their relationship to the resulting data files, providing a quick view over the structure of the experiment and the biological questions that the authors addressed. The last column provides links to raw sequence data files available in ENA, and BAM files that can be visualized in the Ensembl genome browser.
Other improvements include (i) improved array designs browsing and querying for; (ii) specific features for HTS data display; (iii) better organization of the species drop-down filter, and (iv) improved performance for retrieving and visualizing large experiments.
The ArrayExpress user documentation has recently been updated and several online courses, covering how to search, interpret and submit data to ArrayExpress, can be found on the EBI e-Learning portal, Train online (http://www.ebi.ac.uk/training/online/).
FUTURE DEVELOPMENTS
We are currently developing a new submission tool, optimized for supporting HTS data submissions; this new tool is based on the community developed annotation tool Annotare (19) and will be released in 2013.
Like all other major EBI data resources, ArrayExpress is working toward deeper integration in the overall EBI infrastructure, in particular with the BioSample Database (20), the Gene Expression Atlas and the sequence databases ENA, EGA and Ensembl. We will continue this integration effort to ensure that our users can obtain a systems level view of the data stored at EBI by easily navigating through our resources.
FUNDING
ArrayExpress and related activities are supported by member states of the European Molecular Biology Laboratory; European Commission: ENGAGE [201413], EurocanPlatform [260791], GEUVADIS [261123], SLING [226073], SYBARIS [242220], and Gen2Phen [200754]; US National Institutes of Health (the National Human Genome Research Institute, National Institute of Biomedical Imaging and Bioengineering and the National Cancer Institute) [P41 HG003619]; National Center for Biomedical Ontology, one of the National Centers for Biomedical Computing supported by the National Human Genome Research Institute, the National Heart, Lung, and Blood Institute, and National Institutes of Health Common Fund [U54-HG004028]; National Science Foundation Award Number [1127112]. Funding for open access charge: EMBL Members states.
Conflict of interest statement. None declared.
REFERENCES
1
et al.
ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments
,
Nucleic Acids Res.
,
2011
, vol.
39
(pg.
D1002
-
D1004
)
2
et al.
Minimum information about a microarray experiment (MIAME)—toward standards for microarray data
,
Nat. Genetics
,
2001
, vol.
29
(pg.
365
-
371
)
3
et al.
NCBI GEO: archive for functional genomics data sets–10 years on
,
Nucleic Acids Res.
,
2011
, vol.
39
Suppl. 1
(pg.
D1005
-
D1010
)
4
et al.
A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB
,
BMC Bioinformatics
,
2006
, vol.
7
pg.
489
5
et al.
Bioconductor: open software development for computational biology and bioinformatics
,
Genome Biol.
,
2004
, vol.
5
pg.
R80
6
MAGETabulator, a suite of tools to support the microarray data format MAGE-TAB
,
Bioinformatics
,
2009
, vol.
25
(pg.
279
-
280
)
7
The DNA Data Bank of Japan launches a new resource, the DDBJ omics archive of functional genomics experiments
,
Nucleic Acids Res.
,
2012
, vol.
40
(pg.
D38
-
D42
)
8
Modeling sample variables with an experimental factor ontology
,
Bioinformatics
,
2010
, vol.
26
(pg.
1112
-
1118
)
9
et al.
Gene expression atlas update—a value-added database of microarray and sequencing-based functional genomics experiments
,
Nucleic Acids Res.
,
2012
, vol.
40
(pg.
D1077
-
D1081
)
10
et al.
Petabyte-scale innovations at the European nucleotide archive
,
Nucleic Acids Res.
,
2009
, vol.
37
(pg.
D19
-
D25
)
11
GenePattern 2.0
,
Nat. Genet.
,
2006
, vol.
38
(pg.
500
-
501
)
12
Importing ArrayExpress datasets into R/Bioconductor
,
Bioinformatics
,
2009
, vol.
25
(pg.
2092
-
2094
)
13
Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences
,
Genome Biol.
,
2010
, vol.
11
pg.
R86
14
Cytoscape: a software environment for integrated models of biomolecular interaction networks
,
Genome Res.
,
2003
, vol.
13
(pg.
2498
-
2504
)
15
A pipeline for RNA-seq data processing and quality assessment
,
Bioinformatics
,
2011
, vol.
27
(pg.
867
-
869
)
16
The sequence alignment/map format and SAMtools
,
Bioinformatics
,
2009
, vol.
25
(pg.
2078
-
2079
)
17
et al.
Ensembl 2012
,
Nucleic Acids Res.
,
2012
, vol.
40
(pg.
D84
-
D90
)
18
et al.
ArrayExpress—a public repository for microarray gene expression data at the EBI
,
Nucleic Acids Res.
,
2003
, vol.
31
(pg.
68
-
71
)
19
et al.
Annotare—a tool for annotating high-throughput biomedical investigations and resulting data
,
Bioinformatics
,
2010
, vol.
26
(pg.
2470
-
2471
)
20
The BioSample Database (BioSD) at the European Bioinformatics Institute
,
Nucleic Acids Res.
,
2012
, vol.
40
(pg.
D64
-
D70
)
© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.
I agree to the terms and conditions. You must accept the terms and conditions.
Submit a comment
Name
Affiliations
Comment title
Comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.
Citations
Views
Altmetric
Metrics
Total Views 3,315
2,629 Pageviews
686 PDF Downloads
Since 12/1/2016
Month: | Total Views: |
---|---|
December 2016 | 7 |
January 2017 | 2 |
February 2017 | 13 |
March 2017 | 15 |
April 2017 | 13 |
May 2017 | 14 |
June 2017 | 8 |
July 2017 | 8 |
August 2017 | 10 |
September 2017 | 11 |
October 2017 | 11 |
November 2017 | 13 |
December 2017 | 43 |
January 2018 | 33 |
February 2018 | 32 |
March 2018 | 43 |
April 2018 | 56 |
May 2018 | 25 |
June 2018 | 33 |
July 2018 | 27 |
August 2018 | 21 |
September 2018 | 27 |
October 2018 | 41 |
November 2018 | 29 |
December 2018 | 22 |
January 2019 | 19 |
February 2019 | 45 |
March 2019 | 56 |
April 2019 | 62 |
May 2019 | 44 |
June 2019 | 31 |
July 2019 | 27 |
August 2019 | 30 |
September 2019 | 24 |
October 2019 | 35 |
November 2019 | 31 |
December 2019 | 24 |
January 2020 | 32 |
February 2020 | 16 |
March 2020 | 37 |
April 2020 | 21 |
May 2020 | 15 |
June 2020 | 21 |
July 2020 | 27 |
August 2020 | 44 |
September 2020 | 50 |
October 2020 | 38 |
November 2020 | 40 |
December 2020 | 35 |
January 2021 | 37 |
February 2021 | 47 |
March 2021 | 65 |
April 2021 | 51 |
May 2021 | 52 |
June 2021 | 36 |
July 2021 | 27 |
August 2021 | 35 |
September 2021 | 40 |
October 2021 | 26 |
November 2021 | 50 |
December 2021 | 36 |
January 2022 | 40 |
February 2022 | 37 |
March 2022 | 45 |
April 2022 | 48 |
May 2022 | 39 |
June 2022 | 35 |
July 2022 | 43 |
August 2022 | 33 |
September 2022 | 50 |
October 2022 | 55 |
November 2022 | 41 |
December 2022 | 46 |
January 2023 | 37 |
February 2023 | 48 |
March 2023 | 49 |
April 2023 | 62 |
May 2023 | 40 |
June 2023 | 44 |
July 2023 | 30 |
August 2023 | 46 |
September 2023 | 29 |
October 2023 | 37 |
November 2023 | 36 |
December 2023 | 43 |
January 2024 | 47 |
February 2024 | 51 |
March 2024 | 72 |
April 2024 | 47 |
May 2024 | 48 |
June 2024 | 38 |
July 2024 | 28 |
August 2024 | 28 |
September 2024 | 37 |
October 2024 | 43 |
Citations
284 Web of Science
×
Email alerts
Citing articles via
More from Oxford Academic