ArrayExpress update—trends in database growth and links to data analysis tools

1Functional Genomics Team, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK

*To whom correspondence should be addressed. Tel: +44 1223 492539; Fax: +44 1223 494468; Email: gabry@ebi.ac.uk


Received: 19 October 2012; Revision received: 26 October 2012; Accepted: 28 October 2012; Published: 26 November 2012


Gabriella Rustici, Nikolay Kolesnikov, Marco Brandizi, Tony Burdett, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Emma Hastings, Jon Ison, Maria Keays, Natalja Kurbatova, James Malone, Roby Mani, Annalisa Mupo, Rui Pedro Pereira, Ekaterina Pilicheva, Johan Rung, Anjan Sharma, Y. Amy Tang, Tobias Ternent, Andrew Tikhonov, Danielle Welter, Eleanor Williams, Alvis Brazma, Helen Parkinson, Ugis Sarkans, ArrayExpress update—trends in database growth and links to data analysis tools, Nucleic Acids Research, Volume 41, Issue D1, 1 January 2013, Pages D987–D990, https://doi.org/10.1093/nar/gks1174

Abstract

The ArrayExpress Archive of Functional Genomics Data (http://www.ebi.ac.uk/arrayexpress) is one of three international functional genomics public data repositories, alongside the Gene Expression Omnibus at NCBI and the DDBJ Omics Archive, supporting peer-reviewed publications. It accepts data generated by sequencing or array-based technologies and currently contains data from almost a million assays, from over 30 000 experiments. The proportion of sequencing-based submissions has grown significantly over the last 2 years and has reached, in 2012, 15% of all new data. All data are available from ArrayExpress in MAGE-TAB format, which allows robust linking to data analysis and visualization tools, including Bioconductor and GenomeSpace. Additionally, R objects, for microarray data, and binary alignment format files, for sequencing data, have been generated for a significant proportion of ArrayExpress data.

INTRODUCTION

The ArrayExpress Archive of Functional Genomics Data (1) is one of the major international repositories for high-throughput functional genomics data, supporting publications as well as various data-generating consortia. It stores functional genomics data derived from high-throughput sequencing (HTS) and microarray-based experiments. Users come to ArrayExpress to (i) find functional genomics experiments that might be relevant to their research; (ii) retrieve information describing these experiments and the data associated with them; (iii) retrieve data for inclusion in their own local data warehouses or added-value databases; and (iv) submit their own data supporting a peer-reviewed publication.

Once submitted, data may be kept private in ArrayExpress for a limited period of time, typically during the peer-review process of the related publication. Upon submission, each dataset is assigned an accession number, and access to the data is restricted to providers/reviewers via a login system. The submitter specifies the release date, and the data become public either when the accession number associated with the data is cited in a publication or at the set release date, whichever comes first.
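The "whichever comes first" release rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the ArrayExpress implementation; the function names `public_from` and `is_public` and the example dates are hypothetical.

```python
from datetime import date
from typing import Optional


def public_from(release_date: date, first_cited: Optional[date]) -> date:
    """Date on which a submission becomes public.

    Hypothetical helper: the data are released at the submitter's release
    date or when the accession number is first cited, whichever is earlier.
    """
    if first_cited is None:
        return release_date
    return min(release_date, first_cited)


def is_public(today: date, release_date: date,
              first_cited: Optional[date] = None) -> bool:
    """True once the earlier of the two release triggers has passed."""
    return today >= public_from(release_date, first_cited)
```

For example, a dataset with a release date of 1 June 2013 that is cited on 15 March 2013 would be treated as public from 15 March 2013 under this reading.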

All submissions are automatically checked for compliance with the Minimum Information About a Microarray Experiment (MIAME) (2) or Minimum Information about Sequencing Experiments (MINSEQE – http://www.fged.org/projects/minseqe/) guidelines, for microarray and sequencing-based experiments, respectively. The MIAME/MINSEQE scores associated with an experiment are displayed in the ArrayExpress interface and provided to submitters.

In addition to the data submitted directly to ArrayExpress, data from the Gene Expression Omnibus (GEO) (3) are imported to give users a single point of access to most of the functional genomics data available in the public domain. All data are organized, and available for download, in a structured and standardized format, MAGE-TAB (4), which also facilitates linking to open source analysis environments such as Bioconductor (5) and GenomeSpace (http://www.genomespace.org). A format conversion tool, from GEO SOFT to MAGE-TAB (6), is run on all GEO HTS and microarray data. The conversion succeeds in 83% of cases; failures arise for various reasons, including SOFT files that cannot be parsed correctly and associated data files that cannot be retrieved, and we are constantly working with GEO to increase the success rate. All HTS data are exchanged with GEO and a data sharing agreement with the DDBJ Omics Archive is also in place (7).
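Because MAGE-TAB is a tab-delimited, spreadsheet-style format, its sample annotation can be read with standard tools. The sketch below parses a small, invented sample-data relationship table and groups data files by an experimental variable. The sample rows, file names and the helper names `read_sdrf` and `files_by_factor` are assumptions made for illustration; only the general column naming pattern (`Characteristics[...]`, `Factor Value[...]`) follows MAGE-TAB conventions.

```python
import csv
import io

# Invented example content; not taken from any ArrayExpress experiment.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[disease]\tArray Data File\n"
    "sample 1\tHomo sapiens\ttype II diabetes\tsample1.CEL\n"
    "sample 2\tHomo sapiens\tnormal\tsample2.CEL\n"
)


def read_sdrf(text):
    """Parse tab-delimited sample annotation into one dict per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def files_by_factor(rows, factor):
    """Group data file names by the value of one experimental factor."""
    column = f"Factor Value[{factor}]"
    grouped = {}
    for row in rows:
        grouped.setdefault(row[column], []).append(row["Array Data File"])
    return grouped


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "disease")
```

Real files may repeat column headings (for instance, several protocol columns), in which case a dictionary per row would silently keep only the last one; a positional reader such as `csv.reader` would be the safer choice there.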

For all experiments, the column labels describing the sample (e.g. disease) and its characteristics (e.g. type II diabetes) are mapped to the EBI's Experimental Factor Ontology (EFO) (8) before the data are loaded into ArrayExpress. This allows consistent query results to be returned from direct submissions as well as imported data. As data are curated for Gene Expression Atlas use (9), they are reloaded into ArrayExpress with enriched annotation.

The ArrayExpress user interface allows users to search for experiments of interest by keywords and ontology terms. Ontology terms enable semantically driven searches of the experimental metadata; for instance, searching with the EFO term ‘cancer’ will also find experiments investigating ‘leukemia’ even if ‘cancer’ is not mentioned explicitly. Both US and UK spellings are supported.
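The idea behind ontology-driven search can be sketched as query expansion over a term hierarchy. The example below is a toy model, not the ArrayExpress search engine: the hierarchy fragment, the spelling table, the experiment identifiers and the function names are all hypothetical, and only the cancer/leukemia relationship comes from the text above.

```python
# Hypothetical parent -> children fragment standing in for an ontology.
CHILDREN = {
    "cancer": ["leukemia", "carcinoma"],
    "leukemia": ["acute myeloid leukemia"],
}

# Hypothetical UK -> US spelling normalization applied to the query term.
SPELLING = {"leukaemia": "leukemia", "tumour": "tumor"}


def expand(term):
    """Return the query term plus all of its descendants in the hierarchy."""
    start = SPELLING.get(term.lower(), term.lower())
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(CHILDREN.get(current, []))
    return seen


def search(term, experiments):
    """Return identifiers of experiments whose description matches any expanded term."""
    terms = expand(term)
    return sorted(
        acc for acc, description in experiments.items()
        if any(t in description.lower() for t in terms)
    )


# Invented identifiers and descriptions, for illustration only.
EXPERIMENTS = {
    "EXP-1": "Expression profiling of leukemia samples",
    "EXP-2": "Heat shock response in yeast",
    "EXP-3": "Breast cancer cell lines",
}
```

With these toy data, `search("cancer", EXPERIMENTS)` matches both EXP-1 and EXP-3, although the description of EXP-1 never mentions cancer.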

DATA GROWTH TO A MILLION ASSAYS

Over the last 2 years, the database content has grown from 13 000 experiments and 370 000 assays, to over 30 000 experiments and almost a million assays. Approximately 20% of the data were submitted directly to ArrayExpress; the rest are imported from GEO weekly.

Although HTS-based experiments account for only 6% of the entire database content, the proportion of new HTS submissions has grown substantially over the last few years, from 2% in 2009 to 6% in 2010, 7% in 2011 and 15% in 2012. Nevertheless, assays associated with HTS-based experiments still make up only 3% of all assays, reflecting the fact that HTS experiments are typically smaller than microarray-based experiments. Broken down by application, 50% of the HTS experiments used RNA-seq only and 32% ChIP-seq only; the remaining experiments either used more than one application or used DNA-seq for genotyping, copy number variation detection or methylation profiling.
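The figures in this paragraph support a quick back-of-the-envelope check. The sketch below treats the rounded percentages as exact and assumes that every non-HTS experiment is microarray-based; both are simplifications, so the results are indicative only.

```python
# Shares quoted in the text, treated as exact for this rough calculation.
hts_experiment_share = 0.06  # HTS experiments as a fraction of all experiments
hts_assay_share = 0.03       # HTS assays as a fraction of all assays


def relative_mean_size(exp_share, assay_share):
    """Mean assays per HTS experiment relative to the mean for all other experiments."""
    hts = assay_share / exp_share
    other = (1 - assay_share) / (1 - exp_share)
    return hts / other


ratio = relative_mean_size(hts_experiment_share, hts_assay_share)

# Share of new submissions that are HTS-based, per year (percent).
new_submission_share = {2009: 2, 2010: 6, 2011: 7, 2012: 15}
years = sorted(new_submission_share)
year_on_year = {
    later: new_submission_share[later] / new_submission_share[earlier]
    for earlier, later in zip(years, years[1:])
}

# Experiments that are neither RNA-seq only nor ChIP-seq only (percent).
mixed_or_dna_seq_share = 100 - 50 - 32
```

Under these assumptions an average HTS experiment has a little under half as many assays as an average non-HTS experiment, the HTS share of new submissions tripled between 2009 and 2010 and roughly doubled between 2011 and 2012, and 18% of HTS experiments fall into the mixed or DNA-seq category.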

For HTS data, ArrayExpress stores processed data and metadata describing the sample properties and the experimental design, including experimental variables and protocols, whereas raw sequence data are stored in the European Nucleotide Archive (ENA) (10) and linked from ArrayExpress. For datasets that require controlled access, the raw sequence data are stored in, and should be submitted directly to, the European Genome-phenome Archive (EGA – www.ebi.ac.uk/ega).

LINKS TO DATA ANALYSIS TOOLS

Approximately 50 GB of data are downloaded every day from ArrayExpress, by an average of 1000 different users. To simplify the interface between ArrayExpress and analytical platforms, we are now providing links to popular analytical tools such as Bioconductor and GenePattern (11), as well as developing robust internal pipelines for HTS data processing.

To facilitate loading microarray data from ArrayExpress into Bioconductor, we have pre-generated R objects for 16 250 out of 25 000 gene expression microarray experiments with raw data files available. A revised version of the Bioconductor package ArrayExpress (12) is used with default parameters. The package has been updated to support popular data formats including Affymetrix and Agilent. More than 85% of Affymetrix data in the repository have downloadable R objects. Older submissions, other technologies and experiments with only processed data available can still be loaded in R, but require user-specified settings for the package to recognize the data format, so loading must be supervised by a user. All pregenerated R objects are now available through the ArrayExpress interface and can be easily loaded into Bioconductor for downstream analysis. More R objects will be created for experiments in ArrayExpress as more data arrive, and the R package will be maintained and extended for this purpose.

Direct links are now provided to GenomeSpace (http://www.genomespace.org), a data analysis environment that makes it possible for users to move data smoothly between popular bioinformatics tools. From ArrayExpress, users can load a dataset into GenomeSpace with a single click, provided that they have a registered GenomeSpace account. Once logged in, they can use the data analysis tools available through GenomeSpace, including GenePattern, Galaxy (13) and Cytoscape (14).

For HTS data, the Bioconductor package ArrayExpressHTS (15) and the R-workbench (http://www.ebi.ac.uk/Tools/rcloud/) are used to generate binary alignment (BAM) format files (16). BAM files contain sequence alignment data and can be displayed using the Ensembl genome browser (17), through a direct link from ArrayExpress. So far approximately 1200 BAM files are available for 125 RNA-seq experiments, for 14 different species, with over half of these data studying human and a quarter mouse. The BAM file generation has been done for experiments for which: (i) the sample–data relationship information is available and contains details such as the library strategy and the experiment type (i.e. RNA-seq); (ii) the raw sequence reads (FASTQ files) are deposited in ENA and a valid link to the ENA entry is present; and (iii) the annotation for the reference genome is available in Ensembl.
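The three conditions for BAM file generation listed above can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: the record type `HtsExperiment`, its field names and the function `eligible_for_bam` are hypothetical, and restricting the experiment type to RNA-seq is our reading of condition (i), not a documented rule of the pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HtsExperiment:
    """Hypothetical summary record for one HTS experiment."""
    accession: str
    library_strategy: Optional[str]      # from the sample-data relationship information
    experiment_type: Optional[str]       # e.g. "RNA-seq"
    ena_fastq_link_valid: bool           # raw reads deposited in ENA with a valid link
    ensembl_annotation_available: bool   # reference genome annotated in Ensembl


def eligible_for_bam(exp: HtsExperiment) -> bool:
    """Apply conditions (i)-(iii) from the text to one experiment."""
    has_sample_data_details = (
        bool(exp.library_strategy) and exp.experiment_type == "RNA-seq"
    )
    return (
        has_sample_data_details
        and exp.ena_fastq_link_valid
        and exp.ensembl_annotation_available
    )


# Invented records, for illustration only.
complete = HtsExperiment("EXP-A", "RNA-Seq", "RNA-seq", True, True)
no_reads = HtsExperiment("EXP-B", "RNA-Seq", "RNA-seq", False, True)
no_details = HtsExperiment("EXP-C", None, "RNA-seq", True, True)
```

Keeping the three checks separate makes it easy to report which condition an experiment fails, which is useful when deciding whether missing metadata or a missing ENA link is the limiting factor.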

In addition, 3000 datasets from ArrayExpress have been analysed and the results of this analysis are presented through the Gene Expression Atlas (9), a separate EBI database, which helps users to (i) find out whether the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) change(s) across all the experiments or (ii) discover which genes are differentially expressed in a particular biological condition of interest.

CONTINUOUS USER INTERFACE IMPROVEMENTS

The ArrayExpress user interface has been continuously improved since the repository was established in 2003 (18). Recent additions include the sample–data relationship viewer (Figure 1), which provides an overview of all samples used in an experiment and their characteristics, the experimental variables (factors) investigated and the data files associated with each sample.

Figure 1.

Sample–data relationship viewer for Experiment E-MTAB-513. This view provides information on sample characteristics and experimental variables that are fundamental to understanding the results obtained in the experiment. Generally, each row corresponds to a sample. Columns include sample characteristics and their relationship to the resulting data files, providing a quick view of the structure of the experiment and the biological questions that the authors addressed. The last column provides links to raw sequence data files available in ENA, and BAM files that can be visualized in the Ensembl genome browser.

Other improvements include (i) improved browsing and querying of array designs; (ii) specific features for displaying HTS data; (iii) better organization of the species drop-down filter; and (iv) improved performance for retrieving and visualizing large experiments.

The ArrayExpress user documentation has recently been updated and several online courses, covering how to search, interpret and submit data to ArrayExpress, can be found on the EBI e-Learning portal, Train online (http://www.ebi.ac.uk/training/online/).

FUTURE DEVELOPMENTS

We are currently developing a new submission tool, optimized for supporting HTS data submissions; this new tool is based on the community-developed annotation tool Annotare (19) and will be released in 2013.

Like all other major EBI data resources, ArrayExpress is working toward deeper integration in the overall EBI infrastructure, in particular with the BioSample Database (20), the Gene Expression Atlas and the sequence databases ENA, EGA and Ensembl. We will continue this integration effort to ensure that our users can obtain a systems level view of the data stored at EBI by easily navigating through our resources.

FUNDING

ArrayExpress and related activities are supported by member states of the European Molecular Biology Laboratory; European Commission: ENGAGE [201413], EurocanPlatform [260791], GEUVADIS [261123], SLING [226073], SYBARIS [242220], and Gen2Phen [200754]; US National Institutes of Health (the National Human Genome Research Institute, National Institute of Biomedical Imaging and Bioengineering and the National Cancer Institute) [P41 HG003619]; National Center for Biomedical Ontology, one of the National Centers for Biomedical Computing supported by the National Human Genome Research Institute, the National Heart, Lung, and Blood Institute, and National Institutes of Health Common Fund [U54-HG004028]; National Science Foundation Award Number [1127112]. Funding for open access charge: EMBL member states.

Conflict of interest statement. None declared.

REFERENCES

1. et al. ArrayExpress update—an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 2011, pp. D1002–D1004.
2. et al. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat. Genet. 2001, pp. 365–371.
3. et al. NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res. 2011, Suppl. 1, pp. D1005–D1010.
4. et al. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 2006, p. 489.
5. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, p. R80.
6. MAGETabulator, a suite of tools to support the microarray data format MAGE-TAB. Bioinformatics 2009, pp. 279–280.
7. The DNA Data Bank of Japan launches a new resource, the DDBJ omics archive of functional genomics experiments. Nucleic Acids Res. 2012, pp. D38–D42.
8. Modeling sample variables with an experimental factor ontology. Bioinformatics 2010, pp. 1112–1118.
9. et al. Gene expression atlas update—a value-added database of microarray and sequencing-based functional genomics experiments. Nucleic Acids Res. 2012, pp. D1077–D1081.
10. et al. Petabyte-scale innovations at the European nucleotide archive. Nucleic Acids Res. 2009, pp. D19–D25.
11. GenePattern 2.0. Nat. Genet. 2006, pp. 500–501.
12. Importing ArrayExpress datasets into R/Bioconductor. Bioinformatics 2009, pp. 2092–2094.
13. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, p. R86.
14. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, pp. 2498–2504.
15. A pipeline for RNA-seq data processing and quality assessment. Bioinformatics 2011, pp. 867–869.
16. The sequence alignment/map format and SAMtools. Bioinformatics 2009, pp. 2078–2079.
17. et al. Ensembl 2012. Nucleic Acids Res. 2012, pp. D84–D90.
18. et al. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2003.
19. et al. Annotare—a tool for annotating high-throughput biomedical investigations and resulting data. Bioinformatics 2010, pp. 2470–2471.
20. The BioSample Database (BioSD) at the European Bioinformatics Institute. Nucleic Acids Res. 2012, pp. D64–D70.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com.
