Accessing curated gene expression data with gemma.R (original) (raw)

1 About Gemma
2 Package cheat sheet
3 Installation instructions
- 3.1 Bioconductor
4 Searching for datasets of interest in Gemma
5 Downloading expression data
6 Platform Annotations
7 Differential expression analyses
8 Larger queries
9 Output options
10 Session info

library(gemma.R)
library(data.table)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(SummarizedExperiment)
library(pheatmap)
library(viridis)
library(listviewer)

About Gemma

Gemma is a web site, database and a set of tools for the meta-analysis, re-use and sharing of genomics data, currently primarily targeted at the analysis of gene expression profiles. Gemma contains data from thousands of public studies, referencing thousands of published papers. Every dataset in Gemma has passed a rigorous curation process that re-annotates the expression platform at the sequence level, which allows for more consistent cross-platform comparisons and meta-analyses.

For detailed information on the curation process, read thispage or the latestpublication.

Package cheat sheet

Installation instructions

Bioconductor

You can install gemma.R throughBioconductor with the following code:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("gemma.R")

Searching for datasets of interest in Gemma

Using the get_datasets function, datasets fitting various criteria can be accessed.

# accessing all mouse and human datasets
get_datasets(taxa = c('mouse','human')) %>% 
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

# accessing human datasets with the word "bipolar"
get_datasets(query = 'bipolar',taxa = 'human') %>% 
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

# access human datasets that were annotated with the ontology term for the
# bipolar disorder
# use search_annotations function to search for available annotation terms
get_datasets(taxa ='human', 
             uris = 'http://purl.obolibrary.org/obo/MONDO_0004985') %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

get_dataset function also includes a filter parameter that allows filtering for datasets with specific properties in a more structured manner. A list of the available properties can be accessed using filter_properties

filter_properties()$dataset %>% head %>% gemma_kable()

These properties can be used together to fine tune your results

# access human datasets that has bipolar disorder as an experimental factor
get_datasets(taxa = 'human',
             filter = "experimentalDesign.experimentalFactors.factorValues.characteristics.valueUri = http://purl.obolibrary.org/obo/MONDO_0004985")  %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

# all datasets with more than 4 samples annotated for any disease
get_datasets(filter = 'bioAssayCount > 4 and allCharacteristics.category = disease') %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

# all datasets with ontology terms for Alzheimer's disease and Parkinson's disease
# this is equivalent to using the uris parameter
get_datasets(filter = 'allCharacteristics.valueUri in (http://purl.obolibrary.org/obo/MONDO_0004975,http://purl.obolibrary.org/obo/MONDO_0005180
)')  %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

Note that a single call of these functions will only return 20 results by default and a 100 results maximum, controlled by the limit argument. In order to get all available results, use get_all_pages function on the output of the function

get_datasets(taxa = 'human') %>% 
    get_all_pages() %>% 
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

See larger queries section for more details. To keep this vignette simpler we will keep using the first 20 results returned by default in examples below.

Dataset information provided by get_datasets also includes some quality information that can be used to determine the suitability of any given experiment. For instance experiment.batchEffect column will be set to -1 if Gemma’s preprocessing has detected batch effects that were unable to be resolved by batch correction. More information about these and other fields can be found at the function documentation.

get_datasets(taxa = 'human', filter = 'bioAssayCount > 4') %>% 
     filter(experiment.batchEffect !=-1) %>% 
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

Gemma uses multiple ontologies when annotating datasets and using the term URIs instead of free text to search can lead to more specific results. search_annotations function allows searching for annotation terms that might be relevant to your query.

search_annotations('bipolar') %>% 
    head %>% gemma_kable()

Downloading expression data

Upon identifying datasets of interest, more information about specific ones can be requested. In this example we will be using GSE46416 which includes samples taken from healthy donors along with manic/euthymic phase bipolar disorder patients.

The data associated with specific experiments can be accessed by using get_datasets_by_ids

get_datasets_by_ids("GSE46416") %>%
    select(experiment.shortName, experiment.name, 
           experiment.description,taxon.name) %>%
    head %>% gemma_kable

To access the expression data in a convenient form, you can useget_dataset_object. It is a high-level wrapper that combines various endpoint calls to return lists of annotatedSummarizedExperimentorExpressionSetobjects that are compatible with other Bioconductor packages or atidyverse-friendlylong form tibble for downstream analyses. These include the expression matrix along with the experimental design, and ensure the sample names match between both when transforming/subsetting data.

dat <- get_dataset_object("GSE46416",
                          type = 'se') # SummarizedExperiment is the default output type

Note that the tidy format is less memory efficient but allows easy visualization and exploration withggplot2 and the rest of thetidyverse.

To show how subsetting works, we’ll keep the “manic phase” data and thereference_subject_roles, which refers to the control samples in Gemma datasets.

# Check the levels of the disease factor
dat[[1]]$disease %>% unique()

[1] "bipolar disorder with euthymic phase"
[2] "reference subject role"              
[3] "bipolar disorder with manic phase"

# Subset patients during manic phase and controls
manic <- dat[[1]][, dat[[1]]$disease == "bipolar disorder with manic phase" | 
        dat[[1]]$disease == "reference subject role"]
manic

class: SummarizedExperiment 
dim: 18758 21 
metadata(8): title abstract ... gemmaSuitabilityScore taxon
assays(1): counts
rownames(18758): 2315430 2315554 ... 7385683 7385696
rowData names(4): Probe GeneSymbol GeneName NCBIid
colnames(21): Control, 12 Control, 1_DE50 ... Bipolar disorder patient
  manic phase, 21 Bipolar disorder patient manic phase, 16
colData names(3): factorValues block disease

Let’s take a look at sample to sample correlation in our subset.

# Get Expression matrix
manicExpr <- assay(manic, "counts")


manicExpr %>% 
    cor %>% 
    pheatmap(col =viridis(10),border_color = NA,angle_col = 45,fontsize = 7)

Figure 1: Sample to sample correlations of bipolar patients during manic phase and controls

You can also useget_dataset_processed_expressionto only get the expression matrix, get_dataset_samples to get the metadata information. The output of this function includes some additional details about a sample such as the original accession ID or whether or not it was determined to be an outlier but it can be simplified to match the design table included in the output of get_dataset_object by using make_design on the output.

get_dataset_samples('GSE46416') %>% make_design('text') %>% select(-factorValues) %>%  head %>%
    gemma_kable()

Platform Annotations

Expression data in Gemma comes with annotations for the gene each expression profile corresponds to. Using theget_platform_annotationsfunction, these annotations can be retrieved independently of the expression data, along with additional annotations such as Gene Ontology terms.

Examples:

head(get_platform_annotations('GPL96') %>% select(-GOTerms))

   ElementName GeneSymbols
        <char>      <char>
1: 202619_s_at       PLOD2
2: 219952_s_at      MCOLN1
3:   205192_at     MAP3K14
4:   212724_at        RND3
5:   207373_at      HOXD10
6: 204991_s_at         NF2
                                                  GeneNames GemmaIDs NCBIids
                                                     <char>   <char>  <char>
1:        procollagen-lysine,2-oxoglutarate 5-dioxygenase 2   127276    5352
2:                           mucolipin TRP cation channel 1   313978   57192
3:        mitogen-activated protein kinase kinase kinase 14   188876    9020
4:                                      Rho family GTPase 3     9353     390
5:                                             homeobox D10    72301    3236
6: NF2, moesin-ezrin-radixin like (MERLIN) tumor suppressor   116071    4771
        EnsemblIds
            <char>
1: ENSG00000152952
2: ENSG00000090674
3: ENSG00000006062
4: ENSG00000115963
5: ENSG00000128710
6: ENSG00000186575

head(get_platform_annotations('Generic_human_ncbiIds') %>% select(-GOTerms))

   ElementName  GeneSymbols
         <int>       <char>
1:       55236         UBA6
2:       79664         ICE2
3:   100126270     FMR1-AS1
4:   105373684    LINC01818
5:   124900245 LOC124900245
6:   102725051 LOC102725051
                                               GeneNames GemmaIDs   NCBIids
                                                  <char>    <int>     <int>
1:           ubiquitin like modifier activating enzyme 6   295849     55236
2: interactor of little elongation complex ELL subunit 2   336840     79664
3:                                  FMR1 antisense RNA 1  3157248 100126270
4:           long intergenic non-protein coding RNA 1818  9235895 105373684
5:                           small nucleolar RNA SNORA40 10578422 124900245
6:                          uncharacterized LOC102725051  9008981 102725051

If you are interested in a particular gene, you can see which platforms include it usingget_gene_probes. Note that functions to search gene work best with unambigious identifiers rather than symbols.

# lists genes in gemma matching the symbol or identifier
get_genes('Eno2') %>% gemma_kable()

# ncbi id for human ENO2
probes <- get_gene_probes(2026)

# remove the description for brevity of output
head(probes[,.SD, .SDcols = !colnames(probes) %in% c('mapping.Description','platform.Description')]) %>%
    gemma_kable()

Differential expression analyses

Gemma contains precomputed differential expression analyses for most of its datasets. Analyses can involve more than one factor, such as “sex” as well as “disease”. Some datasets contain more than one analysis to account for different factors and their interactions. The results are stored as resultSets, each corresponding to one factor (or their interaction). You can access them usingget_differential_expression_values. From here on, we can explore and visualize the data to find the most differentially-expressed genes

Note that get_differential_expression_values can return multiple differentials per study if a study has multiple factors to contrast. Since GSE46416 only has one extracting the first element of the returned list is all we need.

dif_exp <- get_differential_expression_values('GSE46416')
dif_exp[[1]] %>% head %>% gemma_kable()

By default the columns names of the output correspond to contrast IDs. To see what conditions these IDs correspond to we can either use get_dataset_differential_expression_analysesto get the metadata about differentials of a given dataset, or set readableContrasts argument of get_differential_expression_values to TRUE. The former approach is usually better for a large scale systematic analysis while the latter is easier to read in an interactive session.

get_dataset_differential_expression_analyses function returns structured metadata about the differentials.

contrasts <- get_dataset_differential_expression_analyses('GSE46416')
contrasts %>% gemma_kable()

contrast.ID column corresponds to the column names in the output of get_differential_expression_values while result.ID corresponds to the name of the differential in the output object. Using them together will let one to access differentially expressed gene counts for each condition contrast

# using result.ID and contrast.ID of the output above, we can access specific
# results. Note that one study may have multiple contrast objects
seq_len(nrow(contrasts)) %>% sapply(function(i){
    result_set = dif_exp[[as.character(contrasts[i,]$result.ID)]]
    p_values = result_set[[glue::glue("contrast_{contrasts[i,]$contrast.ID}_pvalue")]]
    
    # multiple testing correction
    sum(p.adjust(p_values,method = 'BH') < 0.05)
}) -> dif_exp_genes

contrasts <- data.table(result.ID = contrasts$result.ID,
                        contrast.id = contrasts$contrast.ID,
                        baseline.factorValue = contrasts$baseline.factors,
                        experimental.factorValue = contrasts$experimental.factors,
                        n_diff = dif_exp_genes)

contrasts %>% gemma_kable()

contrasts$baseline.factors

NULL

contrasts$experimental.factors

NULL

Alternatively we, since we are only looking at one dataset and one contrast manually, we can simply use readableContrasts.

de <- get_differential_expression_values("GSE46416",readableContrasts = TRUE)[[1]]
de %>% head %>% gemma_kable

# Classify probes for plotting
de$diffexpr <- "No"
de$diffexpr[de$`contrast_bipolar disorder with manic phase_logFoldChange` > 1.0 & 
        de$`contrast_bipolar disorder with manic phase_pvalue` < 0.05] <- "Up"
de$diffexpr[de$`contrast_bipolar disorder with manic phase_logFoldChange` < -1.0 & 
        de$`contrast_bipolar disorder with manic phase_pvalue` < 0.05] <- "Down"

# Upregulated probes
filter(de, diffexpr == "Up") %>%
    arrange(`contrast_bipolar disorder with manic phase_pvalue`) %>%
    select(Probe, GeneSymbol, `contrast_bipolar disorder with manic phase_pvalue`, 
        `contrast_bipolar disorder with manic phase_logFoldChange`) %>%
    head(10) %>% gemma_kable()

# Downregulated probes
filter(de, diffexpr == "Down") %>%
    arrange(`contrast_bipolar disorder with manic phase_pvalue`) %>%
    select(Probe, GeneSymbol, `contrast_bipolar disorder with manic phase_pvalue`, 
        `contrast_bipolar disorder with manic phase_logFoldChange`) %>%
    head(10) %>% gemma_kable()

# Add gene symbols as labels to DE genes
de$delabel <- ""
de$delabel[de$diffexpr != "No"] <- de$GeneSymbol[de$diffexpr != "No"]

# Volcano plot for bipolar patients vs controls
ggplot(
    data = de,
    aes(
        x = `contrast_bipolar disorder with manic phase_logFoldChange`,
        y = -log10(`contrast_bipolar disorder with manic phase_pvalue`),
        color = diffexpr,
        label = delabel
    )
) +
    geom_point() +
    geom_hline(yintercept = -log10(0.05), col = "gray45", linetype = "dashed") +
    geom_vline(xintercept = c(-1.0, 1.0), col = "gray45", linetype = "dashed") +
    labs(x = "log2(FoldChange)", y = "-log10(p-value)") +
    scale_color_manual(values = c("blue", "black", "red")) +
    geom_text_repel(show.legend = FALSE) +
    theme_minimal()

Figure 2: Differentially-expressed genes in bipolar patients during manic phase versus controls

Larger queries

To query large amounts of data, the API has a pagination system which uses the limit and offset parameters. To avoid overloading the server, calls are limited to a maximum of 100 entries, so the offset allows you to get the next batch of entries in the next call(s).

To simplify the process of accessing all available data, gemma.R includes theget_all_pages function which can use the output from one page to make all the follow up requests

get_platforms_by_ids() %>% 
    get_all_pages() %>% head %>% gemma_kable()

Alternative way to access all pages is to do so manually. To see how many available results are there, you can look at the attributes of the output objects where additional information from the API response is appended.

platform_count = attributes(get_platforms_by_ids(limit = 1))$totalElements
print(platform_count)

[1] 649

After which you can use offset to access all available platforms.

lapply(seq(0,platform_count,100), function(offset){
    get_platforms_by_ids(limit = 100, offset = offset) %>%
        select(platform.ID, platform.shortName, taxon.name)
}) %>% do.call(rbind,.) %>% 
    head %>% gemma_kable()

Many endpoints only support a single identifier:

get_dataset_annotations(c("GSE35974", "GSE46416"))

Error: Please specify one valid identifier for dataset.

In these cases, you will have to loop over all the identifiers you wish to query and send separate requests.

lapply(c("GSE35974", "GSE12649"), function(dataset) {
    get_dataset_annotations(dataset) %>% 
        mutate(experiment.shortName = dataset) %>%
        select(experiment.shortName, class.name, term.name)
}) %>% do.call(rbind,.) %>% gemma_kable()

Output options

Raw data

By default, Gemma API does some parsing on the raw API results to make it easier to work with inside of R. In the process, it drops some typically unused values. If you wish to fetch everything, useraw = TRUE. Instead of a data table, you’ll usually be served a list that represents the underlying JSON response.

get_gene_locations("DYRK1A") %>% gemma_kable()

get_gene_locations("DYRK1A", raw = TRUE) %>% jsonedit()

File outputs

Sometimes, you may wish to save results to a file for future inspection. You can do this simply by providing a filename to the file parameter. The extension for this file will be one of three options:

.json, if you requested results with raw=TRUE
.csv if the results have no nested data tables
.rds otherwise

You can also specify whether or not the new fetched results are allowed to overwrite an existing file by specifying the overwrite = TRUEparameter.

Memoise data

To speed up results, you can remember past results so future queries can proceed virtually instantly. This is enabled through thememoise package. To enable memoisation, simply set memoised = TRUE in the function call whenever you want to refer to the cache, both to save data for future calls and use the saved data for repeated calls. By default this will create a cache in your local filesystem.

If you wish to change where the cache is stored or change the default behaviour to make sure you always use the cache without relying on the memoised argument, use gemma_memoised.

# use memoisation by default using the default cache
gemma_memoised(TRUE)

# set an altnernate cache path
gemma_memoised(TRUE,"path/to/cache_directory")

# cache in memory of the R session
# this cache will not be preserved between sessions
gemma_memoised(TRUE,"cache_in_memory")

If you’re done with your fetching and want to ensure no space is being used for cached results, or if you just want to ensure you’re getting up-to-date data from Gemma, you can clear the cache usingforget_gemma_memoised.

Changing defaults

We’ve seen how to change raw = TRUE, overwrite = TRUE andmemoised = TRUE in individual function calls. It’s possible that you want to always use the functions these ways without specifying the option every time. You can do this by simply changing the default, which is visible in the function definition. See below for examples.

options(gemma.memoised = TRUE) # always refer to cache. this is redundant with gemma_memoised function
options(gemma.overwrite = TRUE) # always overwrite when saving files
options(gemma.raw = TRUE) # always receive results as-is from Gemma

Session info

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] listviewer_4.0.0            viridis_0.6.5              
 [3] viridisLite_0.4.2           pheatmap_1.0.13            
 [5] SummarizedExperiment_1.39.1 Biobase_2.69.0             
 [7] GenomicRanges_1.61.1        Seqinfo_0.99.1             
 [9] IRanges_2.43.0              S4Vectors_0.47.0           
[11] BiocGenerics_0.55.0         generics_0.1.4             
[13] MatrixGenerics_1.21.0       matrixStats_1.5.0          
[15] ggrepel_0.9.6               ggplot2_3.5.2              
[17] dplyr_1.1.4                 data.table_1.17.6          
[19] gemma.R_3.5.4               BiocStyle_2.37.0           

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    farver_2.1.2        fastmap_1.2.0      
 [4] digest_0.6.37       timechange_0.3.0    lifecycle_1.0.4    
 [7] magrittr_2.0.3      compiler_4.5.1      rlang_1.1.6        
[10] sass_0.4.10         tools_4.5.1         yaml_2.3.10        
[13] knitr_1.50          labeling_0.4.3      S4Arrays_1.9.1     
[16] htmlwidgets_1.6.4   bit_4.6.0           curl_6.4.0         
[19] DelayedArray_0.35.2 xml2_1.3.8          RColorBrewer_1.1-3 
[22] abind_1.4-8         withr_3.0.2         purrr_1.0.4        
[25] grid_4.5.1          scales_1.4.0        tinytex_0.57       
[28] dichromat_2.0-0.1   cli_3.6.5           rmarkdown_2.29     
[31] crayon_1.5.3        rstudioapi_0.17.1   httr_1.4.7         
[34] cachem_1.1.0        stringr_1.5.1       assertthat_0.2.1   
[37] BiocManager_1.30.26 XVector_0.49.0      vctrs_0.6.5        
[40] Matrix_1.7-3        jsonlite_2.0.0      bookdown_0.43      
[43] bit64_4.6.0-1       magick_2.8.7        systemfonts_1.2.3  
[46] jquerylib_0.1.4     glue_1.8.0          lubridate_1.9.4    
[49] stringi_1.8.7       gtable_0.3.6        tibble_3.3.0       
[52] pillar_1.10.2       rappdirs_0.3.3      htmltools_0.5.8.1  
[55] R6_2.6.1            textshaping_1.0.1   evaluate_1.0.4     
[58] kableExtra_1.4.0    lattice_0.22-7      memoise_2.0.1      
[61] bslib_0.9.0         Rcpp_1.0.14         svglite_2.2.1      
[64] gridExtra_2.3       SparseArray_1.9.0   xfun_0.52          
[67] pkgconfig_2.0.3