cBioPortalData: User Guide (original) (raw)

Contents

Installation

library(cBioPortalData)
library(AnVIL)

Introduction

The cBioPortal for Cancer Genomics website is a great resource for interactive exploration of study datasets. However, it does not easily allow the analyst to obtain and further analyze the data.

We’ve developed the cBioPortalData package to fill this need to programmatically access the data resources available on the cBioPortal.

The cBioPortalData package provides an R interface for accessing the cBioPortal study data within the Bioconductor ecosystem.

It downloads study data from the cBioPortal API (the full API specification can be found here https://cbioportal.org/api) and uses Bioconductor infrastructure to cache and represent the data.

We demonstrate common use cases of cBioPortalData and curatedTCGADataduring Bioconductor conferenceworkshops.

We use the MultiAssayExperiment (Ramos et al. (2017)) package to integrate, represent, and coordinate multiple experiments for the studies available in the cBioPortal. This package in conjunction with curatedTCGAData give access to a large trove of publicly available bioinformatic data. Please see our JCO Clinical Cancer Informatics publication here (Ramos et al. (2020)).

Citations

Our free and open source project depends on citations for funding. When usingcBioPortalData, please cite the following publications:

citation("MultiAssayExperiment")
citation("cBioPortalData")

Overview

Data Structures

Data are provided as a single MultiAssayExperiment per study. TheMultiAssayExperiment representation usually contains SummarizedExperimentobjects for expression data and RaggedExperiment objects for mutation and CNV-type data. RaggedExperiment is a data class for representing ‘ragged’ genomic location data, meaning that the measurements per sample vary.

For more information, please see the RaggedExperiment andSummarizedExperiment vignettes.

Identifying available studies

As we work through the data, there are some datasest that cannot be represented as MultiAssayExperiment objects. This can be due to a number of reasons such as the way the data is handled, presence of mis-matched identifiers, invalid data types, etc. To see what datasets are currently not building, we can look refer to getStudies() with the buildReport = TRUE argument.

cbio <- cBioPortal()
studies <- getStudies(cbio, buildReport = TRUE)
head(studies)
## # A tibble: 6 × 15
##   name           description publicStudy pmid  citation groups status importDate
##   <chr>          <chr>       <lgl>       <chr> <chr>    <chr>   <int> <chr>     
## 1 Acute Lymphob… Comprehens… TRUE        2573… Anderss… "PUBL…      0 2024-12-0…
## 2 Hypodiploid A… Whole geno… TRUE        2333… Holmfel… ""          0 2024-12-0…
## 3 Adenoid Cysti… Targeted S… TRUE        2441… Ross et… "ACYC…      0 2024-12-0…
## 4 Adenoid Cysti… Whole-geno… TRUE        2686… Rettig … "ACYC…      0 2024-12-0…
## 5 Adenoid Cysti… WGS of 21 … TRUE        2663… Mitani … "ACYC…      0 2024-12-0…
## 6 Adenoid Cysti… Whole-geno… TRUE        2682… Drier e… "ACYC"      0 2024-12-0…
## # ℹ 7 more variables: allSampleCount <int>, readPermission <lgl>,
## #   studyId <chr>, cancerTypeId <chr>, referenceGenome <chr>, api_build <lgl>,
## #   pack_build <lgl>

The last two columns will show the availability of each studyId for either download method (pack_build for cBioDataPack and api_build forcBioPortalData).

Choosing download method

There are two main user-facing functions for downloading data from the cBioPortal API.

Two main functions

cBioDataPack: Obtain Study Data as Zipped Tarballs

This function will access the packaged data from and return an integrative MultiAssayExperiment representation.

## Use ask=FALSE for non-interactive use
laml <- cBioDataPack("laml_tcga", ask = FALSE)
laml
## A MultiAssayExperiment object of 12 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 12:
##  [1] cna: SummarizedExperiment with 24776 rows and 191 columns
##  [2] cna_hg19.seg: RaggedExperiment with 13571 rows and 191 columns
##  [3] linear_cna: SummarizedExperiment with 24776 rows and 191 columns
##  [4] methylation_hm27: SummarizedExperiment with 10968 rows and 194 columns
##  [5] methylation_hm450: SummarizedExperiment with 10968 rows and 194 columns
##  [6] mrna_seq_rpkm: SummarizedExperiment with 19720 rows and 179 columns
##  [7] mrna_seq_rpkm_zscores_ref_all_samples: SummarizedExperiment with 19720 rows and 179 columns
##  [8] mrna_seq_rpkm_zscores_ref_diploid_samples: SummarizedExperiment with 19719 rows and 179 columns
##  [9] mrna_seq_v2_rsem: SummarizedExperiment with 20531 rows and 173 columns
##  [10] mrna_seq_v2_rsem_zscores_ref_all_samples: SummarizedExperiment with 20531 rows and 173 columns
##  [11] mrna_seq_v2_rsem_zscores_ref_diploid_samples: SummarizedExperiment with 20440 rows and 173 columns
##  [12] mutations: RaggedExperiment with 2584 rows and 197 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

cBioPortalData: Obtain data from the cBioPortal API

This function provides a more flexible and granular way to request aMultiAssayExperiment object from a study ID, molecular profile, gene panel, sample list.

acc <- cBioPortalData(api = cbio, by = "hugoGeneSymbol", studyId = "acc_tcga",
    genePanelId = "IMPACT341",
    molecularProfileIds = c("acc_tcga_rppa", "acc_tcga_linear_CNA")
)
## harmonizing input:
##   removing 1 colData rownames not in sampleMap 'primary'
acc
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] acc_tcga_linear_CNA: SummarizedExperiment with 339 rows and 90 columns
##  [2] acc_tcga_rppa: SummarizedExperiment with 57 rows and 46 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

Note. To avoid overloading the API service, the API was designed to only query a part of the study data. Therefore, the user is required to enter either a set of genes of interest or a gene panel identifier.

Considerations

Note that cBioPortalData and cBioDataPack obtain data diligently curated by the cBio Portal data team. The original data and curation lies in thehttps://github.com/cBioPortal/cBioPortal GitHub repository. However, despite the curation efforts there may be some inconsistencies in identifiers in the data. This causes our software to not work as intended though we have made efforts to represent all the data from both API and tarball formats.

Build prompts

You will also get a message for studyIds whose data has not been fully integrated into a MultiAssayExperiment.

## Our testing shows that '%s' is not currently building.
##    Use 'downloadStudy()' to manually obtain the data.
##    Proceed anyway? [y/n]: y

Manual downloads

For this reason, we have also provided the downloadStudy, untarStudy, andloadStudy functions to allow researchers to simply download the data and potentially, manually curate it. Generally, we advise researchers to report inconsistencies in the data in the cBioPortal data repository.

Clearing the cache

cBioDataPack

In cases where a download is interrupted, the user may experience a corrupt cache. The user can clear the cache for a particular study by using theremoveCache function. Note that this function only works for data downloaded through the cBioDataPack function.

removeCache("laml_tcga")

cBioPortalData

For users who wish to clear the entire cBioPortalData cache, it is recommended that they use:

unlink("~/.cache/cBioPortalData/")

Example Analysis: Kaplan-Meier Plot

We can use information in the colData to draw a K-M plot with a few variables from the colData slot of the MultiAssayExperiment. First, we load the necessary packages:

library(survival)
library(survminer)

We can check the data to lookout for any issues.

table(colData(laml)$OS_STATUS)
## 
##   0:LIVING 1:DECEASED 
##         67        133
class(colData(laml)$OS_MONTHS)
## [1] "character"

Now, we clean the data a bit to ensure that our variables are of the right type for the subsequent survival model fit.

collaml <- colData(laml)
collaml[collaml$OS_MONTHS == "[Not Available]", "OS_MONTHS"] <- NA
collaml$OS_MONTHS <- as.numeric(collaml$OS_MONTHS)
colData(laml) <- collaml

We specify a simple survival model using SEX as a covariate and we draw the K-M plot.

fit <- survfit(
    Surv(OS_MONTHS, as.numeric(substr(OS_STATUS, 1, 1))) ~ SEX,
    data = colData(laml)
)
ggsurvplot(fit, data = colData(laml), risk.table = TRUE)

Data update requests

If you are interested in a particular study dataset that is not currently building, please open an issue at our GitHub repository and we will do our best to resolve the issues with the code base. Data issues can be opened at the cBioPortal data repository.

We appreciate your feedback!

sessionInfo

Click to see session info

sessionInfo()
## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] survminer_0.5.0             ggpubr_0.6.0               
##  [3] ggplot2_3.5.2               survival_3.8-3             
##  [5] cBioPortalData_2.20.0       MultiAssayExperiment_1.34.0
##  [7] SummarizedExperiment_1.38.0 Biobase_2.68.0             
##  [9] GenomicRanges_1.60.0        GenomeInfoDb_1.44.0        
## [11] IRanges_2.42.0              S4Vectors_0.46.0           
## [13] BiocGenerics_0.54.0         generics_0.1.3             
## [15] MatrixGenerics_1.20.0       matrixStats_1.5.0          
## [17] AnVIL_1.20.0                AnVILBase_1.2.0            
## [19] dplyr_1.1.4                 BiocStyle_2.36.0           
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_2.0.0            magrittr_2.0.3           
##   [3] magick_2.8.6              GenomicFeatures_1.60.0   
##   [5] farver_2.1.2              rmarkdown_2.29           
##   [7] BiocIO_1.18.0             vctrs_0.6.5              
##   [9] memoise_2.0.1             Rsamtools_2.24.0         
##  [11] RCurl_1.98-1.17           tinytex_0.57             
##  [13] rstatix_0.7.2             htmltools_0.5.8.1        
##  [15] S4Arrays_1.8.0            BiocBaseUtils_1.10.0     
##  [17] lambda.r_1.2.4            curl_6.2.2               
##  [19] broom_1.0.8               Formula_1.2-5            
##  [21] SparseArray_1.8.0         sass_0.4.10              
##  [23] bslib_0.9.0               htmlwidgets_1.6.4        
##  [25] httr2_1.1.2               zoo_1.8-14               
##  [27] futile.options_1.0.1      cachem_1.1.0             
##  [29] commonmark_1.9.5          GenomicAlignments_1.44.0 
##  [31] mime_0.13                 lifecycle_1.0.4          
##  [33] pkgconfig_2.0.3           Matrix_1.7-3             
##  [35] R6_2.6.1                  fastmap_1.2.0            
##  [37] GenomeInfoDbData_1.2.14   shiny_1.10.0             
##  [39] digest_0.6.37             colorspace_2.1-1         
##  [41] RaggedExperiment_1.32.0   AnnotationDbi_1.70.0     
##  [43] ps_1.9.1                  RSQLite_2.3.9            
##  [45] labeling_0.4.3            filelock_1.0.3           
##  [47] RTCGAToolbox_2.38.0       km.ci_0.5-6              
##  [49] RJSONIO_2.0.0             httr_1.4.7               
##  [51] abind_1.4-8               compiler_4.5.0           
##  [53] bit64_4.6.0-1             withr_3.0.2              
##  [55] backports_1.5.0           BiocParallel_1.42.0      
##  [57] carData_3.0-5             DBI_1.2.3                
##  [59] ggsignif_0.6.4            rappdirs_0.3.3           
##  [61] DelayedArray_0.34.0       rjson_0.2.23             
##  [63] tools_4.5.0               chromote_0.5.0           
##  [65] httpuv_1.6.15             glue_1.8.0               
##  [67] restfulr_0.0.15           promises_1.3.2           
##  [69] gridtext_0.1.5            grid_4.5.0               
##  [71] gtable_0.3.6              KMsurv_0.1-5             
##  [73] tzdb_0.5.0                tidyr_1.3.1              
##  [75] websocket_1.4.4           data.table_1.17.0        
##  [77] hms_1.1.3                 car_3.1-3                
##  [79] xml2_1.3.8                utf8_1.2.4               
##  [81] XVector_0.48.0            markdown_2.0             
##  [83] pillar_1.10.2             stringr_1.5.1            
##  [85] later_1.4.2               splines_4.5.0            
##  [87] ggtext_0.1.2              BiocFileCache_2.16.0     
##  [89] lattice_0.22-7            rtracklayer_1.68.0       
##  [91] bit_4.6.0                 tidyselect_1.2.1         
##  [93] Biostrings_2.76.0         miniUI_0.1.1.1           
##  [95] knitr_1.50                gridExtra_2.3            
##  [97] litedown_0.7              bookdown_0.43            
##  [99] futile.logger_1.4.3       xfun_0.52                
## [101] DT_0.33                   stringi_1.8.7            
## [103] UCSC.utils_1.4.0          yaml_2.3.10              
## [105] evaluate_1.0.3            codetools_0.2-20         
## [107] tibble_3.2.1              BiocManager_1.30.25      
## [109] cli_3.6.4                 xtable_1.8-4             
## [111] munsell_0.5.1             processx_3.8.6           
## [113] jquerylib_0.1.4           survMisc_0.5.6           
## [115] Rcpp_1.0.14               GenomicDataCommons_1.32.0
## [117] dbplyr_2.5.0              png_0.1-8                
## [119] XML_3.99-0.18             rapiclient_0.1.8         
## [121] parallel_4.5.0            TCGAutils_1.28.0         
## [123] readr_2.1.5               blob_1.2.4               
## [125] bitops_1.0-9              scales_1.3.0             
## [127] purrr_1.0.4               crayon_1.5.3             
## [129] rlang_1.1.6               KEGGREST_1.48.0          
## [131] rvest_1.0.4               formatR_1.14

References

Ramos, Marcel, Ludwig Geistlinger, Sehyun Oh, Lucas Schiffer, Rimsha Azhar, Hanish Kodali, Ino de Bruijn, et al. 2020. “Multiomic Integration of Public Oncology Databases in Bioconductor.” JCO Clinical Cancer Informatics 1 (4): 958–71. https://doi.org/10.1200/CCI.19.00119.

Ramos, Marcel, Lucas Schiffer, Angela Re, Rimsha Azhar, Azfar Basunia, Carmen Rodriguez, Tiffany Chan, et al. 2017. “Software for the Integration of Multiomics Experiments in Bioconductor.” Cancer Research 77 (21): e39–e42. https://doi.org/10.1158/0008-5472.CAN-17-0344.