Introduction_1_load_metadata
Contents
- 1 Load library
- 2 Retrieve biosamples information
- 3 Retrieve individuals information
- 4 Retrieve analyses information
- 5 Retrieve the number of results for a specific filter
- 6 Query from multiple data resources
- 7 Visualization of survival data
- 8 Session Info
Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing and utilizing metadata for samples or their corresponding individuals within the Progenetix database.
If your focus lies in cancer cell lines, you can access data from cancercelllines.org by setting the domain parameter to "cancercelllines.org" in the pgxLoader function. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.
Load library
library(pgxRpi)
pgxLoader function
This function loads various data from the Progenetix database via the Beacon v2 API with some extensions (BeaconPlus).
The parameters of this function used in this tutorial:
- type: A string specifying the output data type. "individuals", "biosamples", "analyses", "filtering_terms", and "counts" are used in this tutorial.
- filters: Identifiers used in public repositories, bio-ontology terms, or custom terms, such as c("NCIT:C7376", "pgx:icdom-85003"). When multiple filters are used, they are combined using AND logic when type is "individuals", "biosamples", or "analyses", and OR logic when type is "counts".
- individual_id: Identifiers used in the query database for identifying individuals.
- biosample_id: Identifiers used in the query database for identifying biosamples.
- codematches: A logical value determining whether to exclude samples from child concepts of specified filters in the ontology tree. If TRUE, only samples exactly matching the specified filters will be included. Do not use this parameter when filters include ontology-irrelevant filters such as pubmed or cohort identifiers. Default is FALSE.
- filter_pattern: Optional string pattern to match against the label field of available filters. Only used when type is "filtering_terms". Default is NULL, which includes all filters.
- limit: Integer specifying the number of returned profiles. Default is 0 (return all).
- skip: Integer specifying the number of skipped profiles. E.g. if skip = 2, limit = 500, the first 2*500=1000 profiles are skipped and the next 500 profiles are returned. Default is 0 (no skip).
- use_https: A logical value indicating whether to use the HTTPS protocol. If TRUE, the domain will be prefixed with "https://"; otherwise, "http://" will be used. Default is TRUE.
- dataset: A string specifying the dataset to query from the Beacon response. Default is NULL, which includes results from all datasets.
- domain: A string specifying the domain of the query data resource. Default is "progenetix.org".
- entry_point: A string specifying the entry point of the Beacon v2 API. Default is "beacon", resulting in the endpoint being "https://progenetix.org/beacon".
- num_cores: An integer specifying the number of cores to use for parallel processing during Beacon v2 phenotypic/meta-data queries from multiple domains. Default is 1.
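The way use_https, domain, and entry_point combine into an endpoint can be illustrated with a small sketch. The helper build_endpoint below is hypothetical and not part of pgxRpi; it only mirrors the parameter descriptions above, and "localhost:8080" is a made-up local instance.

```r
# Hypothetical helper (not part of pgxRpi): assembles an endpoint URL
# from the three parameters as described in the list above.
build_endpoint <- function(domain = "progenetix.org",
                           entry_point = "beacon",
                           use_https = TRUE) {
  scheme <- if (use_https) "https://" else "http://"
  paste0(scheme, domain, "/", entry_point)
}

build_endpoint()                                   # defaults
build_endpoint(domain = "cancercelllines.org")     # cell line resource
build_endpoint(domain = "localhost:8080",          # made-up local instance
               use_https = FALSE)
```

With the defaults this reproduces the endpoint named above, "https://progenetix.org/beacon".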
Retrieve biosamples information
Search by filters
Filters are a significant enhancement to the Beacon query API, providing a mechanism for specifying rules to select records based on their field values. To learn more about how to utilize filters in Progenetix, please refer to the documentation.
The following example demonstrates how to access all available filters in Progenetix:
all_filters <- pgxLoader(type="filtering_terms")
head(all_filters)
#> id label type scopes
#> 1 EDAM:operation_3227 EDAM:operation_3227 ontologyTerm NA
#> 2 EDAM:operation_3961 EDAM:operation_3961 ontologyTerm NA
#> 3 labelSeg-based calibration labelSeg-based calibration alphanumeric NA
#> 4 NCIT:C28076 Disease Grade Qualifier ontologyTerm NA
#> 5 NCIT:C18000 Histologic Grade ontologyTerm NA
#> 6 NCIT:C14158 High Grade ontologyTerm NA
If you’re interested in filters related to a specific disease or phenotype, you can use the filter_pattern argument to narrow down the list. For example, to search for filters related to retinoblastoma:
query_filter <- pgxLoader(type="filtering_terms",filter_pattern="retinoblastoma")
query_filter
#> id label type scopes
#> 1 NCIT:C7541 Retinoblastoma ontologyTerm NA
#> 2 NCIT:C8713 Bilateral Retinoblastoma ontologyTerm NA
#> 3 NCIT:C8714 Unilateral Retinoblastoma ontologyTerm NA
#> 4 pgx:icdom-95103 Retinoblastoma, NOS ontologyTerm NA
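The effect of filter_pattern can be approximated offline on a table of filtering terms. The sketch below uses a toy copy of a few rows from the outputs above and assumes a case-insensitive substring match on the label column; the matching rule actually used inside pgxLoader may differ.

```r
# Toy copy of a few rows from the filtering-terms outputs shown above.
filters_toy <- data.frame(
  id = c("NCIT:C7541", "NCIT:C8713", "NCIT:C14158", "pgx:icdom-95103"),
  label = c("Retinoblastoma", "Bilateral Retinoblastoma",
            "High Grade", "Retinoblastoma, NOS"),
  stringsAsFactors = FALSE
)

# Assumption: case-insensitive substring match on the label column.
hits <- filters_toy[grepl("retinoblastoma", filters_toy$label,
                          ignore.case = TRUE), ]
hits$id
```

The same grepl call works on the full all_filters table if you prefer to download the terms once and search them locally.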
To retrieve biosamples associated with a specific disease, use appropriate filter terms. In this example, we use an NCIt code corresponding to retinoblastoma (NCIT:C7541):
biosamples <- pgxLoader(type="biosamples", filters = "NCIT:C7541")
# data looks like this
biosamples[1:5,]
#> biosample_id individual_id biosample_status_id biosample_status_label
#> 1 pgxbs-kftvhcom pgxind-kftx39mg EFO:0009656 neoplastic sample
#> 2 pgxbs-kftvh1nj pgxind-kftx2vuj EFO:0009656 neoplastic sample
#> 3 pgxbs-kftvhaw7 pgxind-kftx37e4 EFO:0009656 neoplastic sample
#> 4 pgxbs-m3io5l34 pgxind-m3io5l34 EFO:0009656 neoplastic sample
#> 5 pgxbs-kftvl6rz pgxind-kftx7ga8 EFO:0009656 neoplastic sample
#> sample_origin_type_id sample_origin_type_label histological_diagnosis_id
#> 1 OBI:0001479 specimen from organism NCIT:C7541
#> 2 OBI:0001479 specimen from organism NCIT:C7541
#> 3 OBI:0001479 specimen from organism NCIT:C8714
#> 4 OBI:0001479 specimen from organism NCIT:C7541
#> 5 OBI:0001479 specimen from organism NCIT:C7541
#> histological_diagnosis_label sampled_tissue_id sampled_tissue_label
#> 1 Retinoblastoma UBERON:0000966 retina
#> 2 Retinoblastoma UBERON:0000966 retina
#> 3 Unilateral Retinoblastoma UBERON:0000966 retina
#> 4 Retinoblastoma UBERON:0000966 retina
#> 5 Retinoblastoma UBERON:0000966 retina
#> pathological_stage_id pathological_stage_label tnm tumor_grade age_iso info
#> 1 NCIT:C92207 Stage Unknown NA NA <NA> NA
#> 2 NCIT:C92207 Stage Unknown NA NA <NA> NA
#> 3 NCIT:C92207 Stage Unknown NA NA <NA> NA
#> 4 NCIT:C92207 Stage Unknown NA NA <NA> NA
#> 5 NCIT:C92207 Stage Unknown NA NA P69Y NA
#> notes icdo_morphology_id icdo_morphology_label
#> 1 Retinoblastoma [76.2 months] pgx:icdom-95103 Retinoblastoma, NOS
#> 2 Retinoblastoma pgx:icdom-95103 Retinoblastoma, NOS
#> 3 retinoblastoma [unilateral] pgx:icdom-95103 Retinoblastoma, NOS
#> 4 Retinoblastoma pgx:icdom-95103 Retinoblastoma, NOS
#> 5 Retinoblastoma pgx:icdom-95103 Retinoblastoma, NOS
#> icdo_topography_id icdo_topography_label
#> 1 pgx:icdot-C69.2 Retina
#> 2 pgx:icdot-C69.2 Retina
#> 3 pgx:icdot-C69.2 Retina
#> 4 pgx:icdot-C69.2 Retina
#> 5 pgx:icdot-C69.2 Retina
#> external_references_description
#> 1 Herzog S, Lohmann DR et al. (2001): Marked differences in unilateral isolated retinoblastomas...
#> 2 Zielinski B, Gratias S et al. (2005): Detection of chromosomal imbalances in retinoblastoma...
#> 3 Chen D, Gallie BL et al. (2001): Minimal regions of chromosomal imbalance in...
#> 4 Francis et al. Cancers 2021,Targeted sequencing of 83 Retinoblastoma tumor-normal pairs via MSK-IMPACT. Genomic data provided is limited to somatic alterations.
#> 5 Zehir A, Benayed R et al. (2017): Mutational landscape of metastatic cancer revealed...,Targeted sequencing of 10,000 clinical cases using the MSK-IMPACT assay
#> external_references_id
#> 1 pubmed:11281459
#> 2 pubmed:15834944
#> 3 pubmed:11520568
#> 4 pubmed:33466343,cbioportal:rbl_mskcc_2020
#> 5 pubmed:28481359,cbioportal:msk_impact_2017
#> external_references_reference
#> 1 https://europepmc.org/article/MED/11281459
#> 2 https://europepmc.org/article/MED/15834944
#> 3 https://europepmc.org/article/MED/11520568
#> 4 https://europepmc.org/article/MED/33466343,https://www.cbioportal.org/study/summary?id=rbl_mskcc_2020
#> 5 https://europepmc.org/article/MED/28481359,https://www.cbioportal.org/study/summary?id=msk_impact_2017
#> analysis_info
#> 1 NA
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> cohorts_id
#> 1 pgx:cohort-2021progenetix
#> 2 pgx:cohort-2021progenetix
#> 3 pgx:cohort-2021progenetix
#> 4 cbioportal:rbl_mskcc_2020,PMID:33466343
#> 5 cbioportal:msk_impact_2017,cbioportal:rbl_mskcc_2020,PMID:28481359,PMID:33466343
#> cohorts_label
#> 1 Version at Progenetix Update 2021
#> 2 Version at Progenetix Update 2021
#> 3 Version at Progenetix Update 2021
#> 4 Targeted sequencing of 83 Retinoblastoma tumor-normal pairs via MSK-IMPACT. Genomic data provided is limited to somatic alterations.,Francis et al. Cancers 2021
#> 5 Targeted sequencing of 10,000 clinical cases using the MSK-IMPACT assay,Targeted sequencing of 83 Retinoblastoma tumor-normal pairs via MSK-IMPACT. Genomic data provided is limited to somatic alterations.,Zehir A, Benayed R et al. (2017): Mutational landscape of metastatic cancer revealed...,Francis et al. Cancers 2021
#> geo_location_geometry_coordinates geo_location_geometry_type
#> 1 8.77,50.8 Point
#> 2 8.69,49.41 Point
#> 3 -79.42,43.7 Point
#> 4 <NA> <NA>
#> 5 -74.01,40.71 Point
#> geo_location_properties_iso3166alpha3 geo_location_properties_city
#> 1 DEU Marburg
#> 2 DEU Heidelberg
#> 3 CAN Toronto
#> 4 <NA> <NA>
#> 5 USA New York City
#> geo_location_properties_country geo_location_properties_label
#> 1 Germany Marburg, Germany
#> 2 Germany Heidelberg, Germany
#> 3 Canada Toronto, Canada
#> 4 <NA> <NA>
#> 5 United States of America New York City, United States of America
#> geo_location_properties_latitude geo_location_properties_longitude
#> 1 50.8 8.77
#> 2 49.41 8.69
#> 3 43.7 -79.42
#> 4 <NA> <NA>
#> 5 40.71 -74.01
#> geo_location_properties_precision geo_location_type
#> 1 city Feature
#> 2 city Feature
#> 3 city Feature
#> 4 <NA> <NA>
#> 5 city Feature
#> updated geo_location analysis_info_experiment_id
#> 1 2020-09-10 17:44:41.343000 NA <NA>
#> 2 2020-09-10 17:44:29.163000 NA <NA>
#> 3 2020-09-10 17:44:39.367000 NA <NA>
#> 4 2025-04-07T11:07:20.658121 NA <NA>
#> 5 2024-11-19T03:38:23.957596 NA <NA>
#> analysis_info_platform_id analysis_info_series_id
#> 1 <NA> <NA>
#> 2 <NA> <NA>
#> 3 <NA> <NA>
#> 4 <NA> <NA>
#> 5 <NA> <NA>
The data contains many columns representing different aspects of sample information.
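Some columns, such as external_references_id, hold several comma-separated values in one cell. A minimal base R sketch for expanding such a column into one row per value is given below; it uses a toy data frame with two rows copied from the output above. The same approach is not safe for the description and label columns, whose values can contain commas themselves.

```r
# Toy data frame with two rows copied from the biosamples output above.
refs <- data.frame(
  biosample_id = c("pgxbs-kftvhcom", "pgxbs-m3io5l34"),
  external_references_id = c("pubmed:11281459",
                             "pubmed:33466343,cbioportal:rbl_mskcc_2020"),
  stringsAsFactors = FALSE
)

# Split each cell on commas; the result is one character vector per row.
id_list <- strsplit(refs$external_references_id, ",", fixed = TRUE)

# Expand to long format: one row per (biosample, reference) pair.
refs_long <- data.frame(
  biosample_id = rep(refs$biosample_id, lengths(id_list)),
  reference_id = unlist(id_list),
  stringsAsFactors = FALSE
)
refs_long
```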
Search by biosample id and individual id
In the Beacon v2 specification, biosample id and individual id are unique identifiers for biosamples and their corresponding individuals, respectively. These identifiers can be obtained through metadata searches using filters as described above or by querying the Progenetix search interface, which provides access to the IDs used in the Progenetix database.
biosamples_2 <- pgxLoader(type="biosamples", biosample_id = "pgxbs-kftvki7h",individual_id = "pgxind-kftx6ltu")
biosamples_2
#> biosample_id individual_id biosample_status_id biosample_status_label
#> 1 pgxbs-kftvki7h pgxind-kftx6ltd EFO:0009656 neoplastic sample
#> 2 pgxbs-kftvki7v pgxind-kftx6ltu EFO:0009656 neoplastic sample
#> sample_origin_type_id sample_origin_type_label histological_diagnosis_id
#> 1 OBI:0001479 specimen from organism NCIT:C3512
#> 2 OBI:0001479 specimen from organism NCIT:C3512
#> histological_diagnosis_label sampled_tissue_id sampled_tissue_label
#> 1 Lung Adenocarcinoma UBERON:0002048 lung
#> 2 Lung Adenocarcinoma UBERON:0002048 lung
#> pathological_stage_id pathological_stage_label
#> 1 NCIT:C27976 Stage Ib
#> 2 NCIT:C27977 Stage IIIa
#> tnm_id
#> 1 NCIT:C48706,NCIT:C48714,NCIT:C48724
#> 2 NCIT:C48706,NCIT:C48714,NCIT:C48728
#> tnm_label tumor_grade age_iso info
#> 1 N1 Stage Finding,N3 Stage Finding,T2 Stage Finding NA P56Y NA
#> 2 N1 Stage Finding,N3 Stage Finding,T3 Stage Finding NA P75Y NA
#> notes icdo_morphology_id icdo_morphology_label
#> 1 adenocarcinoma [lung] pgx:icdom-81403 Adenocarcinoma, NOS
#> 2 adenocarcinoma [lung] pgx:icdom-81403 Adenocarcinoma, NOS
#> icdo_topography_id icdo_topography_label
#> 1 pgx:icdot-C34.9 Lung, NOS
#> 2 pgx:icdot-C34.9 Lung, NOS
#> external_references_description
#> 1 Kang JU, Koo SH et al. (2009): Identification of novel candidate target genes,...
#> 2 Kang JU, Koo SH et al. (2009): Identification of novel candidate target genes,...
#> external_references_id external_references_reference
#> 1 pubmed:19607727 https://europepmc.org/article/MED/19607727
#> 2 pubmed:19607727 https://europepmc.org/article/MED/19607727
#> analysis_info_experiment_id analysis_info_platform_id analysis_info_series_id
#> 1 geo:GSM417055 geo:GPL8690 geo:GSE16597
#> 2 geo:GSM417063 geo:GPL8690 geo:GSE16597
#> cohorts_id
#> 1 pgx:cohort-arraymap,pgx:cohort-2021progenetix,pgx:cohort-carriocordo2021heterogeneity
#> 2 pgx:cohort-arraymap,pgx:cohort-2021progenetix
#> cohorts_label
#> 1 arrayMap collection,Version at Progenetix Update 2021,Carrio-Cordo and Baudis - Genomic Heterogeneity in Cancer Types (2021)
#> 2 arrayMap collection,Version at Progenetix Update 2021
#> geo_location_geometry_coordinates geo_location_geometry_type
#> 1 -74.01,40.71 Point
#> 2 -74.01,40.71 Point
#> geo_location_properties_iso3166alpha3 geo_location_properties_city
#> 1 USA New York City
#> 2 USA New York City
#> geo_location_properties_country geo_location_properties_label
#> 1 United States of America New York City, United States
#> 2 United States of America New York City, United States
#> geo_location_properties_latitude geo_location_properties_longitude
#> 1 40.71 -74.01
#> 2 40.71 -74.01
#> geo_location_properties_precision geo_location_type
#> 1 city Feature
#> 2 city Feature
#> updated
#> 1 2020-09-10 17:46:45.105000
#> 2 2020-09-10 17:46:45.115000
It’s also possible to query by a combination of filters, biosample id, and individual id.
Access a subset of samples
By default, pgxLoader returns all related samples (limit = 0). You can access a subset of them via the limit and skip parameters. For example, to access the first 10 samples, set limit = 10 and skip = 0.
biosamples_3 <- pgxLoader(type="biosamples", filters = "NCIT:C7541",skip=0, limit = 10)
# Dimension: Number of samples * features
print(dim(biosamples))
#> [1] 256 42
print(dim(biosamples_3))
#> [1] 10 42
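The interaction of skip and limit is page-based: skip counts pages of size limit, not individual samples. The hypothetical helper below (not part of pgxRpi) computes which positions in the full result set a given pair would cover, following the rule stated in the parameter description.

```r
# Hypothetical helper (not part of pgxRpi): 1-based positions in the
# full result set covered by a skip/limit pair, where skip counts
# pages of size limit.
page_positions <- function(skip, limit) {
  stopifnot(limit > 0, skip >= 0)
  first <- skip * limit + 1
  last <- (skip + 1) * limit
  c(first = first, last = last)
}

page_positions(skip = 0, limit = 10)   # positions 1 to 10
page_positions(skip = 2, limit = 500)  # positions 1001 to 1500
```

The last page may contain fewer than limit samples when the total is not a multiple of limit.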
Parameter codematches use
Some filters, such as NCIt codes, are hierarchical. As a result, retrieved samples may be annotated not only with the specified filter but also with any of its child terms.
unique(biosamples$histological_diagnosis_id)
#> [1] "NCIT:C7541" "NCIT:C8714" "NCIT:C8713"
Setting codematches to TRUE makes the function return only biosamples that exactly match the specified filter, excluding child terms.
biosamples_4 <- pgxLoader(type="biosamples", filters = "NCIT:C7541",codematches = TRUE)
unique(biosamples_4$histological_diagnosis_id)
#> [1] "NCIT:C7541"
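The difference between the two modes can be shown conceptually on toy data. In the sketch below, the sample ids are made up, the hierarchy is written out by hand based on the codes returned above, and select_samples is a hypothetical stand-in for the server-side selection, not the package's implementation.

```r
# Toy sample table using the diagnosis codes seen in the output above.
samples <- data.frame(
  biosample_id = c("s1", "s2", "s3", "s4"),   # made-up ids
  histological_diagnosis_id = c("NCIT:C7541", "NCIT:C8714",
                                "NCIT:C8713", "NCIT:C7541"),
  stringsAsFactors = FALSE
)

# Hand-written hierarchy: both codes are child terms of NCIT:C7541.
children <- list("NCIT:C7541" = c("NCIT:C8713", "NCIT:C8714"))

# Hypothetical stand-in for the selection step.
select_samples <- function(df, term, codematches = FALSE) {
  accepted <- if (codematches) term else c(term, children[[term]])
  df[df$histological_diagnosis_id %in% accepted, ]
}

nrow(select_samples(samples, "NCIT:C7541"))                      # 4
nrow(select_samples(samples, "NCIT:C7541", codematches = TRUE))  # 2
```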
Retrieve individuals information
If you want to query details of the individuals (e.g. clinical data) from whom the samples of interest come, set the type parameter to "individuals" and follow the same steps as above.
individuals <- pgxLoader(type="individuals",individual_id = "pgxind-kftx26ml",filters="NCIT:C7541")
# data looks like this
tail(individuals,2)
#> individual_id sex_id sex_label age_iso histological_diagnosis_id
#> 254 pgxind-m3io5l4s NCIT:C20197 male <NA> NCIT:C7541
#> 255 pgxind-kftx26ml NCIT:C20197 male <NA> NCIT:C3493
#> histological_diagnosis_label followup_time followup_state_id
#> 254 Retinoblastoma <NA> EFO:0030039
#> 255 Squamous Cell Lung Carcinoma <NA> EFO:0030039
#> followup_state_label diseases_notes info
#> 254 no followup status <NA> <NA>
#> 255 no followup status <NA> PGX_IND_AdSqLu-bjo-01
#> updated info_legacy_ids info_provenance
#> 254 2024-11-19T03:41:19.984636 <NA> <NA>
#> 255 2018-09-26 09:50:52.800000 <NA> <NA>
Retrieve analyses information
If you want to know more details about data analyses, set the type parameter to "analyses". The other steps are the same, except that the codematches parameter is not available because analyses data do not include filter information, even though they can be searched by filters.
analyses <- pgxLoader(type="analyses",biosample_id = c("pgxbs-kftvik5i","pgxbs-kftvik96"))
analyses
#> analysis_id biosample_id individual_id calling_pipeline
#> 1 pgxcs-kftw8qme pgxbs-kftvik5i pgxind-kftx4963 progenetix
#> 2 pgxcs-kftw8rrh pgxbs-kftvik96 pgxind-kftx49ao progenetix
#> analysis_info_experiment_id analysis_info_experiment_title
#> 1 geo:GSM115217 GSM115217
#> 2 geo:GSM120460 GSM120460
#> analysis_info_operation_id analysis_info_operation_label
#> 1 EDAM:operation_3961 Copy number variation detection
#> 2 EDAM:operation_3961 Copy number variation detection
#> analysis_info_series_id platform_id platform_label
#> 1 geo:GSE5051 geo:GPL2826 VUMC MACF human 30K oligo v31
#> 2 geo:GSE5359 geo:GPL3960 MPIMG Homo sapiens 44K aCGH3_MPIMG_BERLIN
#> updated
#> 1 2025-01-16T09:11:21.724116
#> 2 2025-01-16T09:11:22.408194
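Because analyses carry a biosample_id column, they can be joined to biosample metadata with base R's merge. The sketch below is a toy illustration: the analysis rows are copied from the output above, while the tissue labels in biosamples_toy are placeholders, not real annotations.

```r
# Two analysis rows copied from the output above (subset of columns).
analyses_toy <- data.frame(
  analysis_id = c("pgxcs-kftw8qme", "pgxcs-kftw8rrh"),
  biosample_id = c("pgxbs-kftvik5i", "pgxbs-kftvik96"),
  platform_id = c("geo:GPL2826", "geo:GPL3960"),
  stringsAsFactors = FALSE
)

# Placeholder biosample metadata; the tissue labels are NOT real values.
biosamples_toy <- data.frame(
  biosample_id = c("pgxbs-kftvik5i", "pgxbs-kftvik96"),
  sampled_tissue_label = c("placeholder tissue A", "placeholder tissue B"),
  stringsAsFactors = FALSE
)

# Inner join on the shared biosample_id column.
annotated <- merge(analyses_toy, biosamples_toy, by = "biosample_id")
annotated
```

With real data, the second table would come from a pgxLoader call with type = "biosamples" for the same biosample ids.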
Retrieve the number of results for a specific filter
To retrieve the number of results for a specific filter in Progenetix, set the type parameter to "counts". You can query different Beacon v2 resources by setting the domain and entry_point parameters accordingly.
pgxLoader(type="counts",filters = "NCIT:C7541")
#> filter entity count
#> 1 NCIT:C7541 individuals 254
#> 2 NCIT:C7541 biosamples 256
#> 3 NCIT:C7541 analyses 276
Query from multiple data resources
You can query data from multiple resources via the Beacon v2 API by setting the domain and entry_point parameters accordingly. To speed up the process, use the num_cores parameter to enable parallel processing across different domains. For resources that only support http (e.g., local or internal network instances), set use_https = FALSE to avoid connection issues.
record_counts <- pgxLoader(type="counts",filters = "NCIT:C9245",domain=c("progenetix.org","cancercelllines.org"), entry_point=c("beacon","beacon"))
record_counts
#> $progenetix.org
#> filter entity count
#> 1 NCIT:C9245 individuals 4227
#> 2 NCIT:C9245 biosamples 4645
#> 3 NCIT:C9245 analyses 5638
#>
#> $cancercelllines.org
#> filter entity count
#> 1 NCIT:C9245 individuals 42
#> 2 NCIT:C9245 biosamples 1005
#> 3 NCIT:C9245 analyses 929
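The printed result suggests a named list with one data frame per domain. Assuming that structure, the tables can be stacked into a single data frame with a domain column for easier comparison. The sketch below rebuilds the list by hand from the output above so that it runs without a network connection.

```r
# Hand-built copy of the result printed above (assumed structure:
# a named list with one data frame per domain).
record_counts_toy <- list(
  "progenetix.org" = data.frame(
    filter = "NCIT:C9245",
    entity = c("individuals", "biosamples", "analyses"),
    count = c(4227, 4645, 5638)
  ),
  "cancercelllines.org" = data.frame(
    filter = "NCIT:C9245",
    entity = c("individuals", "biosamples", "analyses"),
    count = c(42, 1005, 929)
  )
)

# Add the list name as a column, then stack all tables.
combined <- do.call(rbind, lapply(names(record_counts_toy), function(d) {
  cbind(domain = d, record_counts_toy[[d]])
}))
rownames(combined) <- NULL
combined
```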
Visualization of survival data
Suppose you want to investigate whether there are survival differences associated with a particular disease, for example, between younger and older patients, or based on other variables. You can query and visualize the relevant information using the pgxMetaplot function.
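If you want to form age groups yourself before plotting, the age_iso column shown in the outputs above stores ages as ISO 8601 durations such as "P69Y". The sketch below converts them to years with a hypothetical helper that is not part of pgxRpi; the value "P5Y6M" and the 65-year cut-off are illustrative assumptions.

```r
# Hypothetical helper (not part of pgxRpi): converts ISO 8601 durations
# such as "P69Y" or "P5Y6M" to years. Only year and month components
# are read; days and any time part are ignored.
iso_age_to_years <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  part <- function(unit) {
    m <- regmatches(x[ok], regexec(paste0("([0-9]+)", unit), x[ok]))
    vapply(m, function(p) if (length(p) == 2) as.numeric(p[2]) else 0,
           numeric(1))
  }
  out[ok] <- part("Y") + part("M") / 12
  out
}

# First three values appear in the outputs above; the last is made up.
age_iso <- c("P69Y", "P56Y", "P75Y", NA, "P5Y6M")
years <- iso_age_to_years(age_iso)

# Illustrative cut-off of 65 years; missing ages stay NA.
age_group <- ifelse(years >= 65, "older", "younger")
data.frame(age_iso, years, age_group)
```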
Session Info
#> R version 4.5.0 RC (2025-04-04 r88126)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] future_1.49.0 pgxRpi_1.4.3 BiocStyle_2.36.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0
#> [4] ggplot2_3.5.2 rstatix_0.7.2 lattice_0.22-7
#> [7] vctrs_0.6.5 tools_4.5.0 generics_0.1.4
#> [10] curl_6.2.2 parallel_4.5.0 tibble_3.2.1
#> [13] pkgconfig_2.0.3 Matrix_1.7-3 data.table_1.17.2
#> [16] RColorBrewer_1.1-3 lifecycle_1.0.4 compiler_4.5.0
#> [19] farver_2.1.2 tinytex_0.57 codetools_0.2-20
#> [22] carData_3.0-5 htmltools_0.5.8.1 sass_0.4.10
#> [25] yaml_2.3.10 Formula_1.2-5 pillar_1.10.2
#> [28] car_3.1-3 ggpubr_0.6.0 jquerylib_0.1.4
#> [31] tidyr_1.3.1 cachem_1.1.0 survminer_0.5.0
#> [34] magick_2.8.6 abind_1.4-8 km.ci_0.5-6
#> [37] parallelly_1.44.0 tidyselect_1.2.1 digest_0.6.37
#> [40] dplyr_1.1.4 purrr_1.0.4 bookdown_0.43
#> [43] listenv_0.9.1 labeling_0.4.3 splines_4.5.0
#> [46] fastmap_1.2.0 grid_4.5.0 cli_3.6.5
#> [49] magrittr_2.0.3 dichromat_2.0-0.1 survival_3.8-3
#> [52] broom_1.0.8 future.apply_1.11.3 withr_3.0.2
#> [55] scales_1.4.0 backports_1.5.0 lubridate_1.9.4
#> [58] timechange_0.3.0 rmarkdown_2.29 httr_1.4.7
#> [61] globals_0.18.0 gridExtra_2.3 ggsignif_0.6.4
#> [64] zoo_1.8-14 evaluate_1.0.3 knitr_1.50
#> [67] KMsurv_0.1-6 survMisc_0.5.6 rlang_1.1.6
#> [70] Rcpp_1.0.14 xtable_1.8-4 glue_1.8.0
#> [73] BiocManager_1.30.25 attempt_0.3.1 jsonlite_2.0.0
#> [76] R6_2.6.1