Introduction_4_process_pgxseg (original) (raw)

Contents

Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. Progenetix uses the “.pgxseg” data format to store variant data, which encompasses CNV (Copy Number Variation) and SNV (Single Nucleotide Variant), as well as the metadata of associated samples. This vignette describes how to work with local “.pgxseg” files using this package. For more details about the “.pgxseg” file format, please refer to the the documentation.

Load library

library(pgxRpi)
library(GenomicRanges) # for pgxfreq object

pgxSegprocess function

This function extracts segment variants, CNV frequency, and metadata from local “pgxseg” files. Additionally, it supports survival data visualization if survival data is available within the file.

The parameters of this function used in this tutorial:

Calculate CNV frequency

The CNV frequency is calculated from segments of samples with the same group id. The group id is specified in group_id parameter. More details about CNV frequency see the vignette Introduction_3_access_cnv_frequency.

# Default is "group_id" in metadata
frequency <- pgxSegprocess(file=file_name,return_frequency = TRUE) 
# Use different ids for grouping
frequency_2 <- pgxSegprocess(file=file_name,return_frequency = TRUE, 
                             group_id ='icdo_morphology_id')
frequency
#> GRangesList object of length 4:
#> $`pgx:icdot-C16.9`
#> GRanges object with 3106 ranges and 2 metadata columns:
#>          seqnames            ranges strand | gain_frequency loss_frequency
#>             <Rle>         <IRanges>  <Rle> |      <numeric>      <numeric>
#>      [1]        1          0-400000      * |              0        28.5714
#>      [2]        1    400000-1400000      * |              0        28.5714
#>      [3]        1   1400000-2400000      * |              0        28.5714
#>      [4]        1   2400000-3400000      * |              0        28.5714
#>      [5]        1   3400000-4400000      * |              0        28.5714
#>      ...      ...               ...    ... .            ...            ...
#>   [3102]        Y 52400000-53400000      * |              0              0
#>   [3103]        Y 53400000-54400000      * |              0              0
#>   [3104]        Y 54400000-55400000      * |              0              0
#>   [3105]        Y 55400000-56400000      * |              0              0
#>   [3106]        Y 56400000-57227415      * |              0              0
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths
#> 
#> ...
#> <3 more elements>

The returned object is same as the CNV frequency object with “pgxfreq” format returned by pgxLoader function. The CNV frequency is calculated from groups which exist in both metadata and segment data. It is noted that not all groups in metadata must exist in segment data (e.g. some samples don’t have CNV calls).

head(frequency[["pgx:icdot-C16.9"]])
#> GRanges object with 6 ranges and 2 metadata columns:
#>       seqnames          ranges strand | gain_frequency loss_frequency
#>          <Rle>       <IRanges>  <Rle> |      <numeric>      <numeric>
#>   [1]        1        0-400000      * |              0        28.5714
#>   [2]        1  400000-1400000      * |              0        28.5714
#>   [3]        1 1400000-2400000      * |              0        28.5714
#>   [4]        1 2400000-3400000      * |              0        28.5714
#>   [5]        1 3400000-4400000      * |              0        28.5714
#>   [6]        1 4400000-5400000      * |              0        28.5714
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

The associated metadata in CNV frequency objects looks like this

mcols(frequency)
#> DataFrame with 4 rows and 3 columns
#>                        group_id            group_label sample_count
#>                     <character>            <character>    <integer>
#> pgx:icdot-C16.9 pgx:icdot-C16.9                stomach           28
#> pgx:icdot-C73.9 pgx:icdot-C73.9          thyroid gland            6
#> pgx:icdot-C50.9 pgx:icdot-C50.9                 breast            2
#> pgx:icdot-C49.9 pgx:icdot-C49.9 connective and soft ..            7
mcols(frequency_2)
#> DataFrame with 3 rows and 3 columns
#>                        group_id            group_label sample_count
#>                     <character>            <character>    <integer>
#> pgx:icdom-89363 pgx:icdom-89363 Gastrointestinal str..           35
#> pgx:icdom-83353 pgx:icdom-83353   Follicular carcinoma            6
#> pgx:icdom-80753 pgx:icdom-80753 Squamous cell carcin..            2

You can visualize the CNV frequency of the interesting group using pgxFreqplot function. For more details on this function, see the vignette Introduction_3_access_cnv_frequency.

pgxFreqplot(frequency, filters="pgx:icdot-C16.9")
pgxFreqplot(frequency, filters="pgx:icdot-C16.9",chrom = c(1,8,14), layout = c(3,1))
pgxFreqplot(frequency, filters=c("pgx:icdot-C16.9","pgx:icdot-C73.9"),circos = TRUE)

Session Info

#> R version 4.5.0 RC (2025-04-04 r88126)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.38.0 Biobase_2.68.0             
#>  [3] GenomicRanges_1.60.0        GenomeInfoDb_1.44.0        
#>  [5] IRanges_2.42.0              S4Vectors_0.46.0           
#>  [7] BiocGenerics_0.54.0         generics_0.1.3             
#>  [9] MatrixGenerics_1.20.0       matrixStats_1.5.0          
#> [11] future_1.40.0               pgxRpi_1.4.0               
#> [13] BiocStyle_2.36.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1        dplyr_1.1.4             farver_2.1.2           
#>  [4] fastmap_1.2.0           digest_0.6.37           timechange_0.3.0       
#>  [7] lifecycle_1.0.4         survival_3.8-3          magrittr_2.0.3         
#> [10] compiler_4.5.0          rlang_1.1.6             sass_0.4.10            
#> [13] tools_4.5.0             yaml_2.3.10             data.table_1.17.0      
#> [16] knitr_1.50              ggsignif_0.6.4          S4Arrays_1.8.0         
#> [19] labeling_0.4.3          curl_6.2.2              DelayedArray_0.34.0    
#> [22] abind_1.4-8             withr_3.0.2             purrr_1.0.4            
#> [25] grid_4.5.0              ggpubr_0.6.0            xtable_1.8-4           
#> [28] colorspace_2.1-1        ggplot2_3.5.2           globals_0.16.3         
#> [31] scales_1.3.0            tinytex_0.57            cli_3.6.4              
#> [34] crayon_1.5.3            rmarkdown_2.29          future.apply_1.11.3    
#> [37] km.ci_0.5-6             httr_1.4.7              survminer_0.5.0        
#> [40] cachem_1.1.0            splines_4.5.0           parallel_4.5.0         
#> [43] BiocManager_1.30.25     XVector_0.48.0          survMisc_0.5.6         
#> [46] vctrs_0.6.5             Matrix_1.7-3            jsonlite_2.0.0         
#> [49] carData_3.0-5           bookdown_0.43           car_3.1-3              
#> [52] rstatix_0.7.2           Formula_1.2-5           listenv_0.9.1          
#> [55] magick_2.8.6            attempt_0.3.1           tidyr_1.3.1            
#> [58] jquerylib_0.1.4         glue_1.8.0              parallelly_1.43.0      
#> [61] codetools_0.2-20        shape_1.4.6.1           lubridate_1.9.4        
#> [64] gtable_0.3.6            UCSC.utils_1.4.0        munsell_0.5.1          
#> [67] tibble_3.2.1            pillar_1.10.2           htmltools_0.5.8.1      
#> [70] circlize_0.4.16         GenomeInfoDbData_1.2.14 R6_2.6.1               
#> [73] KMsurv_0.1-5            evaluate_1.0.3          lattice_0.22-7         
#> [76] backports_1.5.0         ggsci_3.2.0             broom_1.0.8            
#> [79] bslib_0.9.0             Rcpp_1.0.14             SparseArray_1.8.0      
#> [82] gridExtra_2.3           xfun_0.52               GlobalOptions_0.1.2    
#> [85] zoo_1.8-14              pkgconfig_2.0.3