03 Working with UK Biobank summary statistics

Overview

In this document we illustrate some approaches to working with UK Biobank summary statistics. Be sure that the python module ukbb_pan_ancestry has been installed where reticulate can find it. (We don’t use basilisk as of 12/24/2022 because of issues in the Terra Spark cluster.)

If loading BiocHail fails because the package is not available, see the Installing BiocHail section near the end of this document.

Initialization and description

Standalone

We have produced a representation of summary statistics for a sample of 9888 loci. This 5GB resource can be retrieved and cached with the following code:

library(BiocHail)
hl = hail_init()
# Unzipping the 5GB resource can take about a minute. If a persistent
# MatrixTable is already at the location given by the environment variable
# HAIL_UKBB_SUMSTAT_10K_PATH, this step goes quickly.
ss = get_ukbb_sumstat_10kloci_mt(hl)
ss$count()   # numbers of rows and columns
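Before running the retrieval step, it can be useful to check whether the environment variable HAIL_UKBB_SUMSTAT_10K_PATH mentioned in the comments above already points at a persistent table. The following is a minimal sketch in plain Python. The helper name persistent_table_path is hypothetical, and the check (that the path is an existing directory, since a Hail MatrixTable is written as a directory) is our assumption, not a description of what BiocHail does internally.

```python
import os

# Variable name taken from the code comments above.
ENV_VAR = "HAIL_UKBB_SUMSTAT_10K_PATH"


def persistent_table_path(environ=os.environ):
    """Return the configured path if it is an existing directory, else None.

    Hypothetical helper: it only inspects the environment and the file
    system, and does not open or validate the MatrixTable itself.
    """
    path = environ.get(ENV_VAR, "")
    if path and os.path.isdir(path):
        return path
    return None


if __name__ == "__main__":
    found = persistent_table_path()
    if found is None:
        print(ENV_VAR, "is unset or does not point at a directory")
    else:
        print("persistent table expected at", found)
```

If the helper returns None, the retrieval step should be expected to take the longer path that unzips the resource.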

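To get a description of available content, we need a python chunk that calls the describe() method of the table. In a python chunk, the R object ss is accessible as r.ss.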

Terra

Here’s a basic description of the summary stats table, with code that works in terra.bio:

hl = bare_hail()
hl$init(idempotent=TRUE, spark_conf=list(
  'spark.hadoop.fs.gs.requester.pays.mode'= 'CUSTOM',
  'spark.hadoop.fs.gs.requester.pays.buckets'= 'ukb-diverse-pops-public',
  'spark.hadoop.fs.gs.requester.pays.project.id'= Sys.getenv("GOOGLE_PROJECT")))
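For readers who work in Python directly, the same requester-pays settings can be sketched as follows. The three configuration keys and the bucket name are copied from the R call above; the function names requester_pays_conf and init_hail_for_terra are hypothetical, and hail is imported lazily so that the configuration helper can be inspected without a Hail installation.

```python
import os


def requester_pays_conf(project_id, buckets="ukb-diverse-pops-public"):
    """Build the Spark settings used above for requester-pays access."""
    return {
        "spark.hadoop.fs.gs.requester.pays.mode": "CUSTOM",
        "spark.hadoop.fs.gs.requester.pays.buckets": buckets,
        "spark.hadoop.fs.gs.requester.pays.project.id": project_id,
    }


def init_hail_for_terra():
    """Initialize Hail with the requester-pays settings (hypothetical helper)."""
    import hail as hl  # lazy import: only needed when Hail is actually started

    # Mirrors Sys.getenv("GOOGLE_PROJECT"), which yields "" when unset.
    project = os.environ.get("GOOGLE_PROJECT", "")
    hl.init(idempotent=True, spark_conf=requester_pays_conf(project))
    return hl
```

Keeping the settings in a small function makes it easy to print and verify them before a session is started, since Spark configuration cannot be changed once Hail is initialized.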

We need to use a python chunk to get the output, using gs:// storage references.

r.hl.read_matrix_table('gs://ukb-diverse-pops-public/sumstats_release/results_full.mt').describe()

Exploring the subset

Summary statistics

The summary statistics themselves reside in the entries of the MatrixTable. Entries can be expensive to collect, so filtering methods beyond random sampling must be mastered. But here is a basic view.

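# Assumption: sss is a small random sample of the rows of ss. Its definition
# is not shown above, and the sampling fraction used here is illustrative.
sss = ss$sample_rows(0.01, seed = 1234L)
sse = sss$entries()$collect()   # one list element per (row, column) entry
length(sse)
names(sse[[1]])
sse1 = sse[[1]]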
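As an alternative to random sampling, rows can be filtered inside Hail so that only a few entries are brought back. The following python sketch assumes that the MatrixTable has a row field named locus; that assumption, the function name take_entries_on_contig, and the contig label used in the usage note are ours and should be checked against the output of describe().

```python
def take_entries_on_contig(mt, contig, n=5):
    """Return the first n entries whose row lies on the given contig.

    mt is assumed to be a hail MatrixTable with a row field `locus`.
    Filtering happens in Hail, so only n entries are materialized locally.
    """
    on_contig = mt.filter_rows(mt.locus.contig == contig)
    return on_contig.entries().take(n)
```

In a python chunk of this document, a call might look like take_entries_on_contig(r.ss, "1"), where the appropriate contig label ("1" or "chr1") depends on the reference genome of the table.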

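The summary_stats component holds the association p-values. These may be stored on a transformed scale (possibly log10); the field descriptions should be checked before interpreting them.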

length(sse1$summary_stats)
names(sse1$summary_stats[[1]])
sse1$summary_stats[[1]]$Pvalue
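Since the scale of the stored Pvalue is not settled here, the following sketch shows how a value would be mapped back to an ordinary p-value under two common conventions. Both function names are hypothetical, and which convention (if either) applies to this resource is an assumption to be confirmed from the table description.

```python
import math


def pvalue_from_neglog10(x):
    """Back-transform x = -log10(p) to p."""
    return 10.0 ** (-x)


def pvalue_from_ln(x):
    """Back-transform x = ln(p) to p."""
    return math.exp(x)


if __name__ == "__main__":
    stored = 2.0  # illustrative value, not taken from the data
    print("if -log10 scale:", pvalue_from_neglog10(stored))
    print("if ln scale:", pvalue_from_ln(-stored))
```

Very strong associations can underflow to zero when back-transformed to double precision, which is a common reason for keeping p-values on a log scale.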

Exercises

Infrastructure

Substantive

Installing BiocHail

BiocHail should be installed as follows:

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("BiocHail")

SessionInfo

## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.5.2    BiocHail_1.9.0   BiocStyle_2.37.0
## 
## loaded via a namespace (and not attached):
##  [1] rappdirs_0.3.3       sass_0.4.10          generics_0.1.4      
##  [4] RSQLite_2.3.11       lattice_0.22-7       digest_0.6.37       
##  [7] magrittr_2.0.3       RColorBrewer_1.1-3   evaluate_1.0.3      
## [10] grid_4.5.0           bookdown_0.43        fastmap_1.2.0       
## [13] blob_1.2.4           jsonlite_2.0.0       Matrix_1.7-3        
## [16] DBI_1.2.3            tinytex_0.57         BiocManager_1.30.25 
## [19] scales_1.4.0         httr2_1.1.2          jquerylib_0.1.4     
## [22] cli_3.6.5            rlang_1.1.6          dbplyr_2.5.0        
## [25] bit64_4.6.0-1        withr_3.0.2          cachem_1.1.0        
## [28] yaml_2.3.10          tools_4.5.0          dir.expiry_1.17.0   
## [31] parallel_4.5.0       memoise_2.0.1        dplyr_1.1.4         
## [34] filelock_1.0.3       basilisk_1.21.2      BiocGenerics_0.55.0 
## [37] curl_6.2.2           reticulate_1.42.0    vctrs_0.6.5         
## [40] R6_2.6.1             png_0.1-8            magick_2.8.6        
## [43] BiocFileCache_2.99.3 lifecycle_1.0.4      bit_4.6.0           
## [46] pkgconfig_2.0.3      gtable_0.3.6         pillar_1.10.2       
## [49] bslib_0.9.0          glue_1.8.0           Rcpp_1.0.14         
## [52] xfun_0.52            tibble_3.2.1         tidyselect_1.2.1    
## [55] dichromat_2.0-0.1    knitr_1.50           farver_2.1.2        
## [58] htmltools_0.5.8.1    rmarkdown_2.29       compiler_4.5.0