VCFArray: DelayedArray objects with on-disk/remote VCF backend (original) (raw)
Introduction
VCFArray is a Bioconductor package that represents VCF files as objects derived from the DelayedArray package and DelayedArray
class. It converts data entries from VCF file into aDelayedArray
-derived data structure. The backend VCF file could either be saved on-disk locally or remote as online resources. Data entries that could be extracted include the fixed data fields (REF, ALT, QUAL, FILTER), information field (e.g., AA, AF…), and the individual format field (e.g., GT, DP…). The array data generated from fixed/information fields are one-dimensional VCFArray
, with the dimension being the length of the variants. The array data generated from individual FORMAT
field are always returned with the first dimension being variants
and the second dimension beingsamples
. This feature is consistent with the assay data saved inSummarizedExperiment
, and makes the VCFArray
package interoperable with other established Bioconductor data infrastructure.
Installation
- Download the package.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("VCFArray")
The development version is also available to download from Github.
BiocManager::install("Bioconductor/VCFArray")
- Load the package into R session.
library(VCFArray)
VCFArray constructor
To construct a VCFArray
object, 4 arguments are needed: file
,vindex
and name
, and pfix
. The file
argument could take either a character string (VCF file name), or VcfFile
object, or aRangedVcfStack
object. name
argument must be specified to indicate which data entry we want to extract from the input file. It’s case-sensitive, and must be consistent with the names from VCF header file. vindex
argument will only be used to indicate the file path of the index file if it does not exist. pfix
is used to spefify the category that the name
field belongs to. NOTE that the pfix
needs to be provided specifically when there are same name
in multiple categories, otherwise, error will return.
The vcfFields()
method takes the VCF file path, VcfFile
object orRangedVcfStack
object as input, and returns a CharacterList with all available VCF fields within specific categories. Users should consult the fixed
, info
and geno
category for available data entries that could be converted into VCFArray
instances. The data entry names can be used as input for the name
argument in VCFArray
constructor.
args(VCFArray)
#> function (file, vindex = character(), name = NA, pfix = NULL)
#> NULL
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
library(VariantAnnotation)
vcfFields(fl)
#> CharacterList of length 4
#> [["fixed"]] REF ALT QUAL FILTER
#> [["info"]] LDAF AVGPOST RSQ ERATE THETA ... ASN_AF AFR_AF EUR_AF VT SNPSOURCE
#> [["geno"]] GT DS GL
#> [["samples"]] HG00096 HG00097 HG00099 HG00100 HG00101
Since the index file for our vcf file already exists, the vindex
argument would not be needed (which is the most common case for on-disk VCF files). So we can construct the VCFArray
object for theGT
data entry in the provided VCF file with arguments of file
andname
only.
VCFArray(file = fl, name = "GT")
#> <10376 x 5> VCFMatrix object of type "character":
#> HG00096 HG00097 HG00099 HG00100 HG00101
#> rs7410291 "0|0" "0|0" "1|0" "0|0" "0|0"
#> rs147922003 "0|0" "0|0" "0|0" "0|0" "0|0"
#> rs114143073 "0|0" "0|0" "0|0" "0|0" "0|0"
#> ... . . . . .
#> rs5770892 "0|0" "0|0" "0|0" "0|0" "0|1"
#> rs144055359 "0|0" "0|0" "0|0" "0|0" "0|0"
#> rs114526001 "0|0" "0|0" "0|0" "0|0" "0|0"
We can also construct a VCFArray
object with the file
argument being a VcfFile
object.
vcf <- VariantAnnotation::VcfFile(fl)
VCFArray(file = vcf, name = "DS")
#> <10376 x 5> VCFMatrix object of type "double":
#> HG00096 HG00097 HG00099 HG00100 HG00101
#> rs7410291 0 0 1 0 0
#> rs147922003 0 0 0 0 0
#> rs114143073 0 0 0 0 0
#> ... . . . . .
#> rs5770892 0 0 0 0 1
#> rs144055359 0 0 0 0 0
#> rs114526001 0 0 0 0 0
The file
argument could also take RangedVcfStack
object. Note that an ordinary VcfStack
object without Range
information could not be used to construct a VCFArray
.
extdata <- system.file(package = "GenomicFiles", "extdata")
files <- dir(extdata, pattern="^CEUtrio.*bgz$", full=TRUE)[1:2]
names(files) <- sub(".*_([0-9XY]+).*", "\\1", basename(files))
seqinfo <- as(readRDS(file.path(extdata, "seqinfo.rds")), "Seqinfo")
stack <- GenomicFiles::VcfStack(files, seqinfo)
gr <- as(GenomicFiles::seqinfo(stack)[rownames(stack)], "GRanges")
## RangedVcfStack
rgstack <- GenomicFiles::RangedVcfStack(stack, rowRanges = gr)
rgstack
#> VcfStack object with 2 files and 3 samples
#> GRanges object with 2 ranges and 0 metadata columns
#> Seqinfo object with 25 sequences from hg19 genome
#> use 'readVcfStack()' to extract VariantAnnotation VCF.
Here we choose the name = SB
, which returns a 3-dimensionalVCFArray
object, with the first 2 dimensions correspond to variants and samples respectively.
vcfFields(rgstack)$geno
#> [1] "GT" "AD" "DP" "GQ" "MIN_DP" "PGT" "PID" "PL"
#> [9] "SB"
VCFArray(rgstack, name = "SB")
#> <318 x 3 x 4> VCFArray object of type "integer":
#> ,,1
#> NA12878 NA12891 NA12892
#> rs771898125 NA NA NA
#> rs6118055 NA NA NA
#> ... . . .
#> rs2009307 NA NA NA
#> rs2273444 NA NA NA
#>
#> ...
#>
#> ,,4
#> NA12878 NA12891 NA12892
#> rs771898125 NA NA NA
#> rs6118055 NA NA NA
#> ... . . .
#> rs2009307 NA NA NA
#> rs2273444 NA NA NA
As the vignette title suggest, the backend VCF file could also be remote files. Here we included an example of representing VCF file of chromosome 22 from the 1000 Genomes Project (Phase 3). NOTE that for a remote VCF file, the vindex
argument must be specified. Since this VCF files is relatively big, and it takes longer time, we only show the code here without evaluation.
chr22url <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
chr22url.tbi <- paste0(chr22url, ".tbi")
va <- VCFArray(chr22url, vindex =chr22url.tbi, name = "GT")
VCFArray methods
VCFArray
represents VCF files as DelayedArray
instances. It has methods like dim
, dimnames
defined, and it inherits array-like operations and methods from DelayedArray
, e.g., the subsetting method of [
.
NOTE that for 1-dimensional VCFArray
objects that are generated from the fixed / information data field of VCF file, drop = FALSE
should always be used with [
subsetting to ensure VCFArray
object as returned value.
slot accessors
seed
returns the VCFArraySeed
of the VCFArray
object, which includes information about the backend VCF file, e.g., the vcf file path, index file path, name of the data entry (with a prefix of category), dimension and etc.
va <- VCFArray(fl, name = "GT")
seed(va)
#> VCFArraySeed
#> VCF file path: /home/biocbuild/bbs-3.21-bioc/R/site-library/VariantAnnotation/extdata/chr22.vcf.gz
#> VCF index path: /home/biocbuild/bbs-3.21-bioc/R/site-library/VariantAnnotation/extdata/chr22.vcf.gz.tbi
#> array data: GT
#> dim: 10376 x 5
vcffile
returns the VcfFile
object corresponding to the backend VCF file.
vcffile(va)
#> class: VcfFile
#> path: /home/biocbuild/bbs-3.21-bioc/R/site-library/VariantAnnota.../chr22.vcf.gz
#> index: /home/biocbuild/bbs-3.21-bioc/R/site-library/VariantA.../chr22.vcf.gz.tbi
#> isOpen: FALSE
#> yieldSize: NA
dim()
and dimnames()
The dimnames(VCFArray)
returns an unnamed list, with the length of each element being the same as return from dim(VCFArray)
.
va <- VCFArray(fl, name = "GT")
dim(va)
#> [1] 10376 5
class(dimnames(va))
#> [1] "list"
lengths(dimnames(va))
#> [1] 10376 5
[
subsetting
VCFArray
instances can be subsetted, following the usual _R_conventions, with numeric or logical vectors; logical vectors are recycled to the appropriate length.
va[1:3, 1:3]
#> <3 x 3> DelayedMatrix object of type "character":
#> HG00096 HG00097 HG00099
#> rs7410291 "0|0" "0|0" "1|0"
#> rs147922003 "0|0" "0|0" "0|0"
#> rs114143073 "0|0" "0|0" "0|0"
va[c(TRUE, FALSE), ]
#> <5188 x 5> DelayedMatrix object of type "character":
#> HG00096 HG00097 HG00099 HG00100 HG00101
#> rs7410291 "0|0" "0|0" "1|0" "0|0" "0|0"
#> rs114143073 "0|0" "0|0" "0|0" "0|0" "0|0"
#> rs182170314 "0|0" "0|0" "0|0" "0|0" "0|0"
#> ... . . . . .
#> rs9628212 "0|0" "0|0" "0|0" "0|0" "0|0"
#> rs9628178 "0|0" "0|0" "0|0" "0|0" "0|0"
#> rs144055359 "0|0" "0|0" "0|0" "0|0" "0|0"
Some numeric calculation
Numeric calculations could be evaluated on VCFArray
objects.
ds <- VCFArray(fl, name = "DS")
log(ds+5)
#> <10376 x 5> DelayedMatrix object of type "double":
#> HG00096 HG00097 HG00099 HG00100 HG00101
#> rs7410291 1.609438 1.609438 1.791759 1.609438 1.609438
#> rs147922003 1.609438 1.609438 1.609438 1.609438 1.609438
#> rs114143073 1.609438 1.609438 1.609438 1.609438 1.609438
#> ... . . . . .
#> rs5770892 1.609438 1.609438 1.609438 1.609438 1.791759
#> rs144055359 1.609438 1.609438 1.609438 1.609438 1.609438
#> rs114526001 1.609438 1.609438 1.609438 1.609438 1.609438
Internals: VCFArraySeed
The VCFArraySeed
class represents the ‘seed’ for the VCFArray
object. It is not exported from the VCFArray package. Seed objects should contain the VCF file path, and are expected to satisfy the “seed contract” of DelayedArray, i.e. to support dim()
anddimnames()
.
seed <- VCFArray:::VCFArraySeed(fl, name = "GT", pfix = NULL)
seed
#> VCFArraySeed
#> VCF file path: /home/biocbuild/bbs-3.21-bioc/R/site-library/VariantAnnotation/extdata/chr22.vcf.gz
#> VCF index path: /home/biocbuild/bbs-3.21-bioc/R/site-library/VariantAnnotation/extdata/chr22.vcf.gz.tbi
#> array data: GT
#> dim: 10376 x 5
path(vcffile(seed))
#> [1] "/home/biocbuild/bbs-3.21-bioc/R/site-library/VariantAnnotation/extdata/chr22.vcf.gz"
The seed can be used to construct a VCFArray
instance.
(va <- VCFArray(seed))
#> <10376 x 5> VCFMatrix object of type "character":
#> HG00096 HG00097 HG00099 HG00100 HG00101
#> rs7410291 "0|0" "0|0" "1|0" "0|0" "0|0"
#> rs147922003 "0|0" "0|0" "0|0" "0|0" "0|0"
#> rs114143073 "0|0" "0|0" "0|0" "0|0" "0|0"
#> ... . . . . .
#> rs5770892 "0|0" "0|0" "0|0" "0|0" "0|1"
#> rs144055359 "0|0" "0|0" "0|0" "0|0" "0|0"
#> rs114526001 "0|0" "0|0" "0|0" "0|0" "0|0"
The DelayedArray()
constructor with VCFArraySeed
object as inputs will return the same content as the VCFArray()
constructor over the same VCFArraySeed
.
da <- DelayedArray(seed)
class(da)
#> [1] "VCFMatrix"
#> attr(,"package")
#> [1] "VCFArray"
all.equal(da, va)
#> [1] TRUE
sessionInfo()
sessionInfo()
#> R version 4.5.0 RC (2025-04-04 r88126)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] VariantAnnotation_1.54.0 Rsamtools_2.24.0
#> [3] Biostrings_2.76.0 XVector_0.48.0
#> [5] SummarizedExperiment_1.38.0 Biobase_2.68.0
#> [7] GenomicRanges_1.60.0 GenomeInfoDb_1.44.0
#> [9] VCFArray_1.24.0 DelayedArray_0.34.0
#> [11] SparseArray_1.8.0 S4Arrays_1.8.0
#> [13] abind_1.4-8 IRanges_2.42.0
#> [15] S4Vectors_0.46.0 MatrixGenerics_1.20.0
#> [17] matrixStats_1.5.0 Matrix_1.7-3
#> [19] BiocGenerics_0.54.0 generics_0.1.3
#> [21] BiocStyle_2.36.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.48.0 rjson_0.2.23 xfun_0.52
#> [4] bslib_0.9.0 lattice_0.22-7 vctrs_0.6.5
#> [7] tools_4.5.0 bitops_1.0-9 curl_6.2.2
#> [10] parallel_4.5.0 AnnotationDbi_1.70.0 RSQLite_2.3.9
#> [13] blob_1.2.4 BSgenome_1.76.0 lifecycle_1.0.4
#> [16] GenomeInfoDbData_1.2.14 compiler_4.5.0 codetools_0.2-20
#> [19] GenomicFiles_1.44.0 htmltools_0.5.8.1 sass_0.4.10
#> [22] RCurl_1.98-1.17 yaml_2.3.10 crayon_1.5.3
#> [25] jquerylib_0.1.4 BiocParallel_1.42.0 cachem_1.1.0
#> [28] digest_0.6.37 restfulr_0.0.15 bookdown_0.43
#> [31] fastmap_1.2.0 grid_4.5.0 cli_3.6.4
#> [34] GenomicFeatures_1.60.0 XML_3.99-0.18 UCSC.utils_1.4.0
#> [37] bit64_4.6.0-1 rmarkdown_2.29 httr_1.4.7
#> [40] bit_4.6.0 png_0.1-8 memoise_2.0.1
#> [43] evaluate_1.0.3 knitr_1.50 BiocIO_1.18.0
#> [46] rtracklayer_1.68.0 rlang_1.1.6 DBI_1.2.3
#> [49] BiocManager_1.30.25 jsonlite_2.0.0 R6_2.6.1
#> [52] GenomicAlignments_1.44.0