
seqVCF2GDS: Reformat VCF Files

Description Usage Arguments Details Value Author(s) References See Also Examples

Reformats Variant Call Format (VCF) files.

seqVCF2GDS(vcf.fn, out.fn, header=NULL, storage.option="LZMA_RA",
    info.import=NULL, fmt.import=NULL, genotype.var.name="GT",
    ignore.chr.prefix="chr", scenario=c("general", "imputation"),
    reference=NULL, start=1L, count=-1L, optimize=TRUE, raise.error=TRUE,
    digest=TRUE, parallel=FALSE, verbose=TRUE)

seqBCF2GDS(bcf.fn, out.fn, header=NULL, storage.option="LZMA_RA",
    info.import=NULL, fmt.import=NULL, genotype.var.name="GT",
    ignore.chr.prefix="chr", scenario=c("general", "imputation"),
    reference=NULL, optimize=TRUE, raise.error=TRUE, digest=TRUE,
    bcftools="bcftools", verbose=TRUE)

vcf.fn	the file name(s) of VCF format; or a connection object
bcf.fn	a file name of binary VCF format (BCF)
out.fn	the file name of output GDS file
header	if NULL, header is set to be seqVCF_Header(vcf.fn)
storage.option	specify the storage and compression option: "ZIP_RA" (i.e., seqStorageOption("ZIP_RA")); "LZMA_RA" (the default), which uses the LZMA compression algorithm with a higher compression ratio; or "LZ4_RA", which uses an extremely fast compression and decompression algorithm. "ZIP_RA.max", "LZMA_RA.max" and "LZ4_RA.max" correspond to the algorithms with a maximum compression level; the suffix "_RA" indicates that fine-level random access is available; see seqStorageOption for more details
info.import	characters, the variable name(s) in the INFO field for import; or NULL for all variables
fmt.import	characters, the variable name(s) in the FORMAT field for import; or NULL for all variables
genotype.var.name	the ID for genotypic data in the FORMAT column; "GT" by default (as in VCFv4.0)
ignore.chr.prefix	a vector of character, indicating the prefix of chromosome which should be ignored, like "chr"; it is not case-sensitive
scenario	"general": use float32 to store floating-point numbers (by default); "imputation": use packedreal16 to store DS and GP in the FORMAT field with four decimal place accuracy
reference	genome reference, like "hg19", "GRCh37"; if the genome reference is not available in VCF files, users could specify the reference here
start	the starting variant, if importing only part of the VCF file(s)
count	the maximum number of variants to import, if importing only part of the VCF file(s); -1 indicates importing to the end
optimize	if TRUE, optimize the access efficiency by calling cleanup.gds
raise.error	TRUE: throw an error if numeric conversion fails; FALSE: get a missing value if numeric conversion fails
digest	a logical value (TRUE/FALSE) or a character ("md5", "sha1", "sha256", "sha384" or "sha512"); add md5 hash codes to the GDS file if TRUE or a digest algorithm is specified
parallel	FALSE (serial processing), TRUE (parallel processing), a numeric value indicating the number of cores, or a cluster object for parallel processing; parallel is passed to the argument cl in seqParallel, see seqParallel for more details
verbose	if TRUE, show information
bcftools	the path of the program bcftools
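
As an illustration of how start, count and info.import interact, the following minimal sketch imports only a leading slice of the example VCF file shipped with SeqArray and skips all INFO fields. The temporary output name and the choice of 100 variants are assumptions made for this example only.

```r
library(SeqArray)

# example VCF file shipped with SeqArray
vcf.fn <- seqExampleFileName("vcf")

# import at most the first 100 variants and no INFO fields
# (the output file name is a temporary, illustrative one)
out.fn <- seqVCF2GDS(vcf.fn, tempfile(fileext=".gds"),
    storage.option="ZIP_RA", start=1L, count=100L,
    info.import=character(0), verbose=FALSE)

# count the variants that were actually imported
f <- seqOpen(out.fn)
n.variant <- length(seqGetData(f, "variant.id"))
seqClose(f)
print(n.variant)

# delete the temporary file
unlink(out.fn, force=TRUE)
```

Since count is a maximum, the number of imported variants can be smaller when the file ends earlier.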

If there is more than one file in vcf.fn, seqVCF2GDS will merge all VCF files together if they contain the same samples. This is useful when variant data are split across files by chromosome.
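
The merging behaviour can be sketched as follows. Because no per-chromosome files ship with the package, this example first creates two of them from the bundled example VCF (chromosomes 1 and 2, written with seqGDS2VCF) and then passes both file names to a single seqVCF2GDS call. All file names are temporary and hypothetical, and the sketch assumes the example data contain chromosomes named "1" and "2".

```r
library(SeqArray)

# convert the bundled example VCF once, to have something to split
vcf.fn <- seqExampleFileName("vcf")
all.fn <- seqVCF2GDS(vcf.fn, tempfile(fileext=".gds"),
    storage.option="ZIP_RA", verbose=FALSE)

# write chromosomes 1 and 2 to separate VCF files (hypothetical temp names),
# mimicking variant data that are split by chromosome
chr.vcf <- c(tempfile(fileext=".vcf"), tempfile(fileext=".vcf"))
n.each <- integer(2)
f <- seqOpen(all.fn)
for (i in 1:2) {
    seqSetFilterChrom(f, include=as.character(i), verbose=FALSE)
    n.each[i] <- length(seqGetData(f, "variant.id"))
    seqGDS2VCF(f, chr.vcf[i], verbose=FALSE)
}
seqClose(f)

# merge: both files contain the same samples, so pass them together
merged.fn <- seqVCF2GDS(chr.vcf, tempfile(fileext=".gds"),
    storage.option="ZIP_RA", verbose=FALSE)

# the merged file should hold the variants of both inputs
f <- seqOpen(merged.fn)
n.merged <- length(seqGetData(f, "variant.id"))
seqClose(f)
print(c(n.each, merged=n.merged))

# delete the temporary files
unlink(c(all.fn, chr.vcf, merged.fn), force=TRUE)
```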

The real numbers in the VCF file(s) are stored in 32-bit floating-point format by default. Users can set storage.option=seqStorageOption(float.mode="float64") to switch to 64-bit floating-point format. Alternatively, packed real numbers can be adopted by setting storage.option=seqStorageOption(float.mode="packedreal16:scale=0.0001").
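
A minimal sketch of switching the floating-point storage, assuming the example VCF file shipped with SeqArray and a temporary output name chosen for illustration; "ZIP_RA" is passed explicitly here only to keep the example fast.

```r
library(SeqArray)

# example VCF file shipped with SeqArray
vcf.fn <- seqExampleFileName("vcf")

# build a storage option object that stores real numbers as 64-bit floats
opt <- seqStorageOption("ZIP_RA", float.mode="float64")

# convert using the customized storage option
out.fn <- seqVCF2GDS(vcf.fn, tempfile(fileext=".gds"),
    storage.option=opt, verbose=FALSE)

# display the structure of the resulting file
(f <- seqOpen(out.fn))
seqClose(f)

# remember whether the file was created, then delete it
created <- file.exists(out.fn)
unlink(out.fn, force=TRUE)
```

The same pattern applies to float.mode="packedreal16:scale=0.0001"; only the string passed to float.mode changes.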

By default, the compression method is "LZMA_RA" (http://tukaani.org/xz, LZMA algorithm with default compression level + independent data blocks for fine-level random access). Users can maximize the compression ratio by storage.option="LZMA_RA.max" or storage.option=seqStorageOption("LZMA_RA.max"). LZMA is known to have a higher compression ratio than the zlib algorithm. LZ4 (https://github.com/lz4/lz4) is an option via storage.option="LZ4_RA" or storage.option=seqStorageOption("LZ4_RA").
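
To see how the compression methods compare on a given data set, one possible approach is to convert the same VCF file with each option and inspect the resulting file sizes. The sketch below assumes the example VCF file shipped with SeqArray and temporary output names; the sizes it prints depend on the data and on the installed library versions, so none are quoted here.

```r
library(SeqArray)

# example VCF file shipped with SeqArray
vcf.fn <- seqExampleFileName("vcf")

# convert with each compression option and record the output size in bytes
options <- c("ZIP_RA", "LZMA_RA", "LZ4_RA")
sizes <- sapply(options, function(opt) {
    fn <- seqVCF2GDS(vcf.fn, tempfile(fileext=".gds"),
        storage.option=opt, verbose=FALSE)
    sz <- file.size(fn)
    unlink(fn, force=TRUE)   # delete the temporary file
    sz
})

print(sizes)
```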

If multiple cores/processes are specified in parallel, all VCF files are scanned to calculate the total number of variants before format conversion, and then split by the number of cores/processes.

storage.option="Ultra" and storage.option="UltraMax" need much more memory than other compression methods. Users may consider using seqRecompress to recompress the GDS file after calling seqVCF2GDS() with storage.option="ZIP_RA", since seqRecompress() compresses data nodes one by one, taking much less memory than "Ultra" and "UltraMax".

If storage.option="LZMA_RA" runs out of memory (e.g., there are too many annotation fields in the VCF file), users could use storage.option="ZIP_RA" and then call seqRecompress(, compress="LZMA").
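
The two-step workaround described above might look like the following sketch, which assumes the example VCF file shipped with SeqArray and a temporary output name. The file is first written with "ZIP_RA" and then recompressed in place with seqRecompress.

```r
library(SeqArray)

# example VCF file shipped with SeqArray
vcf.fn <- seqExampleFileName("vcf")

# step 1: convert with the less memory-hungry ZIP_RA option
gds.fn <- seqVCF2GDS(vcf.fn, tempfile(fileext=".gds"),
    storage.option="ZIP_RA", verbose=FALSE)
size.zip <- file.size(gds.fn)

# step 2: recompress the data nodes one by one with LZMA
seqRecompress(gds.fn, compress="LZMA", verbose=FALSE)
size.lzma <- file.size(gds.fn)

# file sizes in bytes before and after recompression
print(c(ZIP_RA=size.zip, LZMA=size.lzma))

# the recompressed file can still be opened as usual
f <- seqOpen(gds.fn)
n.variant <- length(seqGetData(f, "variant.id"))
seqClose(f)

# delete the temporary file
unlink(gds.fn, force=TRUE)
```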

Returns the file name of the output GDS file with an absolute path.

Xiuwen Zheng

Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156-2158.

[seqVCF_Header](/bioc/SeqArray/man/seqVCF%5FHeader.html), [seqStorageOption](/bioc/SeqArray/man/seqStorageOption.html), [seqMerge](/bioc/SeqArray/man/seqMerge.html), [seqGDS2VCF](/bioc/SeqArray/man/seqGDS2VCF.html), [seqRecompress](/bioc/SeqArray/man/seqRecompress.html)


library(SeqArray)

# the VCF file
vcf.fn <- seqExampleFileName("vcf")

# conversion
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA")

# conversion in parallel
seqVCF2GDS(vcf.fn, "tmp_p2.gds", storage.option="ZIP_RA", parallel=2L)

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)

# convert without the INFO fields
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA",
    info.import=character(0))

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)

# convert without the INFO and FORMAT fields
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA",
    info.import=character(0), fmt.import=character(0))

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)

# delete the temporary files
unlink(c("tmp.gds", "tmp_p2.gds"), force=TRUE)

SeqArray documentation built on Nov. 8, 2020, 5:08 p.m.