Sparse Project VCF: efficient encoding of population genotype matrices

Michael F Lin et al. Bioinformatics. 2021.

Abstract

Summary: Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts.

Availability and implementation: Apache-licensed reference implementation: github.com/mlin/spVCF.

Supplementary information: Supplementary data are available at Bioinformatics online.

© The Author(s) 2020. Published by Oxford University Press.

Figures

Fig. 1.

spVCF encoding example. (A) Illustrative pVCF of four variant loci in three sequenced study participants, with matrix entries encoding called genotypes and several numeric QC measures. Some required VCF fields are omitted for brevity. (B) spVCF encoding of the same example. QC values for reference-identical and non-called cells are reduced to a power-of-two lower bound on read depth DP. Runs of identical entries down columns are abbreviated using quotation marks, then runs of these marks across rows are length-encoded. Cy’s entries are shown column-aligned for clarity; the encoded text matrix is ragged.
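The two steps described in the caption can be illustrated with a minimal Python sketch. It is not the reference implementation: the cell layout `GT:DP:<other fields>`, the function names, and the exact token syntax for a run (a quotation mark followed by the run length, with a bare quotation mark for a run of one) are assumptions made for illustration only.

```python
from typing import List, Optional


def round_dp_down(dp: int) -> int:
    """Return the largest power of two <= dp (0 for non-positive input)."""
    return 0 if dp <= 0 else 1 << (dp.bit_length() - 1)


def squeeze_cell(cell: str) -> str:
    """Reduce QC detail for reference-identical or non-called cells.

    Assumed (hypothetical) cell layout: GT:DP:<other QC fields>.
    Cells with any other genotype are returned unchanged.
    """
    fields = cell.split(":")
    gt = fields[0]
    if gt not in ("0/0", "./."):
        return cell
    dp = int(fields[1]) if len(fields) > 1 and fields[1].isdigit() else 0
    return f"{gt}:{round_dp_down(dp)}"


def encode_rows(rows: List[List[str]]) -> List[str]:
    """Encode a genotype matrix (rows = loci, columns = samples).

    A cell equal to the cell above it (after squeezing) becomes a
    quotation mark; consecutive marks within a row are length-encoded.
    """
    encoded: List[str] = []
    prev: Optional[List[str]] = None  # squeezed cells of the previous row

    def flush(run: int, tokens: List[str]) -> None:
        # Assumed token syntax: '"' for one repeat, '"N' for N repeats.
        if run:
            tokens.append(f'"{run}' if run > 1 else '"')

    for row in rows:
        cells = [squeeze_cell(c) for c in row]
        tokens: List[str] = []
        run = 0
        for j, cell in enumerate(cells):
            if prev is not None and cell == prev[j]:
                run += 1
            else:
                flush(run, tokens)
                run = 0
                tokens.append(cell)
        flush(run, tokens)
        encoded.append("\t".join(tokens))
        prev = cells
    return encoded


# Made-up example: three loci (rows) by three participants (columns).
rows = [
    ["0/0:35:99", "0/1:20:50", "0/0:17:40"],
    ["0/0:40:99", "0/1:20:50", "0/0:30:60"],
    ["0/0:33:80", "1/1:25:70", "0/0:16:45"],
]
encoded = encode_rows(rows)
```

Rounding DP down to a power of two makes neighbouring reference-identical cells far more likely to be byte-identical, which is what lets the column-wise runs grow long enough for the run-length step to pay off. The rows come out with different token counts, which is the ragged matrix the caption mentions.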
