Efficient genotype compression and analysis of large genetic-variation data sets

Ryan M Layer et al. Nat Methods. 2016 Jan.

Abstract

Genotype Query Tools (GQT) is an indexing strategy that expedites analyses of genome-variation data sets in Variant Call Format based on sample genotypes, phenotypes and relationships. GQT's compressed genotype index minimizes decompression for analysis, and its performance relative to that of existing methods improves with cohort size. We show substantial (up to 443-fold) gains in performance over existing methods and demonstrate GQT's utility for exploring massive data sets involving thousands to millions of genomes. GQT can be accessed at https://github.com/ryanlayer/gqt.

Conflict of interest statement

COMPETING FINANCIAL INTERESTS

The authors declare no competing financial interests.

Figures

Figure 1

Creation and data exploration of an individual-centric genotype index. (a) The “variant-centric” VCF standard is essentially a genotype matrix whose rows correspond to variants and columns to individuals. (b) The VCF standard is inefficient for queries across all genotypes and a subset of individuals since each variant row must be inspected (in red) to test all of the genotypes of specific individuals. (c) By transposing the matrix such that rows (data records) now represent the full set of genotypes for each individual, the data better aligns to individual-centric questions and algorithms. (d) Sorting the columns of an individual-centric matrix by alternate allele count (AC) improves compressibility. After the variants have been reorganized based on AC, all genotypes for each sample are converted to Word Aligned Hybrid compressed bitmaps (see Supplementary Note). (e) GQT will create a SQLite database of a PED file describing the familial relationships, gender, ancestry, and custom, user-defined sample descriptions. The resulting database allows GQT to quickly extract the specific compressed bitmap records in the genotype index corresponding to a query. Once the compressed bitmaps for the relevant samples are extracted, they are compared to quickly identify the subset of variant(s) that meet the genotype requirements (in this example all individuals must be heterozygous, yielding variant V10 which is denoted by an asterisk) imposed for the subset of individuals.
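The steps in panels (a) to (e) can be sketched in a few lines of Python. This is a minimal illustration, not GQT's implementation: it uses plain Python integers as uncompressed bitmaps in place of Word Aligned Hybrid compression, it omits the SQLite sample database, and the genotype codes (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate) and the function names `build_index` and `variants_all_het` are assumptions made for this example.

```python
from typing import Dict, List, Tuple

HET = 1  # assumed genotype code for heterozygous calls


def build_index(
    variant_ids: List[str], genotypes: List[List[int]]
) -> Tuple[List[str], List[Dict[int, int]]]:
    """Build an individual-centric bitmap index from a variant-centric matrix.

    genotypes[v][s] is the genotype code of sample s at variant v.
    """
    n_samples = len(genotypes[0])
    # Alternate allele count: het adds 1, hom-alt adds 2, so the row sum is AC.
    ac = [sum(row) for row in genotypes]
    # Reorder variants by AC (stable sort keeps the input order among ties).
    order = sorted(range(len(genotypes)), key=lambda v: ac[v])
    sorted_ids = [variant_ids[v] for v in order]
    # Transpose: one record per sample, one bitmap per genotype code.
    index = []
    for s in range(n_samples):
        bitmaps = {0: 0, 1: 0, 2: 0}
        for pos, v in enumerate(order):
            bitmaps[genotypes[v][s]] |= 1 << pos
        index.append(bitmaps)
    return sorted_ids, index


def variants_all_het(
    sorted_ids: List[str], index: List[Dict[int, int]], samples: List[int]
) -> List[str]:
    """Return variants at which every selected sample is heterozygous."""
    result = (1 << len(sorted_ids)) - 1  # start with every variant set
    for s in samples:
        result &= index[s][HET]  # bitwise AND keeps variants shared by all
    return [vid for pos, vid in enumerate(sorted_ids) if (result >> pos) & 1]
```

Only the bitmap records of the queried samples are touched, which is the point of the transposed layout: the cost of a query scales with the number of selected individuals rather than with the full matrix.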

Figure 2

GQT query performance and applications of the genotype index. (a) Fold speedup for computing the alternate allele frequency (AF) for a targeted 10% of the 2,504 individuals in 1000 Genomes Phase 3. The baseline was the BCFTOOLS “stats” command. Two versions of GQT output were considered, valid VCF (“GQT query”) and the count of matching variants (“GQT query -c”). (b) Speedup for finding variants having an AF of < 1% in a target 10% of individuals. The baseline was BCFTOOLS “view -C”. PLINK did not directly perform this operation and was excluded. (c) Query performance for simulated genotypes on a 100 Mb genome with between 100 and 100,000 individuals. The speedup for computing the alternate AF count for 10% of individuals is presented. (d) The speedup for finding variants having an AF of < 1% in 10% of individuals. Again, PLINK was excluded. Times reported are for 100,000 simulated genomes. Neither variant nor sample metadata were included. The metrics for 1 million individuals (est.) were estimated using a linear fit. GQT’s runtimes are similar for 2,504 individuals from 1000 Genomes and for the simulation because the total number of genotypes is nearly identical (2,504 individuals with 84,739,846 variants and 100,000 individuals with 2,052,387 variants, respectively). (e) A principal component analysis of all variants from 1000 Genomes Phase 3 requiring 207 minutes for 2,504 individuals, and 3 minutes for 347 AMR individuals. (f) Fst analysis of Europeans vs. East Asians and Europeans vs. Africans on chromosome 12.
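The two benchmark operations in panels (a) to (d), computing the alternate allele frequency for a subset of individuals and selecting variants below a frequency threshold, can be illustrated with per-sample bitmaps. The sketch below assumes diploid genotypes and uncompressed Python integers as bitmaps; the function names and the bitmap layout are hypothetical and do not reflect GQT's internals or its command-line interface.

```python
from typing import List


def subset_allele_frequencies(
    het_bitmaps: List[int],
    hom_alt_bitmaps: List[int],
    samples: List[int],
    n_variants: int,
) -> List[float]:
    """Alternate allele frequency per variant over the selected samples.

    Bit `pos` of het_bitmaps[s] (hom_alt_bitmaps[s]) is set when sample s is
    heterozygous (homozygous alternate) at variant `pos`.
    """
    counts = [0] * n_variants
    for s in samples:
        for pos in range(n_variants):
            counts[pos] += (het_bitmaps[s] >> pos) & 1            # one alt allele
            counts[pos] += 2 * ((hom_alt_bitmaps[s] >> pos) & 1)  # two alt alleles
    total_alleles = 2 * len(samples)  # diploid assumption
    return [c / total_alleles for c in counts]


def rare_variants(freqs: List[float], max_af: float = 0.01) -> List[int]:
    """Positions of variants with AF below max_af in the subset.

    Note: this also returns variants with AF equal to 0 in the subset.
    """
    return [pos for pos, af in enumerate(freqs) if af < max_af]
```

Because only the records of the targeted samples are read, restricting the query to 10% of individuals reduces the work roughly in proportion, which is consistent with the subset-based speedups described in the caption.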
