Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C Chang et al. GigaScience. 2015.

Abstract

Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and more scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.

Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).
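
As a concrete illustration of the bit-level parallelism mentioned above: PLINK's binary format packs genotypes into 2-bit fields, so per-variant allele and missingness counts can be accumulated a machine word (32 genotypes) at a time with mask-and-popcount operations. The sketch below assumes the standard PLINK 1 .bed encoding (00 = homozygous A1, 01 = missing, 10 = heterozygous, 11 = homozygous A2); the function and test values are a minimal example of ours, not PLINK 1.9's actual internals.

```c
#include <stdint.h>
#include <stdio.h>

/* Count A1 alleles and missing genotypes in one 64-bit word holding 32
   two-bit genotypes in PLINK 1 .bed encoding:
   00 = hom A1, 01 = missing, 10 = het, 11 = hom A2.
   Requires GCC/Clang for __builtin_popcountll. */
static void count_word(uint64_t w, int* a1_alleles, int* missing) {
    uint64_t lo = w & 0x5555555555555555ULL;        /* low bit of each field  */
    uint64_t hi = (w >> 1) & 0x5555555555555555ULL; /* high bit of each field */
    int hom_a1 = __builtin_popcountll(~hi & ~lo & 0x5555555555555555ULL);
    int het = __builtin_popcountll(hi & ~lo);
    *missing = __builtin_popcountll(~hi & lo);
    *a1_alleles = 2 * hom_a1 + het; /* hom A1 contributes 2 copies, het 1 */
}

int main(void) {
    /* Low byte 0b00011011: fields from the least-significant pair are
       11 (hom A2), 10 (het), 01 (missing), 00 (hom A1); the other 28
       fields are all 11 (hom A2). */
    uint64_t w = (~0ULL & ~0xFFULL) | 0x1BULL;
    int a1, miss;
    count_word(w, &a1, &miss);
    printf("A1 alleles: %d, missing: %d\n", a1, miss); /* prints 3 and 1 */
    return 0;
}
```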

Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Keywords: Computational statistics; GWAS; High-density SNP genotyping; Population genetics; Whole-genome sequencing.

Figures

Figure 1

2 × 2 contingency table log-frequencies. This is a plot of relative frequencies of 2 × 2 contingency tables with top row sum 1000, left column sum 40000, and grand total 100000, reflecting a low-MAF variant where the difference between the chi-square test and Fisher's exact test is relevant. All such tables with upper left value smaller than 278, or larger than 526, have frequency smaller than 2^-53 (dotted horizontal line); thus, if the obvious summation algorithm is used, they have no impact on the p-value denominator due to numerical underflow. (It can be proven that this underflow has negligible impact on accuracy, due to how rapidly the frequencies decay.) A few more tables need to be considered when evaluating the numerator, but we can usually skip at least 70%, and this fraction improves as problem size increases.
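
To make this summation strategy concrete, the sketch below computes a two-sided 2 × 2 Fisher's exact p-value by tracking each table's likelihood relative to the observed table's, and abandoning each directional walk once terms fall below 2^-53 of the largest term seen, at which point (per the argument above) they can no longer affect the double-precision sums. This is a minimal illustration under those assumptions, not PLINK 1.9's tuned implementation; in particular it assumes the observed table is not so deep in a tail that relative likelihoods overflow.

```c
#include <stdint.h>
#include <stdio.h>

/* Two-sided Fisher's exact test on the 2x2 table [[a,b],[c,d]].
   Likelihoods are tracked relative to the observed table's (set to 1.0),
   using the unimodal hypergeometric recurrences
     P(a-1)/P(a) = a*d / ((b+1)*(c+1))  and
     P(a+1)/P(a) = b*c / ((a+1)*(d+1)),
   and each walk stops once terms drop below 2^-53 of the running maximum. */
static double fisher22(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    const double kCut = 0x1.0p-53; /* 2^-53 */
    double numer = 1.0, denom = 1.0, max_rl = 1.0;

    /* Walk left: a--, d--, b++, c++. */
    double rl = 1.0;
    uint32_t aa = a, bb = b, cc = c, dd = d;
    while (aa > 0 && dd > 0) {
        rl *= ((double)aa * dd) / (((double)bb + 1.0) * ((double)cc + 1.0));
        aa--; dd--; bb++; cc++;
        if (rl > max_rl) max_rl = rl;
        if (rl < kCut * max_rl) break; /* numerically invisible from here on */
        denom += rl;
        if (rl <= 1.0 + 1e-12) numer += rl; /* at most as likely as observed */
    }

    /* Walk right: a++, d++, b--, c--. */
    rl = 1.0; aa = a; bb = b; cc = c; dd = d;
    while (bb > 0 && cc > 0) {
        rl *= ((double)bb * cc) / (((double)aa + 1.0) * ((double)dd + 1.0));
        aa++; dd++; bb--; cc--;
        if (rl > max_rl) max_rl = rl;
        if (rl < kCut * max_rl) break;
        denom += rl;
        if (rl <= 1.0 + 1e-12) numer += rl;
    }
    return numer / denom;
}

int main(void) {
    /* Margins from the figure: top row sum 1000, left column sum 40000,
       grand total 100000, with the observed table at independence. */
    printf("p = %g\n", fisher22(400, 600, 39600, 59400));
    return 0;
}
```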

Figure 2

Computation pattern for our 2 × 3 Fisher's exact test implementation. This is a plot of the set of alternative 2 × 3 contingency tables explicitly considered by our algorithm when testing the table with 65, 136, 324 in the top row and 81, 172, 314 in the bottom row. Letting ℓ denote the relative likelihood of observing the tested table under the null hypothesis, the set of tables with null-hypothesis relative likelihoods between 2^-53·ℓ and ℓ has an ellipsoidal annulus shape, with area scaling as O(n) as the problem size increases; while the set of tables with relative likelihood greater than 2^-53·ℓ_max (where ℓ_max is the maximal single-table relative likelihood) has an elliptical shape, also with O(n) area. Summing the relative likelihoods in the first set, and then dividing that number by the sum of the relative likelihoods in the second set, yields the desired p-value to 10+ digit accuracy in O(n) time. In addition, we exploit the fact that a "row" of 2 × 3 table likelihoods sums to a single 2 × 2 table likelihood; this lets us essentially skip the top and bottom of the annulus, as well as all but a single row of the central ellipse.
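
The row-collapsing fact invoked in the last sentence is an instance of Vandermonde's convolution: holding one cell of the top row fixed and summing the fixed-margin 2 × 3 likelihoods over the remaining free cell yields the likelihood of the 2 × 2 table obtained by merging two columns. The short numerical check below, which reuses the tested table's margins from this caption, illustrates the identity; the helper names are ours. (Compile with -lm.)

```c
#include <math.h>
#include <stdio.h>

/* log of the binomial coefficient C(n, k), via lgamma */
static double log_choose(double n, double k) {
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1);
}

int main(void) {
    /* Margins of the tested table: columns 146, 308, 638; top row 525;
       grand total 1092.  Fix the top-left cell at its observed value 65. */
    const int c1 = 146, c2 = 308, c3 = 638, r1 = 525, n = 1092;
    const int a = 65;
    /* Fixed-margin likelihood of a 2x3 table with top row (a, b, c):
       C(c1,a) * C(c2,b) * C(c3,c) / C(n,r1).  Sum over the free cell b. */
    double row_sum = 0.0;
    for (int b = 0; b <= c2; b++) {
        int c = r1 - a - b;
        if (c < 0 || c > c3) continue;
        row_sum += exp(log_choose(c1, a) + log_choose(c2, b)
                       + log_choose(c3, c) - log_choose(n, r1));
    }
    /* 2x2 collapse: columns 2 and 3 merged into one column of size c2+c3. */
    double collapsed = exp(log_choose(c1, a) + log_choose(c2 + c3, r1 - a)
                           - log_choose(n, r1));
    printf("row sum   = %.15g\ncollapsed = %.15g\n", row_sum, collapsed);
    return 0;
}
```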

Figure 3

Rapid classification of "recombination" variant pairs. This is a plot of 101 equally spaced D′ log-likelihoods for (rs58108140, rs140337953) in 1000 Genomes phase 1, used in Gabriel et al.'s method of identifying haplotype blocks. Whenever the upper end of the 90% confidence interval is smaller than 0.90 (i.e. the rightmost 11 likelihoods sum to less than 5% of the total), we have strong evidence for historical recombination between the two variants. After determining that L(D′ = x) has only one extreme value in [0, 1], and that this extremum lies between 0.39 and 0.40, confirming L(D′ = 0.90) < L(D′ = 0.40)/220 is enough to finish classifying the variant pair (due to monotonicity: L(D′ = 0.90) ≥ L(D′ = 0.91) ≥ … ≥ L(D′ = 1.00)); evaluation of the other 99 likelihoods can then be skipped. The dotted horizontal line is at L(D′ = 0.40)/220.
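
The early-exit logic described here reduces, in code, to a single comparison once the likelihood peak has been located. The sketch below captures that shortcut with a toy unimodal likelihood standing in for the real L(D′) computed from genotype counts; the callback type and function names are our own illustration, not PLINK 1.9's API.

```c
#include <math.h>
#include <stdio.h>

/* lik(x) is the (assumed unimodal) likelihood L(D' = x) on the 101-point
   grid 0.00, 0.01, ..., 1.00; x_peak is a grid point at or next to its
   maximum.  If lik(0.90) < lik(x_peak)/220, then by monotonicity each of
   the 11 rightmost grid likelihoods is below lik(x_peak)/220, so they sum
   to less than 11 * lik(x_peak)/220 = lik(x_peak)/20, i.e. under 5% of
   the total (which is at least lik(x_peak)).  The 90% confidence interval
   then ends below 0.90, and the other 99 evaluations can be skipped. */
typedef double (*dprime_lik_fn)(double x);

static int strong_recombination_evidence(dprime_lik_fn lik, double x_peak) {
    return lik(0.90) < lik(x_peak) / 220.0;
}

/* Toy unimodal stand-in peaked near 0.4, as for the plotted variant pair. */
static double toy_lik(double x) {
    return exp(-80.0 * (x - 0.4) * (x - 0.4));
}

int main(void) {
    printf("strong recombination evidence: %s\n",
           strong_recombination_evidence(toy_lik, 0.40) ? "yes" : "no");
    return 0;
}
```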


References

1. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75. doi:10.1086/519795.
2. Browning B, Browning S. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. 2013;194:459–71. doi:10.1534/genetics.113.150029.
3. Howie B, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi:10.1371/journal.pgen.1000529.
4. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303. doi:10.1101/gr.107524.110.
5. Danecek P, Auton A, Abecasis G, Albers C, Banks E, DePristo M, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8. doi:10.1093/bioinformatics/btr330.
