Second-generation PLINK: rising to the challenge of larger and richer datasets
Christopher C Chang et al. Gigascience. 2015.
Abstract
Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.
Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).
Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
Keywords: Computational statistics; GWAS; High-density SNP genotyping; Population genetics; Whole-genome sequencing.
Figures
Figure 1
2 × 2 contingency table log-frequencies. This is a plot of relative frequencies of 2 × 2 contingency tables with top row sum 1000, left column sum 40000, and grand total 100000, reflecting a low-MAF variant where the difference between the chi-square test and Fisher’s exact test is relevant. All such tables with upper left value smaller than 278, or larger than 526, have frequency smaller than 2^−53 (dotted horizontal line); thus, if the obvious summation algorithm is used, they have no impact on the p-value denominator due to numerical underflow. (It can be proven that this underflow has negligible impact on accuracy, due to how rapidly the frequencies decay.) A few more tables need to be considered when evaluating the numerator, but we can usually skip at least 70%, and this fraction improves as problem size increases.
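As a rough illustration of the summation this caption describes, the Python sketch below computes a two-sided 2 × 2 Fisher's exact p-value from likelihoods expressed relative to the observed table, and stops walking outward once a term drops below 2^−53 of the running numerator. It is not PLINK's C/C++ implementation: the function name, the tie tolerance, and the exact stopping rule are assumptions chosen for brevity.

```python
EPS = 2.0 ** -53  # terms below this fraction of a running sum cannot change it


def fisher_2x2_pvalue(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Hypothetical sketch: likelihoods are tracked relative to the observed
    table (relative likelihood 1.0), so no factorials are ever evaluated.
    """
    tail = 1.0   # numerator: tables no more likely than the observed one
    total = 1.0  # denominator: every table that is numerically relevant
    for step in (1, -1):  # walk the upper-left cell upward, then downward
        x, y, z, w = a, b, c, d
        rel = 1.0
        while True:
            if step == 1:
                if y == 0 or z == 0:
                    break  # margins forbid a larger upper-left cell
                rel *= (y * z) / ((x + 1) * (w + 1))
            else:
                if x == 0 or w == 0:
                    break  # margins forbid a smaller upper-left cell
                rel *= (x * w) / ((y + 1) * (z + 1))
            x, y, z, w = x + step, y - step, z - step, w + step
            # The distribution is unimodal, so once a term is this small
            # every later term in the same direction is smaller still.
            if rel < EPS * tail:
                break
            total += rel
            if rel <= 1.0 + 1e-9:  # small tolerance so near-ties count
                tail += rel
    if total == float("inf"):
        return 0.0  # observed table is too far into the tail to represent
    return tail / total
```

For example, `fisher_2x2_pvalue(300, 700, 39700, 59300)` uses the margins from the caption (top row sum 1000, left column sum 40000, grand total 100000). The loop visits only the tables between the observed value and the point where the far tail underflows, rather than all 1001 possible tables.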
Figure 2
Computation pattern for our 2 × 3 Fisher’s exact test implementation. This is a plot of the set of alternative 2 × 3 contingency tables explicitly considered by our algorithm when testing the table with 65, 136, 324 in the top row and 81, 172, 314 in the bottom row. Letting ℓ denote the relative likelihood of observing the tested table under the null hypothesis, the set of tables with null hypothesis relative likelihoods between 2^−53 ℓ and ℓ has an ellipsoidal annulus shape, with area scaling as O(n) as the problem size increases; while the set of tables with relative likelihood greater than 2^−53 ℓ_max (where ℓ_max is the maximal single-table relative likelihood) has an elliptical shape, also with O(n) area. Summing the relative likelihoods in the first set, and then dividing that number by the sum of the relative likelihoods in the second set, yields the desired p-value to 10+ digit accuracy in O(n) time. In addition, we exploit the fact that a “row” of 2 × 3 table likelihoods sums to a single 2 × 2 table likelihood; this lets us essentially skip the top and bottom of the annulus, as well as all but a single row of the central ellipse.
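The “row” identity mentioned in this caption can be checked with a small brute-force reference in Python. This sketch is not the O(n) algorithm from the figure: it enumerates every table with exact integer arithmetic, and the names `fisher_2x3_bruteforce` and `row_weight_sum` are hypothetical. It assumes the usual fixed-margin null, under which a table with top row (a, b, c) has weight C(c1, a)·C(c2, b)·C(c3, c), where c1, c2, c3 are the column sums.

```python
from math import comb


def fisher_2x3_bruteforce(top, bottom):
    """Reference 2 x 3 Fisher's exact p-value by full enumeration."""
    r1 = sum(top)
    c1, c2, c3 = (t + u for t, u in zip(top, bottom))

    def weight(a, b):
        # Unnormalized likelihood of the table whose top row is
        # (a, b, r1 - a - b); comb() returns 0 for impossible cells.
        return comb(c1, a) * comb(c2, b) * comb(c3, r1 - a - b)

    w_obs = weight(top[0], top[1])
    tail = 0
    for a in range(min(c1, r1) + 1):
        for b in range(min(c2, r1 - a) + 1):
            w = weight(a, b)
            if w <= w_obs:  # exact integers, so no tie tolerance is needed
                tail += w
    return tail / comb(c1 + c2 + c3, r1)


def row_weight_sum(c1, c2, c3, r1, a):
    """Sum of 2 x 3 weights over all tables sharing upper-left value a."""
    return sum(
        comb(c1, a) * comb(c2, b) * comb(c3, r1 - a - b)
        for b in range(r1 - a + 1)
    )


# Collapsing the last two columns gives a single 2 x 2 weight
# (Vandermonde's identity), here with the margins of the caption's table.
assert row_weight_sum(146, 308, 638, 525, 65) == comb(146, 65) * comb(308 + 638, 525 - 65)
```

Calling `fisher_2x3_bruteforce((65, 136, 324), (81, 172, 314))` evaluates the table from the caption by visiting every alternative table, which is the cost the annulus-and-ellipse approach is designed to avoid.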
Figure 3
Rapid classification of “recombination” variant pairs. This is a plot of 101 equally spaced D′ log-likelihoods for (rs58108140, rs140337953) in 1000 Genomes phase 1, used in Gabriel et al.’s method of identifying haplotype blocks. Whenever the upper end of the 90% confidence interval is smaller than 0.90 (i.e. the rightmost 11 likelihoods sum to less than 5% of the total), we have strong evidence for historical recombination between the two variants. After determining that L(D′=x) has only one extreme value in [0, 1] and that it’s between 0.39 and 0.40, confirming L(D′=0.90) < L(D′=0.40)/220 is enough to finish classifying the variant pair (due to monotonicity: L(D′=0.90) ≥ L(D′=0.91) ≥ … ≥ L(D′=1.00)); evaluation of the other 99 likelihoods is now skipped in this case. The dotted horizontal line is at L(D′=0.40)/220.
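One way to read the factor of 220: if each of the rightmost 11 likelihoods is at most L(D′=0.90), and L(D′=0.90) is below L(D′=0.40)/220, then their sum is below 11/220 = 5% of L(D′=0.40), which is itself no larger than the total. The Python sketch below shows that shortcut with a fallback to the full 101-point sum. The function name, the callable `lik`, and the toy likelihood are assumptions for illustration only, not PLINK's code or its D′ likelihood.

```python
from math import exp


def strong_recombination_evidence(lik, peak_x):
    """Return (decision, number_of_likelihood_evaluations).

    lik    -- hypothetical callable giving L(D' = x) for x in [0, 1],
              assumed to have a single maximum
    peak_x -- a grid point at or next to that maximum
    """
    # Shortcut: only valid when the maximum lies at or below 0.90, so that
    # lik is non-increasing on [0.90, 1.00].
    if peak_x <= 0.90 and lik(0.90) < lik(peak_x) / 220:
        return True, 2
    # Fallback: evaluate the full grid of 101 equally spaced points.
    values = [lik(i / 100) for i in range(101)]
    return sum(values[90:]) < 0.05 * sum(values), 101


def toy_likelihood(x):
    # Toy single-peaked curve with its maximum between 0.39 and 0.40.
    return exp(-(((x - 0.395) / 0.05) ** 2))


decision, evaluations = strong_recombination_evidence(toy_likelihood, 0.40)
```

With the toy curve the shortcut fires after two evaluations. A likelihood that stays high near D′ = 1 fails the shortcut test and goes through the full sum instead.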