MegaPRS | DougSpeed.com (original) (raw)

Here we explain how to construct a prediction model using MegaPRS. These instructions assume that you are analysing GWAS summary statistics (if instead you are analysing individual-level data, we recommend you use Elastic-Predict).

Before running MegaPRS, you should ensure that your summary statistics are in the format required by LDAK (see Summary Statistics for details), and that you have an ancestrally-matched Reference Panel.

Please note that the current version of MegaPRS is simpler than the original version (e.g., it is no longer necessary to separately calculate Per-Predictor Heritabilities). Also note that if using SNP-based summary statistics from a human GWAS, you may prefer to use QuickPRS (which is similar to MegaPRS, except that it does not require a reference panel).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

MegaPRS has two steps: the first calculates predictor-predictor correlations, while the second estimates predictor effect sizes. Note that the first step only needs to be performed once per reference panel (whereas the second step must be performed once per set of summary statistics).

Calculate predictor-predictor correlations:

The main argument is --calc-cors .

This requires the options

--bfile/--gen/--sp/--speed or --bgen - to specify the reference panel data files (see File Formats).

--break-points , --window-cm or --window-kb - to specify how to partition the genome into windows. If analysing human data, we recommend downloading the file berisa.txt from Resources, then using --break-points berisa.txt; if analysing non-human data, we recommend using --window-kb 3000.

You can use --keep and/or --remove to restrict to a subset of samples, and --extract and/or --exclude to restrict to a subset of predictors. In particular, you can reduce the computational burden by restricting to predictors for which you have summary statistics. Note that if you plan to analyse summary statistics from multiple association studies, each of which used different predictors, then it is probably easiest to calculate predictor-predictor correlations once using all available predictors, then use these correlations for all analyses.

By default, LDAK will exclude predictors with minor allele frequency below 0.005 (this is because prediction models can be sensitive to low-frequency predictors). To change this threshold, use --min-maf .

The predictor-predictor correlations will be saved in files with prefix ; in particular, .cors.bin contains the predictor-predictor correlations (however, this is a binary file, so not human-readable).

To parallelize this process, you can add --chr to calculate correlations for each chromosome separately, then combine these with the argument --join-cors , using the option --corslist to specify the per-chromosome correlations (see the example below).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Construct the prediction model:

The main argument is --mega-prs .

This requires the options

--summary - to specify the file containing the summary statistics.

--cors - to specify the predictor-predictor correlations.

To specify the heritability model, you should use either --partition-number and --partition-prefix , or --annotation-number and --annotation-prefix (more details in the section below).

If the summary statistics file contains A1 frequencies, LDAK will compute the difference between the frequencies in the summary statistics and those in the correlations, then exclude predictors for which the difference is greater than 0.2 (you can stop this by adding --check-frequencies NO).

LDAK returns an error if it does not find valid summary statistics for all predictors. Ideally, you should work out why this has happened, and either correct the summary statistics file, or use --extract and/or --exclude to restrict to a subset of predictors. However, you can instead override the error by adding --check-sums NO.

You can use --model to specify which version of MegaPRS to run. By default, LDAK will run BayesR-SS (equivalent to adding --model bayesr), which is our recommended version. The other versions are Lasso-SS (--model lasso), Lasso-SS-Sparse (--model lasso-sparse), Ridge-SS (--model ridge), Bolt-SS (--model bolt), BayesR-SS-Shrink (--model bayesr-shrink) and Elastic-SS (--model elastic). Meanwhile, if you use --model mega, MegaPRS will run Lasso-SS, Ridge-SS, BayesR-SS AND Elastic-SS.

In general, we recommend estimating effect sizes using Variational Bayes. The exception is when using an in-sample reference panel, in which case MCMC can perform better. To force LDAK to use MCMC, use --MCMC-solve YES (by default, LDAK will then use four MCMC chains, each with 250 iterations, the first 50 of which are discarded; you can change these values using --MCMC-chains, --MCMC-iterations and --MCMC-burn). Note that if LDAK detects an in-sample reference panel, it will recommend you switch to MCMC; you can override this warning using --check-MCMC NO.

If you wish to compute PRS standard deviations, you should add --PRS-variance YES. Note that this is only possible when estimating effect sizes using MCMC (LDAK will then save effect sizes every tenth MCMC iteration; change this frequency using --MCMC-step).

By default, LDAK will consider multiple sets of model parameters, then use pseudo cross-validation to identify the best-fitting set (to understand how this works, see Pseudo Summaries). Note that if you add --save-all-models YES, then LDAK will additionally report effect sizes for all sets of model parameters. Meanwhile, if you have access to a validation dataset (i.e., an independent set of individuals for whom you have both genetic data and phenotypes), you can instead run MegaPRS Using a Validation Dataset.

The effect sizes are saved in .effects. This file will usually have five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its raw effect (relative to the A1 allele) and is ready to be used for Calculating Scores (i.e., to predict the phenotypes of new samples). Note that if you used --PRS-variance YES, then .effects will contain extra columns corresponding to the MCMC samplings. Meanwhile, if you used --save-all-models YES, then LDAK will also create the file .full.effects, which contains effect sizes corresponding to all sets of model parameters.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Specifying the heritability model:

To understand this section, it will help if you have first read Heritability Model and Technical Details.

If no heritability model is provided, LDAK assumes the one-parameter heritability model

E[h2j] = tau1 [fj(1-fj)](1+alpha)

where fj is its minor allele frequency of predictor j. LDAK will estimate alpha from the data (alternatively, you can specify its value using --power).

To specify a multi-parameter model, you should either use --partition-number and --partition-prefix to provide partitions, or use --annotation-number and --annotation-prefix to provide annotation. LDAK will then use a heritability model of the form

E[h2j] = tau1 a1j [fj(1-fj)](1+alpha) + tau2 a2j [fj(1-fj)](1+alpha) + ... + tauK aKj[fj(1-fj)](1+alpha)

where akj is the value assigned to predictor j in the kth partition / annotation, read from the file j (if this file does not contain predictor j, the value is set to zero). When using partitions, K is the total number of partitions. However, when using annotations, K is the total number of annotations plus one (LDAK always adds a base category, and sets aKj =1 for all predictors).

When using human SNP data, we recommend specifying the Baseline LD Model (version 2.2). You should first download and unzip the SNP annotations from Resources (the result will be files called BaselineLD1, BaselineLD2, etc). Then when running MegaPRS, you should add --annotation-number 86 and --annotation-prefix BaselineLD.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, the phenotype quant.pheno from the Test Datasets, and the files berisa.txt and BaselineLD.zip from Resources.

Although we have individual-level data (i.e., we have genotypes and phenotypes for the same samples), for this example we will pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.

Please note that this example is very simple; a more realistic example is provided in Worked Examples.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

1 - Calculate predictor-predictor correlations.

We run the command

./ldak.out --calc-cors cors --bfile human --break-points berisa.txt

The correlations are saved in files with prefix cors, with some details in cors.cors.root. If analysing very large data, we can parallelise the process by computing correlations separately for each chromosome, then merging

for j in {21..22}; do
./ldak.out --calc-cors cors$j --bfile human --break-points berisa.txt --chr $j
done

rm list.txt; for j in {21..22}; do echo "cors$j" >> list.txt; done
./ldak.out --join-cors cors --corslist list.txt

Note that in these scripts, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2 - Construct the prediction model.

We first run MegaPRS with the default heritability model (i.e., without SNP annotations)

./ldak.out --mega-prs mega --summary quant.summaries --cors cors --allow-ambiguous YES

The estimated effect sizes are saved in mega.effects.

We now run MegaPRS with the Baseline LD Model. We first extract the SNP annotations

unzip BaselineLD.zip

Then we provide the first 86 annotations when running MegaPRS

./ldak.out --mega-prs mega2 --summary quant.summaries --cors cors --allow-ambiguous YES --annotation-number 86 --annotation-prefix BaselineLD

The estimated effect sizes are saved in mega2.effects.