Using a Validation Dataset | DougSpeed.com
Here we explain how to construct a prediction model using MegaPRS when you have access to a validation dataset (i.e., an independent set of individuals for whom you have both genetic data and phenotypes). These instructions assume that you are analysing summary statistics (if instead you are analysing individual-level data, we recommend you use Elastic-Predict). Please note that when running MegaPRS, it is no longer necessary to first compute Per-Predictor Heritabilities.
Please note that if you do not have access to a validation dataset, you should instead read the regular instructions for running MegaPRS.
Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Calculate predictor-predictor correlations:
The main argument is --calc-cors .
This requires the options
--bfile/--gen/--sp/--speed or --bgen - to specify the genetic data files (see File Formats).
--break-points , --window-cm or --window-kb - to specify how to partition the genome into windows. If analysing human data, we recommend downloading the file berisa.txt from Resources, then using --break-points berisa.txt; if analysing non-human data, we recommend using --window-kb 3000.
You can use --keep and/or --remove to restrict to a subset of samples, and --extract and/or --exclude to restrict to a subset of predictors. In particular, you can reduce the computational burden by restricting to predictors for which you have summary statistics. Note that if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easiest to calculate predictor-predictor correlations once using all available predictors, then use these correlations for all analyses.
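For instance, an extract list can be built by pulling predictor names from the first column of a summary-statistics file (this sketch assumes the file has a header row and predictor names in column 1; the file and predictor names shown are illustrative):

```shell
# Make a toy summary-statistics file (header row, predictor names in column 1)
printf 'Predictor A1 A2 n Z\nrs1 A C 300 1.2\nrs2 G T 300 -0.4\n' > summaries.txt

# Keep only the predictor names (skip the header)
awk 'NR>1 {print $1}' summaries.txt > extract.txt

# The list can then be supplied when calculating correlations, e.g.:
# ./ldak.out --calc-cors cors --bfile human --window-kb 3000 --extract extract.txt
```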
By default, LDAK will exclude SNPs with minor allele frequency below 0.01 (this is because prediction models can be sensitive to low-frequency SNPs). To change this threshold, use --min-maf .
The predictor-predictor correlations will be saved in files with prefix <outfile>; in particular, <outfile>.cors.bin contains the predictor-predictor correlations (however, this is a binary file, so not human-readable).
To parallelize this process, you can add --chr to calculate correlations for each chromosome separately, then combine these with the argument --join-cors , using the option --corslist to specify the per-chromosome correlations (see the example below).
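The per-chromosome commands and the final join might look like the following sketch (file names are illustrative, and we assume 22 autosomes and non-human-style windows of 3000kb; adjust to your data):

```shell
# Calculate correlations for each chromosome separately
for chr in {1..22}; do
  ./ldak.out --calc-cors cors$chr --bfile human --chr $chr --window-kb 3000
done

# List the per-chromosome output prefixes, then combine them
for chr in {1..22}; do echo cors$chr; done > corslist.txt
./ldak.out --join-cors cors --corslist corslist.txt
```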
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Construct multiple prediction models:
The main argument is --mega-prs .
This requires the options
--summary - to specify the file containing the summary statistics.
--power - to specify how predictors are scaled. In general, we recommend using --power -0.25.
--cors - to specify the predictor-predictor correlations.
--skip-cv YES - this tells LDAK that you are not using pseudo cross-validation (LDAK will then output multiple models, each trained using 100% of samples).
While it is still possible to use --ind-hers to provide Per-Predictor Heritabilities, this is NO LONGER RECOMMENDED.
Use --model to specify which version of MegaPRS to run. By default, LDAK will run BayesR-SS (equivalent to adding --model bayesr); alternative versions are Lasso-SS (--model lasso), Lasso-SS-Sparse (--model lasso-sparse), Ridge-SS (--model ridge), Bolt-SS (--model bolt), BayesR-SS-Shrink (--model bayesr-shrink) and Elastic-SS (--model elastic). If you use --model mega, MegaPRS will run Lasso-SS, Ridge-SS, BayesR-SS AND Elastic-SS.
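For example, to run Bolt-SS instead of the default BayesR-SS, the command might look like this (a sketch; the summary-statistics file and correlation prefix are illustrative):

```shell
# Construct prediction models using Bolt-SS instead of the default BayesR-SS
./ldak.out --mega-prs bolt --model bolt --summary quant.summaries \
  --power -0.25 --cors cors --skip-cv YES
```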
You can use --keep and/or --remove to restrict to a subset of samples, and --extract and/or --exclude to restrict to a subset of predictors.
By default, LDAK ignores predictors with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.
LDAK returns an error if it does not find valid summary statistics for all predictors. Ideally, you should work out why this has happened, and either correct the summary statistics file, or use --extract and/or --exclude to restrict to a subset of predictors. However, you can instead override the error by adding --check-sums NO.
The prediction models corresponding to different prior distribution parameters are saved in <outfile>.full.effects. This file will have 4+M columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its estimated effect (relative to the A1 allele) for each of the M models.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Use the validation dataset to find the best prediction model
The main argument is --validate .
This requires the options
--scorefile - to provide the prediction models constructed in the previous step.
--bfile/--gen/--sp/--speed or --bgen - to specify the genetic data files in the validation dataset (see File Formats).
--pheno - to specify phenotypes (in PLINK format) for the samples in the validation dataset. If the phenotype file contains more than one phenotype, specify which should be used with --mpheno or --pheno-name (the latter requires that the phenotype file has a header row), or use --mpheno ALL to analyse all phenotypes.
The estimated accuracy of each prediction model (the correlation between predicted and observed phenotypes) is saved in <outfile>.cors. The best prediction model (i.e., the one with highest accuracy) is saved in <outfile>.best.effects. This file has five columns, providing the predictor name, its A1 and A2 alleles, the average number of A1 alleles, then its estimated effect (relative to the A1 allele), and is ready to be used for Calculating Scores (i.e., to predict the phenotypes of new samples).
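A sketch of how the best model might then be used to score new samples (the exact options are described in the Calculating Scores instructions; here the genetic data prefix new, the output prefix newscores, and the use of --power 0 are assumptions, not taken from this page):

```shell
# Predict phenotypes for new samples from the best model's effect sizes
./ldak.out --calc-scores newscores --bfile new \
  --scorefile megaB.best.effects --power 0
```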
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Example:
Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype file quant.pheno from the Test Datasets, and the file berisa.txt from Resources.
For this example we require summary statistics, a reference panel and an independent validation dataset. We make these by running
head -n 300 human.fam > keep
./ldak.out --linear quant.keep --bfile human --pheno quant.pheno --keep keep
./ldak.out --make-bed human.ref --bfile human --keep keep
./ldak.out --make-bed human.val --bfile human --remove keep
Summary statistics are saved in quant.keep.summaries (already in the format required by LDAK). We will use the genetic data files with prefix human.ref as the reference panel, and the files with prefix human.val as the validation dataset (phenotypes for these samples are in quant.pheno).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
1 - Calculate predictor-predictor correlations.
We run the command
./ldak.out --calc-cors cors.ref --bfile human.ref --break-points berisa.txt
The significant correlations are saved in cors.ref.cors.bin, with some details in cors.ref.cors.root.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2 - Construct multiple prediction models.
We run the command
./ldak.out --mega-prs megaA --summary quant.keep.summaries --power -0.25 --cors cors.ref --skip-cv YES --allow-ambiguous YES
Note that we add --allow-ambiguous YES (because the summary statistics were obtained from the genetic data we are using as a reference panel, we know the strands must be consistent). This command will construct 35 prediction models, saved in megaA.full.effects.
Note that if analysing a proper-sized human dataset, we would recommend first downloading and unzipping the file BaselineLD.zip from Resources, then adding --annotation-number 86 and --annotation-prefix BaselineLD to the step above.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
3 - Use the validation dataset to find the best prediction model.
We run the command
./ldak.out --validate megaB --scorefile megaA.full.effects --bfile human.val --pheno quant.pheno
The best prediction model is saved in megaB.best.effects.