Calculate Scores | DougSpeed.com (original) (raw)

Here we explain how to calculate linear combinations of predictor values (i.e., the projections of genetic data onto predictor effect sizes). This feature is most commonly used for creating polygenic risk scores (PRS). However, it can also be used for Quality Control (e.g., if you wish to infer the ancestry of individuals in a dataset, you can first use a global dataset to calculate population axes, then project your dataset onto these axes).

Always read the screen output, which suggests arguments and estimates memory usage.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The main argument is --calc-scores .

This requires the following options

--scorefile - to provide the predictor effect sizes. The score file must have 4+M columns, where M is the number of sets of effect size. The first four columns should provide the name of each predictor, the A1 and A2 alleles and the mean count of the A1 allele; the remaining columns should provide the sets of effect sizes. The file must have a header row, whose first four elements are "Predictor", "A1", "A2" and "Centre". Note that if the predictor means are not known, they can be set to "NA" (in which case LDAK will centre predictors based on the mean A1 allele count in the genetic data). See below for an example scorefile.

–bfile/–gen/–sp/–speed or --bgen - to specify genetic data files (see File Formats)

--power - to specify how predictors are scaled (see below). Usually, the score file contains raw effects, so you should use --power 0.

Note that if the score file was created using MegaPRS with the option --PRS-variance YES, then you should also add --PRS-variance YES when calculating scores.

There are two ways to compute the accuracy of the resulting scores. You can either use --pheno to provide phenotypes for the individuals in the genetic dataset (LDAK will then calculate the correlation between the scores and the phenotypes). Alternatively, you can use --summary to provide summary statistics from an association study (LDAK will then estimate the correlation between the PRS and phenotypic values for the samples in the association study; note that this will use the genetic data as a Reference Panel, in order to estimate predictor-predictor correlations for the samples in the association study).

The scores will be saved in .profile, which will have 4+2M columns. Suppose there are two sets of effect sizes (i.e., M=2), then Columns 5 & 7 of .profile will contain the corresponding scores. By default, Columns 6 & 8 will be blank. However, if you using --PRS-variance YES, then these columns will report standard deviations.

Sometimes you will have two sets of prediction models, for example, one computed using training samples and one computed using all samples. You can then use --scorefile to provide the first set of prediction models, --pheno or --summary to provide phenotypic values or summary statistics, and --final-effects to provide the second set of prediction models. LDAK will save to .effects.best the prediction model from the second set that corresponds to the most accurate model from the first set (for example, if Model 5 from the first set has highest correlation, LDAK will save Model 5 from the second set).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Formula:

Suppose there are m predictors. Let cj and bj be the centre and effect size of Predictor j. By default, the score of Sample i is

Pi = b1 (Xi1-c1) (c1(1-c1/2))^(power/2) + ... + bm (Xim-cm) (cm(1-cm/2))^(power/2)

where Xij is the value of Sample i for Predictor j. When the predictors are SNPs, cj/2 is an estimate of the allele frequency of Predictor j, and therefore (cj(1-cj/2)) is its expected variance assuming Hardy-Weinberg Equilibrium. When power equals one, predictors are divided by their expected standard deviation (i.e., are standardised). Therefore, you should use --power=-1 when the score file contains standardised effect sizes. By contrast, when power equals zero, the formula reduces to

Pi = b1 (Xi1-c1) + ... + bm (Xim-cm)

so that predictors are no longer scaled. Therefore, you should use --power=0 when the score file contains raw effect sizes.

If you add --hwe-stand NO, then LDAK will instead use the formula

Pi = bj (Xi1-c1) v1^(power/2) + ... + bm (Xim-cm) vm^(power/2)

where vj is the variance of Predictor j in the genetic data.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets. We will construct a toy score file using the following command

echo "Predictor A1 A2 Centre Effect1 Effect2
21:14642464 A G 0.88 0.3 -0.1
21:14649798 C A 0.97 -0.2 0.4" > scores.txt

Note that 21:14642464 and 21:14649798 are the names of the first two SNPs in human.bim. We can see they are both common SNPs (their centres are close to 1, meaning that their minor allele frequencies are close to 0.5).

We calculate scores by running

./ldak.out --calc-scores scores --scorefile scores.txt --bfile human --power 0

The file scores.profile contains two profiles, corresponding to the two sets of effect sizes. For the first profile, the score of Sample i is

Pi = 0.3 (Xi1 - 0.88) + -0.2 (Xi2 - 0.97)

while for the second profile, the score of Sample i is

Pi = -0.1 (Xi1 - 0.88) + 0.4 (Xi2 - 0.97)

where Xi1 is the value of Sample i for Predictors 1 (the number of A alleles), while Xi2 is the value of Sample i for Predictor 2 (the number of alleles).

If we instead run

./ldak.out --calc-scores scores --scorefile scores.txt --bfile human --power 0 --pheno quant.pheno

the output is the same as before, except there is now a file called scores.cors that reports the correlation between the scores and the phenotypes provided in quant.pheno.