Summary Statistics | DougSpeed.com (original) (raw)

The page explains the format of summary statistics files, which are used to store results from single-SNP GWAS. Each file should start with a header row, and will usually have five or six columns (note that column names are case-sensitive).

The file must have the following four columns:
Predictor - the name of the SNP.
A1 - the test allele (must be a single character, e.g., A, C, G or T).
A2 - the other allele (must also be a single character, e.g., A, C, G or T).
n - number of samples used when testing the SNP (or effective sample size; see below).

Then there are four choices.

Option 1 - your files has this column:
Z - provides a Gaussian test statistic (this column can instead provide a t-statistic).

Option 2 - your file has these two columns:
BETA - reports the estimated effect size of the test allele (note that for results from logistic regression, this should report the estimated log odds).
SE - reports the standard error of the estimated effect size (note that many genetic software refer to standard errors as standard deviations).

Option 3 - your file has these two columns:
Stat - provides a chi-squared (1 degree of freedom) test statistic.
Direction - indicates whether the effect is positive or negative (with respect to the test allele). This can be an estimated effect size, or simply set to +1 and -1 for SNPs with postitive and negative effects, respectively.

Option 4 - your file has these two columns:
P - provides a p-value.
Direction - indicates whether the effect is positive or negative (with respect to the test allele). This can be an estimated effect size, or simply set to +1 and -1 for SNPs with postitive and negative effects, respectively.

Additionally, we recommend that your file contains the following column
A1Freq - provides the frequency of the A1 allele.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Points to note:

1 - SNP names must be unique, and consistent with those in the Reference Panel (see below).
2 - SNPs with multi-character alleles will be ignored.
3 - We recommend excluding predictors with ambiguous alleles (A&T or C&G) in order to avoid strand errors.
4 - If per-SNP sample sizes are not available, then use the total sample size.
5 - You should only use results from GWAS that used careful Quality Control.
6 - A summary statistics file may contain additional columns (but these will be ignored by LDAK).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Effective sample sizes:

GWAS of binary phenotypes commonly report the effective sample size, calculated using the formula, neff=4n x pA x pU, where n is the actual sample size, while pA and pU are the proportions of cases and controls. This formula is motivated by the fact that power is highest when there are equal numbers of cases and controls (pA=pU=0.5), in which case the effective sample size equals the actual sample size (neff=n). Note that for meta-analysis, the effective sample size is calculated for each cohort separately, then summed across cohorts.

The effective sample size is most useful when estimating the liability SNP Heritability of a binary phenotype using results from a meta-analysis (in which case it is necessary to specify the proportion of cases in the GWAS using the option --ascertainment ). The standard approach is to use actual sample sizes and specify the overall ascertainment. However, Grotzinger et al. showed that this will result in substantial under-estimation of liability SNP heritability when the proportion of cases varies greatly across cohorts. Therefore, we instead recommend using effective sample sizes and --ascertainment 0.5 (while it might seem to provide a "fake ascertainment", this reflects that the actual ascertainment is taken into account when computing the per-cohort effective sample sizes).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

SNP names:

Most GWAS results use rsIDs. However, some results instead use generic names (or a combination of rsIDs and generic names). For example, rs4747841 corresponds to a SNP located on Chromosome 10 at basepair 9960129, whose most common alleles are A and G. Therefore, this SNP could be given a generic name such as 10:9960129 or 10:9960129_A_G.

We generally find it most convenient to use rsIDs, because these are largely invariant (it is possible for an rsID to change, but this is very rare). By contrast, a problem with using generic names is that there is no fixed format, and the SNP locations depend on the genome assembly. For example, according to the Chr37/hg19 assembly, SNP rs4747841 is located on Chromosome 10 at basepair 9960129. However, according to the Chr38/hg38 assembly, it is located on Chromosome 10 at basepair 9918166.

Most importantly, the SNP names used in the summary statistics file must match those used in the reference panel. For example, if your reference panel uses rsIDs, but your summary statistics file using generic names, you will have to either update the names in the reference panel or those in the summary statistics file.

Note that the LiftOver Tool can be used to convert generic SNP names between assemblies (e.g., if you are provided with generic names based on the Chr37/hg19 assembly, but want generic names based on the Chr38/hg38 assembly).
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Example:

Here we format results from the 2014 meta-analysis of height by the GIANT Consortium and from the 2018 analysis of neuroticism by Nagel et al. These scripts exclude SNPs with matching or multi-character alleles, whose minor allele frequency is below 0.01, and with relatively low sample sizes (for the neuroticism results we additionally exclude SNPs with information score below 0.95). The scripts use the tool AWK, which is very efficient at processing large files and is usually installed by default on any UNIX operating system. You can read more about AWK here.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Download the GWAS results for height, which come from an analysis of up to 253k individuals.

wget https://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT\_HEIGHT\_Wood\_et\_al\_2014\_publicrelease\_HapMapCeuFreq.txt.gz

We can view the top of the file using the UNIX function less (note that you can use the arrow keys to move within the file; once you have finished viewing the file, press the "q" button to exit)

less -S GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz

We see that Column 1 contains SNP names, while Columns 2 and 3 report the allele pairs (by referring to the release notes, we find out that "Allele1" refers to the test allele). Column 4 provides the test allele frequencies, while Columns 5 and 6 report effect sizes and SEs. Lastly, Column 8 reports per-SNP sample sizes.

The following command uses gunzip, AWK and grep to extract SNP names, alleles, frequencies and sample sizes for SNPs with non-matching, single-character alleles, MAF above 0.01 and sample size above 200k. It also computes Z-test statistics (by dividing the reported effect size by the reported SE), and ensures the columns of the output file have the correct names, excluding rows that contain NAs. Note that we picked the threshold 200k, because this corresponds to about 80% of the total sample size (but other thresholds are equally valid).

gunzip -c GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz | awk '(NR==1){print "Predictor A1 A2 Z n A1Freq"}(NR>1){snp=$1;a1=$2;a2=$3;freq=$4;maf=freq;if(freq>0.5){maf=1-freq};effect=$5;se=$6;z=effect/se;n=$8; if(a1!=a2 && (a1=="A"||a1=="C"||a1=="G"||a1=="T") && (a2=="A"||a2=="C"||a2=="G"||a2=="T") && maf>0.01 && n>200000){print snp, a1, a2, z, n, freq}}' - | grep -v NA > height.txt

Check the top of the summary statistics file looks correct

head -n 5 height.txt
Predictor A1 A2 Z n A1Freq
rs4747841 A G -0.37931 253213 0.551
rs4749917 T C 0.37931 253213 0.436
rs737656 A G -2.06667 253116 0.367
rs737657 A G -2.06667 252156 0.358

It is sensible to check whether there are duplicate SNP names. The following command uses the UNIX functions AWK, sort and uniq to identify which names appear more than once.

awk < height.txt '{print $1}' | sort | uniq -d | head

There is no output, indicating that there are no duplicates.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Download the GWAS results for neuroticism by clicking the fifth link on this page (you can reach this page yourself by first visiting the CNCR Website, then clicking the links to download GWAS summary statistics). Note that the file you want is called sumstats_neuroticism_ctg_format.txt.gz and has size 332Mb.

We can view the top of the file using the command

less -S sumstats_neuroticism_ctg_format.txt.gz

We see that Column 1 contains SNP names, while Columns 5 and 6 report the allele pairs (by referring to the release notes, we find out that "Allele1" refers to the test allele). Column 7, 8, 9, 11 and 12 provide, respectively, test allele frequencies, minor allele frequencies, a Z-test statistic, per-SNP sample sizes and information scores. By sorting the values in Column 11, we see the maximum per-SNP sample size is 390k.

The following command uses gunzip and AWK to extract SNP names, alleles, frequencies, Z-test statistics and sample sizes for SNPs with non-matching, single-character alleles, MAF above 0.01, sample size above 350k, and information score above 0.95. Note that we picked the thresholds 350k and 0.95 because they seem reasonable, but other thresholds are equally valid.

gunzip -c sumstats_neuroticism_ctg_format.txt.gz | awk '(NR==1){print "Predictor A1 A2 Z n A1Freq"}(NR>1){snp=$1;a1=$5;a2=$6;freq=$7;maf=$8;z=$9;n=$11;info=$12; if(a1!=a2 && (a1=="A"||a1=="C"||a1=="G"||a1=="T") && (a2=="A"||a2=="C"||a2=="G"||a2=="T") && maf>0.01 && n>350000 && info>0.95){print snp, a1, a2, z, n, freq}}' - > neur.txt

Check the top of the summary statistics file looks correct

head -n 5 neur.txt
Predictor A1 A2 Z n A1Freq
rs146277091 A G 1.06 370996 0.0367696
rs3094315 A G -0.263 371912 0.829189
rs3131972 A G 0.137 372903 0.173757
rs3115860 A C 0.057 370472 0.860653

Check for duplicate SNP names

awk < neur.txt '{print $1}' | sort | uniq -d | head

Again, there is no output, indicating that there are no duplicates.