Genetic association tests using SAIGE (original) (raw)

Overview
Frequently asked questions
Installing SAIGE
Flowchart
- SAIGE
- SAIGE-GENE
Running SAIGE and SAIGE-GENE
- Single variant association tests
  * Step 1: fitting the null logistic/linear mixed model
  * Input files
  * Output files
  * Step 2: performing the single-variant association tests
  * Input files
  * Output file
  * An example with a signal
- Region- or gene-based association tests (SAIGE-GENE)
  * Step 0: creating a sparse GRM
  * Input file
  * Output files
  * Step 1: fitting the null logistic/linear mixed model
  * Input files
  * Output files
  * Step 2: performing the region- or gene-based association tests
  * Input files
  * Output files

Overview

The most current version of SAIGE is version 0.38

This tutorial will walk you through

Installing SAIGE
Running SAIGE for

Single-variant association tests
Gene-based tests
Conditional analysis

Workflow of these tests are almost the same. They use the same step 1 and step 2 functions of SAIGE. We are going to use the example data in the extdata folder in the github page.

Frequently asked questions

Which version of SAIGE/SAIGE-GENE should I use?

We are continuously adding more options to meet the needs of data analysis by users and fixing bugs, so we recommend always trying the newest version in the master branch. The docker hub will be always updated when there is new version.

What is the easiest way to install SAIGE/SAIGE-GENE?

The easiest way to use SAIGE/SAIGE-GENE is using the docker image (approach 4). If docker cannot be used in your machine, please try the installation using a conda env (approach 5)https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE/_edit#installing-saige

Can SAIGE/SAIGE-GENE handle samples from different ancestry?

We haven't throughly tested the performance of SAIGE on multi-ethnic samples. SAIGE uses a random effect to account for sample relatedness/population sub-stratification through the genetic relationship matrix. For multi-ancestry samples (such as 1000 genome) you would need to include PCs to adjust for population stratification. Using both (mixed effect model + PC as covariates) would be the best approach.

For Step 1, how many markers need to be included in the plink file?

To fit the null logistic mixed model using fitNULLGLMM (step 1), ideally the number of independent markers for constructing GRM is larger than the sample size. In our previous GWAS for the UK Biobank, we used 93k markers that were pruned by UKB for kinship estimation while the sample size is ~400k (white British samples). The results look fine. We then compared the P-values from the previous analysis with 93k markers for GRM and with 340k markers for GRM for four phenotypes with different prevalences as shown in the Supplementary Figure 17 in the SAIGE paperhttps://www.ncbi.nlm.nih.gov/pubmed/30104761. We see for phenotypes such as CAD, the inflation of p-values was further corrected when 340k markers were used for GRM.
The option minMAFforGRM (default, 0.01) is used for set the minimum MAF for markers used in the plink file to estimate the GRM for step 1. Only markers with MAF >= minMAFforGRM will be used to estimate the GRM.
For gene-based tests, the same plink file should be used for both Step 0 and Step 1. This means, Step 0 creates a sparse version of the "full" GRM that is used in Step 1. The problem is that for gene-based tests, variance ratios are estimated for rare variants and common variants and for each MAC category that is specified, 30 markers will be randomly selected first and increase by 10 until the variance ratio estimation is stable. Therefore, rare variants need to be included in the plink file, although they are not used for GRM, because only markers with MAF >= minMAFforGRM will be used for GRM and the default value is 0.01. In the example of the gene-based tests for UK Biobank, you may include all rare variants with MAF < 1% and those 93k markers in the plink file for Step 0 and Step 1.

Does LOCO=TRUE work?

Version 0.36.5 or later works for LOCO=TRUE

Does SAIGE/SAIGE-GENE requires that samples in Step 1 have non-missing dosages in Step 2?

Yes. Current version of SAIGE/SAIGE-GENE requires that all samples used in Step 1 have to be in the bgen/vcf file used in Step 2.

For Step 2, how does SAIGE/SAIGE-GENE handle missing dosages/genotypes?

To drop samples with missing genotypes/dosages, --IsDropMissingDosages=TRUE, if FALSE, missing genotypes/dosages will be mean imputed.

For Step 2, what options can be used to filter out variants based on the MAF and imputation scores?

minMAF and minMAC (default 0.5) can be used to filter variants based on the MAF. The higher threshold between minMAC and minMAF will be used.
minInfo can be used to filter variants based on the imputation score. ** Please note that the saddlepoint approximation used in SAIGE for single-variant association tests does not work for variants with MAC <= 3 (usually gives extremely significant P-values). For those extremely rare variants, please use SAIGE-GENE to tests them. **

How does SAIGE/SAIGE-GENE handle multi-allelic variants in VCF files?

SAIGE/SAIGE-GENE uses the savvy library to handle VCF files. For the multi-allelic variants, genotypes/dosages for all alt alleles in one line in VCF files is not supported. The best solution is to split the variants by alleles.

How to extract the "heritability" estimated by SAIGE/SAIGE-GENE?

Tau is a vector with 2 elements. The first element is for the variance component parameter for the error term and the second one is for the GRM (genetic relationship matrix). Tau can be extracted from the null model of SAIGE results by R load("model.rda"); tau = modglmm$theta For quantitative traits from the linear mixed model, h2 = tau[2]/(tau[1]+tau[2]) For binary traits from the logistic mixed model (tau[1] is always 1), h2_liability = tau[2]/(tau[2]+pi^2/3) . But note that the heritability is the point estimate for proportion of variance of the phenotype explained by the GRM, which is not equal to the heritability explained using LDSC. Also, we have noticed that the h2 estimate for binary traits by SAIGE is underestimated and the penalized quasi-likelihood used in SAIGE is known to be biased for heritability estimation but it works well for adjusting for sample-relatedness.

List of dependencies:

R-3.6.1, gcc >= 5.4.0, cmake 3.14.1, cget
R packages: "R.utils", "Rcpp", "RcppParallel", "RcppArmadillo", "data.table", "RcppEigen", "Matrix", "methods", "BH", "optparse", "SPAtest", "SKAT","MetaSKAT"
/extdata/install_packages.R can be used to install the R packages

Installing SAIGE

Install SAIGE using the conda environment

Create a conda environment using (conda environment file) Here is a link to download the conda environment file
After downloading environment-RSAIGE.yml, run following command

  conda env create -f environment-RSAIGE.yml

Activate the conda environment RSAIGE

  conda activate RSAIGE  
  FLAGPATH=`which python | sed 's|/bin/python$||'`  
  export LDFLAGS="-L${FLAGPATH}/lib"  
  export CPPFLAGS="-I${FLAGPATH}/include"

Please make sure to set up the LDFLAGS and CPPFLAGS using export (the last two command lines), so libraries can be linked correctly when the SAIGE source code is compiled. Note: Here are the steps to create the conda environment file

Open R, run following script to install the MetaSKAT R library.

  devtools::install_github("leeshawn/MetaSKAT")

Install SAIGE from the source code.
Method 1:

  src_branch=master  
  repo_src_url=https://github.com/weizhouUMICH/SAIGE  
  git clone --depth 1 -b <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>s</mi><mi>r</mi><msub><mi>c</mi><mi>b</mi></msub><mi>r</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>h</mi></mrow><annotation encoding="application/x-tex">src_branch </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">sr</span><span class="mord"><span class="mord mathnormal">c</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">b</span></span></span></span><span class="vlist-s"></span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">an</span><span class="mord mathnormal">c</span><span class="mord mathnormal">h</span></span></span></span>repo_src_url  
  R CMD INSTALL --library=path_to_final_SAIGE_library SAIGE

When call SAIGE in R, set lib.loc=path_to_final_SAIGE_library

  library(SAIGE, lib.loc=path_to_final_SAIGE_library)

Method 2:
Open R. Run

  devtools::install_github("weizhouUMICH/SAIGE")

Run SAIGE using a docker image

Thanks to Juha Karjalainen for sharing the Dockerfile. The docker image can be pulled

docker pull wzhou88/saige:0.38

Dockerfile for creating your own docker image for SAIGE

Functions can be called

step1_fitNULLGLMM.R --help
step2_SPAtests.R --help
createSparseGRM.R --help

Previous versions of SAIGE installation files are here https://www.dropbox.com/sh/zmlu1llpxd66pjl/AADFqdssvOBjbWZch6Q9zYNaa?dl=0

Flowchart

SAIGE and SAIGE-GENE can work for both binary and quantitative traits.

SAIGE

2 steps in SAIGE

SAIGE-GENE

2 steps in SAIGE-GENE

Running SAIGE and SAIGE-GENE

Scripts are in the folder extdata/ and command lines are in the file cmd.sh

cd /extdata/
less -S cmd.sh

Input files are in the folder extdata/input
Output files are in the folder extdata/output

To obtain help information of the scripts that call functions in the SAIGE library

Rscript createSparseGRM.R --help
Rscript step1_fitNULLGLMM.R --help
Rscript step2_SPAtests.R --help

To obtain help information of functions in the SAIGE library

#open R
R
library(SAIGE)
?createSparseGRM
?fitNULLGLMM
?SPAGMMATtest

Single variant association tests

For single-variant association tests, sparse GRM and categorical variance ratios are NOT needed. Randomly selected markers with MAC >= 20 are used to estimate the variance ratio

Step 1: fitting the null logistic/linear mixed model

For binary traits, a null logistic mixed model will be fitted (--traitType=binary).
For quantitative traits, a null linear mixed model will be fitted (--traitType=quantitative) and needs to be inverse normalized (--invNormalize=TRUE)

#check the help info for step 1
Rscript step1_fitNULLGLMM.R --help

#For Binary traits:
Rscript step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_binary \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=binary        \
        --outputPrefix=./output/example_binary \
        --nThreads=4 \
        --LOCO=FALSE \
        --IsOverwriteVarianceRatioFile ## v0.38. Whether to overwrite the variance ratio file if the file already exists


#For Quantitative traits, if not normally distributed, inverse normalization needs to be specified to be TRUE --invNormalize=TRUE
Rscript step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_quantitative \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=quantitative       \
    --invNormalize=TRUE	\
        --outputPrefix=./output/example_quantitative \
        --nThreads=4 \
        --LOCO=FALSE	\
    --tauInit=1,0

Input files

Genotype file for constructing the genetic relationship matrix (GRM)
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam

./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam

Phenotype file (contains non-genetic covariates if any, such as gender and age) The file can be either space or tab-delimited with a header. It is required that the file contains one column for sample IDs and one column for the phenotype. It may contain columns for non-genetic covariates.

Note: Current version of SAIGE does not support categorical covariates that have more than two categories

less -S ./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt

Output files

model file

./output/example_quantitative.rda

#open R
R
#load the model file in R
load("./output/example_quantitative.rda")
names(modglmm)
modglmm$theta

#theta: a vector of length 2. The first element is the dispersion parameter estimate and the second one is the variance component parameter estimate
#coefficients: fixed effect parameter estimates
#linear.predictors: a vector of length N (N is the sample size) containing linear predictors
#fitted.values: a vector of length N (N is the sample size) containing fitted mean values on the original scale
#Y: a vector of length N (N is the sample size) containing final working vector
#residuals: a vector of length N (N is the sample size) containing residuals on the original scale
#sampleID: a vector of length N (N is the sample size) containing sample IDs used to fit the null model

association result file for the subset of randomly selected markers

less -S ./output/example_quantitative_30markers.SAIGE.results.txt

variance ratio file

less -S ./output/example_quantitative.varianceRatio.txt

Step 2: performing single-variant association tests

For binary traits, saddle point approximation is used to account for case-control imbalance.
File formats for dosages/genotypes of genetic variants to be tested can be used: VCF, BGEN, SAV
Conditional analysis based summary stats can be performed (--condition) can be performed in Step 2 with dosage/genotype input file formats VCF, BGEN and SAV.
To query and test a subset of markers
- both variant IDs and range of chromosome positions can be specified for BGEN input (--idstoExcludeFile, --idstoIncludeFile, --rangestoExcludeFile, --rangestoIncludeFile)
- range chromosome positions can be specified for VCF/SAV input (--chrom, --start, --end).
For VCF/SAV input, --chrom MUST be specified and the string needs to be exactly the same as in the VCF/SAV, such as "01" or "chr1".
For VCF/SAV input, --vcfField=DS to test dosages and --vcfField=GT to test genotypes
To drop samples with missing genotypes/dosages, --IsDropMissingDosages=TRUE, if FALSE, missing genotypes/dosages will be mean imputed. **NOTE: this option has not been extensively tested yet. **
sampleFile is used specify a file with sample IDs for bgen file. Please DO NOT include a header in the file. SAIGE versions >= 0.38 do not need sampleFile if VCF files are used

#check the help info for step 2
Rscript step2_SPAtests.R --help

#Perform single-variant association tests
#for binary traits, 
* --IsOutputAFinCaseCtrl=TRUE can be specified to output allele frequencies in cases and controls
* --IsOutputNinCaseCtrl=TRUE can be specified to output sample sizes of cases and controls for each marker
* --IsOutputHetHomCountsinCaseCtrl can be specified to output heterozygous and homozygous counts in cases and controls

Rscript step2_SPAtests.R	\
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

##drop samples with missing genotypes/dosages
##--IsDropMissingDosages=TRUE
Rscript step2_SPAtests.R        \
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE	\
    --IsDropMissingDosages=TRUE    


#conditional analysis
## --condition = Genetic marker ids (**chr:pos_ref/alt for vcf/sav or marker id for bgen**) separated by comma. e.g.chr3:101651171_C/T,chr3:101651186_G/A

Rscript step2_SPAtests.R        \
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE     \
    --IsDropMissingDosages=TRUE	\
    --condition=1:4_A/C

#plain text input for dosages (plain text input for dosages is not inputed since 0.36.1)
Rscript step2_SPAtests.R \
        --dosageFile=./input/dosage_10markers.txt \
        --dosageFileNrowSkip=1 \
        --dosageFileNcolSkip=6 \
        --dosageFilecolnamesSkip=CHR,SNP,CM,POS,EFFECT_ALLELE,ALT_ALLELE \
        --minMAF=0.0001 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.plainDosage.SAIGE.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#bgen for dosages
Rscript step2_SPAtests.R \
        --bgenFile=./input/genotype_100markers.bgen \
        --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/samplefileforbgen_10000samples.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.bgen.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#VCF containing genotypes (--vcfField=GT)
Rscript step2_SPAtests.R \
        --vcfFile=./input/genotype_10markers.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.vcf.genotype.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#VCF containing dosages (--vcfField=DS)
Rscript step2_SPAtests.R \
        --vcfFile=./input/dosage_10markers.vcf.gz \
        --vcfFileIndex=./input/dosage_10markers.vcf.gz.tbi \
        --vcfField=DS \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.vcf.dosage.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

# SAV for dosages
Rscript step2_SPAtests.R        \
        --savFile=./input/dosage_10markers.sav  \
        --savFileIndex=./input/dosage_10markers.sav.s1r \
        --vcfField=DS \
        --minMAF=0.0001 \
        --minMAC=1 \
        --chrom=1 \
        --sampleFile=./input/samplefileforbgen_10000samples.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.sav.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

Input files

Dosage file SAIGE supports different formats for dosages: VCF, BCF, BGEN and SAV.\

BGEN

genotype_100markers.bgen
genotype_100markers.bgen.bgi

VCF containing genotypes

zcat genotype_10markers.vcf.gz | less -S
genotype_10markers.vcf.gz.tbi

VCF containing dosages

zcat dosage_10markers.vcf.gz | less -S
dosage_10markers.vcf.gz.tbi

dosage_10markers.sav
dosage_10markers.sav.s1r

Sample file
This file contains one column for sample IDs corresponding to the sample order in the dosage file. **No header is included.**The option was originally for BGEN file that does not contain sample information.

less -S sampleIDindosage.txt

Model file from step 1
Variance ratio file from step 1

./output/example.varianceRatio.txt

Output file

A file with association test results

less -S ./output/example.SAIGE.vcf.dosage.txt

NOTE:

Association results are with regard to Allele2.
For binary traits, the header of the output file:CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls
For quantitative traits, the header of the output file: CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value varT varTstar

CHR: chromosome
POS: genome position 
SNPID: variant ID
Allele1: allele 1
Allele2: allele 2
AC_Allele2: allele count of allele 2
AF_Allele2: allele frequency of allele 2
imputationInfo: imputation info. If not in dosage/genotype input file, will output 1
N: sample size
BETA: effect size of allele 2
SE: standard error of BETA
Tstat: score statistic of allele 2
p.value: p value (with SPA applied for binary traits)
p.value.NA: p value when SPA is not applied (only for binary traits)
Is.SPA.converge: whether SPA is converged or not (only for binary traits)
varT: estimated variance of score statistic with sample relatedness incorporated
varTstar: variance of score statistic without sample relatedness incorporated
AF.Cases: allele frequency of allele 2 in cases (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
AF.Controls: allele frequency of allele 2 in controls (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
Tstat_cond, p.value_cond, varT_cond, BETA_cond, SE_cond: summary stats for conditional analysis

An example with a signal

p.value in Step 2 for the marker is ~4.16 x 10^-7

Rscript step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/Prev_0.1_nfam_1000.pheno_positive_pheno.txt \
        --phenoCol=y \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=binary        \
        --outputPrefix=./output/example_binary_positive_signal \
        --nThreads=4    \
        --LOCO=FALSE    \
        --minMAFforGRM=0.01

Rscript step2_SPAtests.R        \
        --vcfFile=./input/nfam_1000_MAF0.2_nMarker1_nseed200.vcf.gz \
        --vcfFileIndex=./input/nfam_1000_MAF0.2_nMarker1_nseed200.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --GMMATmodelFile=./output/example_binary_positive_signal.rda \
        --varianceRatioFile=./output/example_binary_positive_signal.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary_positive_signal.assoc.step2.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE     \

Region- or gene-based association tests (SAIGE-GENE)

Step 0: creating a sparse GRM

Note: This step is only needed for region- and gene-based tests (SAIGE-GENE)
The sparse GRM only needs to be created once for each data set, e.g. a biobank, and can be used for all different phenotypes as long as all tested samples are included in it.

Rscript createSparseGRM.R	\
    --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
    --nThreads=4  \
    --outputPrefix=./output/sparseGRM	\
    --numRandomMarkerforSparseKin=2000	\
    --relatednessCutoff=0.125

Input file

Genotype file for constructing the genetic relationship matrix
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam

nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam

Output files

a file storing the sparse GRM

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx

The sparse GRM file can be opened using the readMM function in the R library Matrix

a file storing IDs of the samples in the sparse GRM

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

Step 1: fitting the null logistic/linear mixed model

IsSparseKin=TRUE must be specified for SAIGE-GENE
step 1 model result from the single-variant assoc test can be re-used, except that for gene-based tests, variance ratios for multiple MAC categories and a sparse GRM need to be used (IsSparseKin=TRUE).
with IsSparseKin=TRUE, no sparseGRMFile and sparseGRMSampleIDFile are specified (step 0 has not been done beforehand), a sparse GRM will be created based on the relatednessCutoff in this step.
with IsSparseKin=TRUE, sparseGRMFile and sparseGRMSampleIDFile can be used to specify a pre-calcuated sparse GRM and the sample ids for the sparse GRM (output from step 0). Tested samples would be a subset of samples in the pre-calcuated GRM.
To activate the variance ratio estimation based multiple MAC categories, --isCateVarianceRatio=TRUE
cateVarRatioMinMACVecExclude and cateVarRatioMaxMACVecInclude are used to specify the MAC categories

#by default 
 --cateVarRatioMinMACVecExclude=0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5
 --cateVarRatioMaxMACVecInclude=1.5,2.5,3.5,4.5,5.5,10.5,20.5
corresponding to
0.5 < MAC <=  1.5
1.5 < MAC <=  2.5
2.5 < MAC <=  3.5
3.5 < MAC <=  4.5
4.5 < MAC <=  5.5
5.5 < MAC <=  10.5
10.5 < MAC <=  20.5
20.5 < MAC

Rscript step1_fitNULLGLMM.R     \
    --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_quantitative \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=quantitative       \
        --invNormalize=TRUE     \
        --outputPrefix=./output/example_quantitative \
    --outputPrefix_varRatio=./output/example_quantitative_cate	\
    --sparseGRMFile=./output/example_binary_cate.varianceRatio.txt.sparseGRM.mtx    \
        --sparseGRMSampleIDFile=./output/example_binary.varianceRatio.txt.sparseGRM.mtx.sample  \
        --nThreads=4 \
        --LOCO=FALSE	\
    --skipModelFitting=FALSE \
        --IsSparseKin=TRUE      \
        --isCateVarianceRatio=TRUE

Input files

(same as step 1 for single-variant assoc tests and step 0)
Genotype file for constructing the genetic relationship matrix in the plink format
a file storing the sparse GRM (optional, output by step 0)

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx

The sparse GRM file can be opened using the readMM function in the R library Matrix

a file storing IDs of the samples in the sparse GRM (optional, output by step 0)

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

Output files

same as the step 1 output by SAIGE for single-variant association tests

model file
association result file for the subset of randomly selected markers
variance ratio file

specific to SAIGE-GENE

sparse Sigma file

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx

NOTE: the file contains the sparse Sigma matrix, which is ** NOT ** the sparse GRM. The sparse Sigma matrix is computed based on the sparse GRM and the results of step 1.

Step 2: performing the region- or gene-based association tests

The command line is the same as the step 2 for single-variant assoc tests, except that a group file is specified (--groupFile)
Each line is for one gene/set of variants. The first element is for gene/set name.
The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.The marker ids in the group file for vcf/sav need to be sorted by chr and pos.
IsSingleVarinGroupTest=TRUE is to perform single-variant association tests as well for markers included in the gene-based tests
--IsOutputBETASEinBurdenTest is to output effect sizes for burden tests (this option is still under development)
Same as the single-variant association tests, conditional analysis based summary stats can be performed (--condition) can be performed in step 2 with dosage/genotype input file formats VCF, BGEN and SAV.

Rscript step2_SPAtests.R \
        --vcfFile=./input/genotype_10markers.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
    --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_quantitative.rda \
        --varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_quantitative.SAIGE.gene.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest_simulation.txt    \
        --sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsSingleVarinGroupTest=TRUE


##another example, conditional analysis for gene-based tests
Rscript step2_SPAtests.R \
        --vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
        --vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
        --vcfField=DS \
        --chrom=chr1 \
        --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_quantitative.rda \
        --varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
    --SAIGEOutputFile=./output/example_quantitative.SAIGE.gene_conditional.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest.txt    \
        --sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsOutputAFinCaseCtrl=TRUE     \
        --IsSingleVarinGroupTest=TRUE   \
    --condition=chr1:32302_A/C 


##Specify customized weights for markers in the gene- or region-based tests
* weightsIncludeinGroupFile logical. Whether to specify customized weight for makers in gene- or region-based tests. If TRUE, weights are included in the group file. For vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab. By default, FALSE
* weights_for_G2_cond vector of float. weights for conditioning markers for gene- or region-based tests. The length equals to the number of conditioning markers, delimited by comma.

Rscript step2_SPAtests.R \
        --vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
        --vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
        --vcfField=DS \
        --chrom=chr1 \
        --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary_cate_v2.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.gene_conditional_withspecifiedWeights.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest_withWeights.txt    \
        --sparseSigmaFile=./output/example_binary_cate_v2.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsOutputAFinCaseCtrl=TRUE     \
        --IsSingleVarinGroupTest=TRUE   \
        --IsOutputPvalueNAinGroupTestforBinary=TRUE     \
        --IsAccountforCasecontrolImbalanceinGroupTest=TRUE      \
        --weightsIncludeinGroupFile=TRUE        \
        --weights_for_G2_cond=3,1       \
        --condition=chr1:32302_A/C,chr1:32304_A/C

Input files

same as step 1 input by SAIGE for single-variant association tests

Dosage file
SAIGE-GENE can take in VCF, BCF, BGEN, and SAV.
Sample file
This file contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included.

less -S sampleIDindosage.txt

Model file from step 1
Variance ratio file from step 1

./output/example.varianceRatio.txt

specific to SAIGE-GENE

Group file
Each line is for one gene/set of variants. The first element is for gene/set name.
The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.
** Note that the order of variants in the group file needs to be consistent to variants' order in the dosage file ** To specify customized weights for variants, set weightsIncludeinGroupFile=TRUE and in the group file, for vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab.**
** If weightsIncludeinGroupFile=TRUE, in conditional analysis, weights for conditioning markers need to be specified using the argument weights_for_G2_cond, e.g. weights_for_G2_cond=1,3,4. The weights in --weights_for_G2_cond are for markers in --condition, respectively.**

less -S ./input/groupFile_geneBasedtest.txt

sparse Sigma file

sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx

Output files

A file with region- or gene-based association test results

less -S ./output/example_quantitative.SAIGE.gene_conditional.txt

NOTE:

The header of the output file:Gene Pvalue Pvalue_cond Nmarker_MACCate_1 Nmarker_MACCate_2 Nmarker_MACCate_3 Nmarker_MACCate_4 Nmarker_MACCate_5 Nmarker_MACCate_6 Nmarker_MACCate_7 Nmarker_MACCate_8 markerIDs markerAFs Pvalue_Burden Pvalue_SKAT Pvalue_Burden_cond Pvalue_SKAT_cond

Gene : Gene name (as provided in the group file)
Pvalue: p-value of SKAT-O test
Pvalue_cond: conditional p-value of SKAT-O test (if --condition= is specified)
Nmarker_MACCate_n: number of markers in the gene falling in the nth MAC category (The MAC category is corresponding to the one used for the variance ratio estimation). For example, by default, Nmarker_MACCate_1 is the number of singletons
markerIDs: IDs for variants included in the test
markerAFs: allele frequencies for variants included in the test
Pvalue_Burden: p-value of Burden test
Pvalue_SKAT: p-value of SKAT test
Pvalue_Burden_cond: conditional p-value of Burden test (if --condition= is specified)
Pvalue_SKAT_cond: conditional p-value of SKAT test (if --condition= is specified)
Pvalue_SKAT.NA: p-values of SKAT test without accounting for unbalanced case-control ratios for binary traits
Pvalue.NA: p-values of SKAT-O test without accounting for unbalanced case-control ratios for binary traits
Pvalue_Burden.NA: p-values of Burden test without accounting for unbalanced case-control ratios for binary traits

A file with single-variant association test results for genetic variants included in the gene-based tests (if --IsSingleVarinGroupTest=TRUE)

less -S ./output/example_quantitative.SAIGE.gene_conditional.txt_single

Same as the output by SAIGE for single-variant association tests