Genetic association tests using SAIGE (original) (raw)

Overview

The most current version of SAIGE is version 0.38

This tutorial will walk you through

  1. Installing SAIGE
  2. Running SAIGE for

Workflow of these tests are almost the same. They use the same step 1 and step 2 functions of SAIGE. We are going to use the example data in the extdata folder in the github page.

Frequently asked questions

  1. Which version of SAIGE/SAIGE-GENE should I use?
  1. What is the easiest way to install SAIGE/SAIGE-GENE?
  1. Can SAIGE/SAIGE-GENE handle samples from different ancestry?
  1. For Step 1, how many markers need to be included in the plink file?
  1. Does LOCO=TRUE work?
  1. Does SAIGE/SAIGE-GENE requires that samples in Step 1 have non-missing dosages in Step 2?
  1. For Step 2, how does SAIGE/SAIGE-GENE handle missing dosages/genotypes?
  1. For Step 2, what options can be used to filter out variants based on the MAF and imputation scores?
  1. How does SAIGE/SAIGE-GENE handle multi-allelic variants in VCF files?

SAIGE/SAIGE-GENE uses the savvy library to handle VCF files. For the multi-allelic variants, genotypes/dosages for all alt alleles in one line in VCF files is not supported. The best solution is to split the variants by alleles.

  1. How to extract the "heritability" estimated by SAIGE/SAIGE-GENE?

Tau is a vector with 2 elements. The first element is for the variance component parameter for the error term and the second one is for the GRM (genetic relationship matrix). Tau can be extracted from the null model of SAIGE results by R load("model.rda"); tau = modglmm$theta For quantitative traits from the linear mixed model, h2 = tau[2]/(tau[1]+tau[2]) For binary traits from the logistic mixed model (tau[1] is always 1), h2_liability = tau[2]/(tau[2]+pi^2/3) . But note that the heritability is the point estimate for proportion of variance of the phenotype explained by the GRM, which is not equal to the heritability explained using LDSC. Also, we have noticed that the h2 estimate for binary traits by SAIGE is underestimated and the penalized quasi-likelihood used in SAIGE is known to be biased for heritability estimation but it works well for adjusting for sample-relatedness.

List of dependencies:

Installing SAIGE

Install SAIGE using the conda environment

  1. Create a conda environment using (conda environment file) Here is a link to download the conda environment file
    After downloading environment-RSAIGE.yml, run following command
  conda env create -f environment-RSAIGE.yml  
  1. Activate the conda environment RSAIGE
  conda activate RSAIGE  
  FLAGPATH=`which python | sed 's|/bin/python$||'`  
  export LDFLAGS="-L${FLAGPATH}/lib"  
  export CPPFLAGS="-I${FLAGPATH}/include"  

Please make sure to set up the LDFLAGS and CPPFLAGS using export (the last two command lines), so libraries can be linked correctly when the SAIGE source code is compiled. Note: Here are the steps to create the conda environment file

  1. Open R, run following script to install the MetaSKAT R library.
  devtools::install_github("leeshawn/MetaSKAT")  
  1. Install SAIGE from the source code.
    Method 1:
  src_branch=master  
  repo_src_url=https://github.com/weizhouUMICH/SAIGE  
  git clone --depth 1 -b <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>s</mi><mi>r</mi><msub><mi>c</mi><mi>b</mi></msub><mi>r</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>h</mi></mrow><annotation encoding="application/x-tex">src_branch </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord mathnormal" style="margin-right:0.02778em;">sr</span><span class="mord"><span class="mord mathnormal">c</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3361em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">b</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">an</span><span class="mord mathnormal">c</span><span class="mord mathnormal">h</span></span></span></span>repo_src_url  
  R CMD INSTALL --library=path_to_final_SAIGE_library SAIGE  

When call SAIGE in R, set lib.loc=path_to_final_SAIGE_library

  library(SAIGE, lib.loc=path_to_final_SAIGE_library)  

Method 2:
Open R. Run

  devtools::install_github("weizhouUMICH/SAIGE")  

Run SAIGE using a docker image

Thanks to Juha Karjalainen for sharing the Dockerfile. The docker image can be pulled

docker pull wzhou88/saige:0.38

Dockerfile for creating your own docker image for SAIGE

Functions can be called

step1_fitNULLGLMM.R --help
step2_SPAtests.R --help
createSparseGRM.R --help

Flowchart

SAIGE and SAIGE-GENE can work for both binary and quantitative traits.

SAIGE

2 steps in SAIGE

SAIGE-GENE

2 steps in SAIGE-GENE

Running SAIGE and SAIGE-GENE

Scripts are in the folder extdata/ and command lines are in the file cmd.sh

cd /extdata/
less -S cmd.sh

Input files are in the folder extdata/input
Output files are in the folder extdata/output

To obtain help information of the scripts that call functions in the SAIGE library

Rscript createSparseGRM.R --help
Rscript step1_fitNULLGLMM.R --help
Rscript step2_SPAtests.R --help

To obtain help information of functions in the SAIGE library

#open R
R
library(SAIGE)
?createSparseGRM
?fitNULLGLMM
?SPAGMMATtest

Single variant association tests

For single-variant association tests, sparse GRM and categorical variance ratios are NOT needed. Randomly selected markers with MAC >= 20 are used to estimate the variance ratio

Step 1: fitting the null logistic/linear mixed model

#check the help info for step 1
Rscript step1_fitNULLGLMM.R --help
#For Binary traits:
Rscript step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_binary \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=binary        \
        --outputPrefix=./output/example_binary \
        --nThreads=4 \
        --LOCO=FALSE \
        --IsOverwriteVarianceRatioFile ## v0.38. Whether to overwrite the variance ratio file if the file already exists


#For Quantitative traits, if not normally distributed, inverse normalization needs to be specified to be TRUE --invNormalize=TRUE
Rscript step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_quantitative \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=quantitative       \
    --invNormalize=TRUE	\
        --outputPrefix=./output/example_quantitative \
        --nThreads=4 \
        --LOCO=FALSE	\
    --tauInit=1,0
Input files

  1. Genotype file for constructing the genetic relationship matrix (GRM)
    SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam
  1. Phenotype file (contains non-genetic covariates if any, such as gender and age) The file can be either space or tab-delimited with a header. It is required that the file contains one column for sample IDs and one column for the phenotype. It may contain columns for non-genetic covariates.

Note: Current version of SAIGE does not support categorical covariates that have more than two categories

less -S ./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt
Output files

  1. model file
./output/example_quantitative.rda
#open R
R
#load the model file in R
load("./output/example_quantitative.rda")
names(modglmm)
modglmm$theta

#theta: a vector of length 2. The first element is the dispersion parameter estimate and the second one is the variance component parameter estimate
#coefficients: fixed effect parameter estimates
#linear.predictors: a vector of length N (N is the sample size) containing linear predictors
#fitted.values: a vector of length N (N is the sample size) containing fitted mean values on the original scale
#Y: a vector of length N (N is the sample size) containing final working vector
#residuals: a vector of length N (N is the sample size) containing residuals on the original scale
#sampleID: a vector of length N (N is the sample size) containing sample IDs used to fit the null model
  1. association result file for the subset of randomly selected markers
less -S ./output/example_quantitative_30markers.SAIGE.results.txt
  1. variance ratio file
less -S ./output/example_quantitative.varianceRatio.txt

Step 2: performing single-variant association tests

#check the help info for step 2
Rscript step2_SPAtests.R --help
#Perform single-variant association tests
#for binary traits, 
* --IsOutputAFinCaseCtrl=TRUE can be specified to output allele frequencies in cases and controls
* --IsOutputNinCaseCtrl=TRUE can be specified to output sample sizes of cases and controls for each marker
* --IsOutputHetHomCountsinCaseCtrl can be specified to output heterozygous and homozygous counts in cases and controls

Rscript step2_SPAtests.R	\
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

##drop samples with missing genotypes/dosages
##--IsDropMissingDosages=TRUE
Rscript step2_SPAtests.R        \
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE	\
    --IsDropMissingDosages=TRUE    


#conditional analysis
## --condition = Genetic marker ids (**chr:pos_ref/alt for vcf/sav or marker id for bgen**) separated by comma. e.g.chr3:101651171_C/T,chr3:101651186_G/A

Rscript step2_SPAtests.R        \
        --vcfFile=./input/genotype_10markers.missingness.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.missingness.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.vcf.genotype.dropmissing.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE     \
    --IsDropMissingDosages=TRUE	\
    --condition=1:4_A/C

#plain text input for dosages (plain text input for dosages is not inputed since 0.36.1)
Rscript step2_SPAtests.R \
        --dosageFile=./input/dosage_10markers.txt \
        --dosageFileNrowSkip=1 \
        --dosageFileNcolSkip=6 \
        --dosageFilecolnamesSkip=CHR,SNP,CM,POS,EFFECT_ALLELE,ALT_ALLELE \
        --minMAF=0.0001 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.plainDosage.SAIGE.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#bgen for dosages
Rscript step2_SPAtests.R \
        --bgenFile=./input/genotype_100markers.bgen \
        --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/samplefileforbgen_10000samples.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.bgen.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#VCF containing genotypes (--vcfField=GT)
Rscript step2_SPAtests.R \
        --vcfFile=./input/genotype_10markers.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.vcf.genotype.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

#VCF containing dosages (--vcfField=DS)
Rscript step2_SPAtests.R \
        --vcfFile=./input/dosage_10markers.vcf.gz \
        --vcfFileIndex=./input/dosage_10markers.vcf.gz.tbi \
        --vcfField=DS \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --sampleFile=./input/sampleIDindosage.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.vcf.dosage.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE

# SAV for dosages
Rscript step2_SPAtests.R        \
        --savFile=./input/dosage_10markers.sav  \
        --savFileIndex=./input/dosage_10markers.sav.s1r \
        --vcfField=DS \
        --minMAF=0.0001 \
        --minMAC=1 \
        --chrom=1 \
        --sampleFile=./input/samplefileforbgen_10000samples.txt \
        --GMMATmodelFile=./output/example.rda \
        --varianceRatioFile=./output/example.varianceRatio.txt \
        --SAIGEOutputFile=./output/example.SAIGE.sav.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE
Input files

  1. Dosage file SAIGE supports different formats for dosages: VCF, BCF, BGEN and SAV.\
genotype_100markers.bgen
genotype_100markers.bgen.bgi 
zcat genotype_10markers.vcf.gz | less -S
genotype_10markers.vcf.gz.tbi
zcat dosage_10markers.vcf.gz | less -S
dosage_10markers.vcf.gz.tbi
dosage_10markers.sav
dosage_10markers.sav.s1r
  1. Sample file
    This file contains one column for sample IDs corresponding to the sample order in the dosage file. **No header is included.**The option was originally for BGEN file that does not contain sample information.
less -S sampleIDindosage.txt
  1. Model file from step 1
  2. Variance ratio file from step 1
./output/example.varianceRatio.txt
Output file

A file with association test results

less -S ./output/example.SAIGE.vcf.dosage.txt

NOTE:

CHR: chromosome
POS: genome position 
SNPID: variant ID
Allele1: allele 1
Allele2: allele 2
AC_Allele2: allele count of allele 2
AF_Allele2: allele frequency of allele 2
imputationInfo: imputation info. If not in dosage/genotype input file, will output 1
N: sample size
BETA: effect size of allele 2
SE: standard error of BETA
Tstat: score statistic of allele 2
p.value: p value (with SPA applied for binary traits)
p.value.NA: p value when SPA is not applied (only for binary traits)
Is.SPA.converge: whether SPA is converged or not (only for binary traits)
varT: estimated variance of score statistic with sample relatedness incorporated
varTstar: variance of score statistic without sample relatedness incorporated
AF.Cases: allele frequency of allele 2 in cases (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
AF.Controls: allele frequency of allele 2 in controls (only for binary traits and if --IsOutputAFinCaseCtrl=TRUE)
Tstat_cond, p.value_cond, varT_cond, BETA_cond, SE_cond: summary stats for conditional analysis

An example with a signal

Rscript step1_fitNULLGLMM.R     \
        --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/Prev_0.1_nfam_1000.pheno_positive_pheno.txt \
        --phenoCol=y \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=binary        \
        --outputPrefix=./output/example_binary_positive_signal \
        --nThreads=4    \
        --LOCO=FALSE    \
        --minMAFforGRM=0.01

Rscript step2_SPAtests.R        \
        --vcfFile=./input/nfam_1000_MAF0.2_nMarker1_nseed200.vcf.gz \
        --vcfFileIndex=./input/nfam_1000_MAF0.2_nMarker1_nseed200.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0.0001 \
        --minMAC=1 \
        --GMMATmodelFile=./output/example_binary_positive_signal.rda \
        --varianceRatioFile=./output/example_binary_positive_signal.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary_positive_signal.assoc.step2.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE     \

Region- or gene-based association tests (SAIGE-GENE)

Step 0: creating a sparse GRM

Note: This step is only needed for region- and gene-based tests (SAIGE-GENE)
The sparse GRM only needs to be created once for each data set, e.g. a biobank, and can be used for all different phenotypes as long as all tested samples are included in it.

Rscript createSparseGRM.R	\
    --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
    --nThreads=4  \
    --outputPrefix=./output/sparseGRM	\
    --numRandomMarkerforSparseKin=2000	\
    --relatednessCutoff=0.125
Input file

Genotype file for constructing the genetic relationship matrix
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam

nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim
nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam
Output files

  1. a file storing the sparse GRM
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx

The sparse GRM file can be opened using the readMM function in the R library Matrix

  1. a file storing IDs of the samples in the sparse GRM
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

Step 1: fitting the null logistic/linear mixed model

#by default 
 --cateVarRatioMinMACVecExclude=0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5
 --cateVarRatioMaxMACVecInclude=1.5,2.5,3.5,4.5,5.5,10.5,20.5
corresponding to
0.5 < MAC <=  1.5
1.5 < MAC <=  2.5
2.5 < MAC <=  3.5
3.5 < MAC <=  4.5
4.5 < MAC <=  5.5
5.5 < MAC <=  10.5
10.5 < MAC <=  20.5
20.5 < MAC
Rscript step1_fitNULLGLMM.R     \
    --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_quantitative \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=quantitative       \
        --invNormalize=TRUE     \
        --outputPrefix=./output/example_quantitative \
    --outputPrefix_varRatio=./output/example_quantitative_cate	\
    --sparseGRMFile=./output/example_binary_cate.varianceRatio.txt.sparseGRM.mtx    \
        --sparseGRMSampleIDFile=./output/example_binary.varianceRatio.txt.sparseGRM.mtx.sample  \
        --nThreads=4 \
        --LOCO=FALSE	\
    --skipModelFitting=FALSE \
        --IsSparseKin=TRUE      \
        --isCateVarianceRatio=TRUE	
Input files

  1. (same as step 1 for single-variant assoc tests and step 0)
    Genotype file for constructing the genetic relationship matrix in the plink format
  2. a file storing the sparse GRM (optional, output by step 0)
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx

The sparse GRM file can be opened using the readMM function in the R library Matrix

  1. a file storing IDs of the samples in the sparse GRM (optional, output by step 0)
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt
Output files

  1. model file
  2. association result file for the subset of randomly selected markers
  3. variance ratio file
  1. sparse Sigma file
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx

NOTE: the file contains the sparse Sigma matrix, which is ** NOT ** the sparse GRM. The sparse Sigma matrix is computed based on the sparse GRM and the results of step 1.

Step 2: performing the region- or gene-based association tests

Rscript step2_SPAtests.R \
        --vcfFile=./input/genotype_10markers.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
    --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_quantitative.rda \
        --varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_quantitative.SAIGE.gene.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest_simulation.txt    \
        --sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsSingleVarinGroupTest=TRUE


##another example, conditional analysis for gene-based tests
Rscript step2_SPAtests.R \
        --vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
        --vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
        --vcfField=DS \
        --chrom=chr1 \
        --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_quantitative.rda \
        --varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
    --SAIGEOutputFile=./output/example_quantitative.SAIGE.gene_conditional.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest.txt    \
        --sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsOutputAFinCaseCtrl=TRUE     \
        --IsSingleVarinGroupTest=TRUE   \
    --condition=chr1:32302_A/C 


##Specify customized weights for markers in the gene- or region-based tests
* weightsIncludeinGroupFile logical. Whether to specify customized weight for makers in gene- or region-based tests. If TRUE, weights are included in the group file. For vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab. By default, FALSE
* weights_for_G2_cond vector of float. weights for conditioning markers for gene- or region-based tests. The length equals to the number of conditioning markers, delimited by comma.

Rscript step2_SPAtests.R \
        --vcfFile=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav \
        --vcfFileIndex=./input/seedNumLow_126001_seedNumHigh_127000_nfam_1000_nindep_0.sav.s1r \
        --vcfField=DS \
        --chrom=chr1 \
        --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary_cate_v2.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.gene_conditional_withspecifiedWeights.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest_withWeights.txt    \
        --sparseSigmaFile=./output/example_binary_cate_v2.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsOutputAFinCaseCtrl=TRUE     \
        --IsSingleVarinGroupTest=TRUE   \
        --IsOutputPvalueNAinGroupTestforBinary=TRUE     \
        --IsAccountforCasecontrolImbalanceinGroupTest=TRUE      \
        --weightsIncludeinGroupFile=TRUE        \
        --weights_for_G2_cond=3,1       \
        --condition=chr1:32302_A/C,chr1:32304_A/C

Input files

  1. Dosage file
    SAIGE-GENE can take in VCF, BCF, BGEN, and SAV.
  2. Sample file
    This file contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included.
less -S sampleIDindosage.txt
  1. Model file from step 1
  2. Variance ratio file from step 1
./output/example.varianceRatio.txt
  1. Group file
    Each line is for one gene/set of variants. The first element is for gene/set name.
    The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.
    ** Note that the order of variants in the group file needs to be consistent to variants' order in the dosage file ** To specify customized weights for variants, set weightsIncludeinGroupFile=TRUE and in the group file, for vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab.**
    ** If weightsIncludeinGroupFile=TRUE, in conditional analysis, weights for conditioning markers need to be specified using the argument weights_for_G2_cond, e.g. weights_for_G2_cond=1,3,4. The weights in --weights_for_G2_cond are for markers in --condition, respectively.**
less -S ./input/groupFile_geneBasedtest.txt
  1. sparse Sigma file
sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx
Output files

  1. A file with region- or gene-based association test results
less -S ./output/example_quantitative.SAIGE.gene_conditional.txt

NOTE:

Gene : Gene name (as provided in the group file)
Pvalue: p-value of SKAT-O test
Pvalue_cond: conditional p-value of SKAT-O test (if --condition= is specified)
Nmarker_MACCate_n: number of markers in the gene falling in the nth MAC category (The MAC category is corresponding to the one used for the variance ratio estimation). For example, by default, Nmarker_MACCate_1 is the number of singletons
markerIDs: IDs for variants included in the test
markerAFs: allele frequencies for variants included in the test
Pvalue_Burden: p-value of Burden test
Pvalue_SKAT: p-value of SKAT test
Pvalue_Burden_cond: conditional p-value of Burden test (if --condition= is specified)
Pvalue_SKAT_cond: conditional p-value of SKAT test (if --condition= is specified)
Pvalue_SKAT.NA: p-values of SKAT test without accounting for unbalanced case-control ratios for binary traits
Pvalue.NA: p-values of SKAT-O test without accounting for unbalanced case-control ratios for binary traits
Pvalue_Burden.NA: p-values of Burden test without accounting for unbalanced case-control ratios for binary traits

  1. A file with single-variant association test results for genetic variants included in the gene-based tests (if --IsSingleVarinGroupTest=TRUE)
less -S ./output/example_quantitative.SAIGE.gene_conditional.txt_single

Same as the output by SAIGE for single-variant association tests