GitHub - dajiangliu/rareGWAMA (original) (raw)

A rareGWAMA R package

Table of Contents

Introduction
Citation
Installation
Quick tutorial
Input files and arguments
Feedback/Contact

Introduction

rareGWAMA is a flexible software package for imputation based GWAS meta-analysis.
It is developed and maintained by Dajiang Liu's Group.

Citation

Liu DJ*†, Peloso GM*, Zhan X*, Holmen O*, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, Peters U, Farrall M, Orho-Melander M, Kooperberg C, McPherson R, Watkins H, Willer CJ, Hveem, K, Melander O, Kathiresan S, Abecasis GR†
Meta-analysis of gene-level tests of rare variant association, Nature Genetics, 46, 200–204 (2014)
doi: 10.1038/ng.2852.

Installing the rareGWAMA R package

The package is hosted on github, which allows installation and update to be very easy. First, make sure you have the mvtnorm and data.table packages installed:

install.packages("devtools")

And also, you need the latest version of seqminer:

library(devtools)
devtools::install_github("zhanxw/seqminer")

Then you could use:

install_github("dajiangliu/rareGWAMA")

With library(rareGWAMA), your are ready to go!

Quick tutorial

Single variant tests

1.The very basic test is using:

res <- rareGWAMA.single(score.stat.file=study, imp.qual.file =imp.qual, tabix.range="1:11000-58000", alternative="two.sided", col.impqual=5, impQual.lb=0, impQualWeight=FALSE, weight="Npq+impQ",gc=FALSE, rmMultiAllelicSite=TRUE);

please find more details in the input and arguments part for the arguments:

score.stat.file: The file names of score statistic files, which could be a vector object;

imp.qual.file: The file names of imputation quality, which could be a vector object;

tabix.range: The tabix range, which must be in quote and provided as a string like this: "1:11000-58000";

alternative: The alternative hypothesis. Default is two.sided;

col.impqual: The column number for the imputation quality score;

impQual.lb: The lower bound for the imputation quality. Variants with imputaiton quality less than impQual.lb will be labelled as missing;

impQualWeight: Using imputation quality as weight;

rmMultiAllelicSite: Default is TRUE. Multi-allelic sites will be removed from the analyses if set TRUE, and a variable posMulti will be output; The variant site with multiple alleles can be analyzed using rareGWAMA.single.multiAllele function;

2.The out put should be as follows:
head(res$res.formatted))

     POS           REF ALT  AF         STAT       PVALUE     BETA        SD         N        DIRECTION                          EFFECTIVE_N
[1,] "9:100000040" "G" "A"  "2.28e-05" "1.01e+00" "3.16e-01" " 7.29e-01" " 0.72624" "41634"  "XXXXXXXXXXXXXXXXXXXXXXXXXXXX++XX" "7925"
[2,] "9:100000056" "T" "TA" "7.07e-01" "1.62e+00" "2.04e-01" "-9.09e-03" " 0.00715" "47219"  "XXXXXXXXXXXXXXXX-XXXXXXXXXXX--XX" "46951"
[3,] "9:100000102" "G" "C"  "6.91e-04" "8.00e-01" "3.71e-01" "-7.78e-02" " 0.08691" "95869"  "+++---+-------++XX-++--X++----XX" "36378"
[4,] "9:100000172" "C" "T"  "2.26e-05" "2.25e-01" "6.35e-01" "-1.99e-01" " 0.41982" "125656" "XXXXXXXXXXXXXXXX-XXXXXXXXXXX--X+" "20434"
[5,] "9:100000177" "C" "G"  "2.16e-04" "8.38e-01" "3.60e-01" "-1.04e-01" " 0.11335" "179891" "-+---X+XX-+--+-+-X-+-+-X++-+--X+" "89179"
[6,] "9:100000187" "A" "T"  "1.13e-04" "6.78e-01" "4.10e-01" "-2.35e-01" " 0.28514" "54235"  "+-++-XX+--++--+-XX+++++X----XXXX" "21285"

For demo data, please see ?rareGWAMA.single.

Conditional single variant tests

1.The command should be like:

res <- rareGWAMA.cond.single(score.stat.file=study, imp.qual.file=imp.qual, vcf.ref.file="{$your_path}/ALL.chr9.phase3.genotypes.vcf.gz", candidateVar="9:97018619", knownVar="9:100000172", alternative="two.sided", col.impqual=5, impQual.lb=0, impQualWeight=FALSE, weight="Npq+impQ", gc=FALSE, rmMultiAllelicSite=TRUE);

please find more details in the input and arguments part for the arguments:

score.stat.file: The file names of score statistic files, which could be a vector object;

imp.qual.file: The file names of imputation quality, which could be a vector object;

vcf.ref.file: the file names of the reference panel file, e.g. could be downloaded from 1000 Genomes Project.

candidateVar: The tabix range;

knownVar: The known variant;

alternative: The alternative hypothesis. Default is two.sided;

col.impqual: The column number for the imputation quality score;

impQual.lb: The lower bound for the imputation quality. Variants with imputaiton quality less than impQual.lb will be labelled as missing;

impQualWeight: Using imputation quality as weight;

rmMultiAllelicSite: Default is TRUE. Multi-allelic sites will be removed from the analyses if set TRUE, and a variable posMulti will be output; The variant site with multiple alleles can be analyzed using rareGWAMA.single.multiAllele function;

2.The out put should be as follows:
head(res$res.formatted))

    POS          REF ALT AF         STAT    PVALUE BETA    SD      N       numStability
[1,] "9:97018619" "T" "C" "8.82e-05" "0.093" "0.76" "0.131" "0.428" "54235" "0"

For demo data, please see ?rareGWAMA.cond.single.

Gene based tests

For more details, please see the SHOWCASE

1.The command should be like:

res.gene <- rareGWAMA.gene(score.stat.file, imp.qual.file=imp.qual.file, vcf.ref.file, refFileFormat="vcf.vbi", anno=anno, annoType=c('Nonsynonymous','Stop_Gain',"Essential_Splice_site"), rvtest='VT', ref.ancestry=ref.ancestry, trans.ethnic=TRUE, study.ancestry=study.ancestry, maf.cutoff=0.01);

please find more details in the input and arguments part for the arguments:

score.stat.file: The file names of score statistic files, which could be a vector object;

imp.qual.file: The file names of imputation quality, which could be a vector object;

vcf.ref.file: The file names of the reference panel file, e.g. could be downloaded from 1000 Genomes Project.

anno: Annotation file or list;

annoType: The annotation types you want to use;

rvtest: The method you want to use (i.e. 'VT', 'BURDEN', 'SKAT');

ref.ancestry: The ancestry information for each sample;

study.ancestry: The ancestry information for each study;

maf.cutoff: The cutoff for the MAF, could be 0.01, 0.05, 0.001, whatever you want;

2.The out put should be as follows:

GENE    RANGE   STAT    P-VALUE MAF_CUTOFF      NUM_VAR TOTAL_MAF       POS_VAR N       POS_SINGLE_MINP BETA_SINGLE_MINP        SD_SINGLE_MINP
MATN2   8:97931370-98033638     2.19    0.353   0.009472856879857       3       0.0123  8:97961468_G/A,8:98018087_G/A,8:98021213_G/A    204783  8:98021213_G/A  0.01813449913
STK3    8:98767360-98767360     0.0894  0.765   0.00582326555223991     1       0.00582 8:98767360_C/T  235921  8:98767360_C/T  0.00505283659608632     0.0168969355386225
VPS13B  8:99121478-99853608     0.305   0.752   0.00630135839199519     2       0.0117  8:99778930_G/A,8:99820031_A/G   283684  8:99778930_G/A  0.0114882148633614      0.016
COX6C   8:99892015-99892015     0.349   0.555   0.00244114819976761     1       0.00244 8:99892015_G/A  231499  8:99892015_G/A  -0.0159448806371386     0.0269803627402012

Input files and arguments

👉 All the score statistics files and imputation quality files should be tabix indexed.
Say if you have such files: study1.gz(score statistics files), study2.gz, study1.R2.gz(imputation quality files), study2.R2.gzin your folder, and already added tabix to your $PATH environment variable, you could use:

for file in study*;do g=`zcat <math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mi>i</mi><mi>l</mi><mi>e</mi><mi mathvariant="normal">∣</mi><mi>h</mi><mi>e</mi><mi>a</mi><mi>d</mi><mo>−</mo><mn>200</mn><mi mathvariant="normal">∣</mi><mi>g</mi><mi>r</mi><mi>e</mi><mi>p</mi><mo>−</mo><mi>n</mi><mi>C</mi><mi>H</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi mathvariant="normal">∣</mi><mi>c</mi><mi>u</mi><mi>t</mi><mo>−</mo><mi>f</mi><mn>1</mn><mo>−</mo><mi>d</mi><mi mathvariant="normal">"</mi><mo>:</mo><mi mathvariant="normal">"</mi><mi mathvariant="normal">‘</mi><mo separator="true">;</mo><mi>t</mi><mi>a</mi><mi>b</mi><mi>i</mi><mi>x</mi><mo>−</mo><mi>f</mi><mo>−</mo><mi>S</mi></mrow><annotation encoding="application/x-tex">file | head -200 | grep -n CHROM | cut -f1 -d":"`; tabix -f -S </annotation></semantics></math>file∣head−200∣grep−nCHROM∣cut−f1−d":"‘;tabix−f−Sg -s 1 -b 2 -e 2 $file; done

to tabix each file.

Score statistics files:

If you use 1 RVTESTS, the output is ready to go:

CHROM   POS     REF     ALT     N_INFORMATIVE   AF      INFORMATIVE_ALT_AC      CALL_RATE       HWE_PVALUE      N_REF   N_HET   N_ALT   U_STAT  SQRT_V_STAT     ALT_EFFSIZE     PVALUE
1       10177   A       AC      2352    0.5     2352    1       0       0       2352    0       1.67496 2.51553 0.264695        0.505508
1       10235   T       TA      2352    0       0       1       1       2352    0       0       -0.108472       0.207841        -2.51104        0.601742
1       10352   T       TA      2352    0.5     2352    1       0       0       2352    0       0.665562        2.61389 0.0974122       0.799013
1       10539   C       A       2352    0       0       1       1       2352    0       0       -0.00020902     0.0626305       -0.0532862      0.997337

Imputation quality files:

An example file has the following format:

CHROM   POS     REF     ALT     Rsq
1       10177   A       AC      0.00581
1       10235   T       TA      0.00396
1       10352   T       TA      0.00608
1       10539   C       A       0.00154
1       10616   CCGCCGTTGCAAAGGCGCGCCG  C       0.02085
1       10642   G       A       0.00013reference allele

There are five columns, which are Chromosome #, Position, Reference allele, Alternative allele and R square quality. The higher of R square quality(the 5th column) value, the better of the genotype imputation quality.

VCF reference files:

The file names of the reference panel file.
You could download the files from 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502.
Also, you could subset the ranges you want by using tabix, with command looks like:

tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768

For more details, please see: How do I get a sub-section of a VCF file?;

Annotation files:

       chrom      pos ref alt         af          anno    gene
112207     1 32247432   G   A 0.03425920 Nonsynonymous FAM167B
112208     1 32247598   A   G 0.07993410          Exon FAM167B
112209     1 32247709   G   T 0.00645166        Intron FAM167B
112210     1 32248405   G   C 0.00290383 Nonsynonymous FAM167B
140157     1 41479019   C   T 0.09314330          Utr3    EDN2
140158     1 41479369   C   T 0.06422470          Utr3    EDN2

Ancestry files:

1.ref.ancestry.

You should have a original file (i.e. ref.ancestry.ori) as:

head(ref.ancestry.ori)
        fid ancestry study
1 samp1      HIM  HCHS
2 samp2      HIM  HCHS
3 samp3      HIM  HCHS
4 samp4      HIM  HCHS
5 samp5      HIM  HCHS
6 samp6      HIM  HCHS

Then, you use:ref.ancestry <- cbind(ref.ancestry.ori[,1], paste(ref.ancestry.ori[,2], ref.ancestry.ori[,3], sep=","))

So, the final format should be a matrix:

     [,1]        [,2]
[1,] "samp1" "HIM,HCHS"
[2,] "samp2" "HIM,HCHS"
[3,] "samp3" "HIM,HCHS"
[4,] "samp4" "HIM,HCHS"
[5,] "samp5" "HIM,HCHS"
[6,] "samp6" "HIM,HCHS"

2.study.ancestry.

Just a vector, as

[1] "AACAC" "AMISH" "ARIC"  "BAGS"  "CFS"   "CHS"

Feedback/Contact

Questions and requests can be sent to Github issue page (link) or Dajiang Liu (dajiang.liu@outlook.com) and Fang Chen(fchen1@hmc.psu.edu)

References

1: Xiaowei Zhan, Youna Hu, Bingshan Li, Goncalo R. Abecasis, and Dajiang J. Liu
RVTESTS: An Efficient and Comprehensive Tool for Rare Variant Association Analysis Using Sequence Data
Bioinformatics 2016 32: 1423-1426. doi:10.1093/bioinformatics/btw079 (PDF)