GitHub - dajiangliu/rareGWAMA (original) (raw)
A rareGWAMA R package
Table of Contents
Introduction
rareGWAMA is a flexible software package for imputation based GWAS meta-analysis.
It is developed and maintained by Dajiang Liu's Group.
Citation
Liu DJ*†, Peloso GM*, Zhan X*, Holmen O*, Zawistowski M, Feng S, Nikpay M, Auer PL, Goel A, Zhang H, Peters U, Farrall M, Orho-Melander M, Kooperberg C, McPherson R, Watkins H, Willer CJ, Hveem, K, Melander O, Kathiresan S, Abecasis GRâ€
Meta-analysis of gene-level tests of rare variant association, Nature Genetics, 46, 200–204 (2014)
doi: 10.1038/ng.2852.
Installing the rareGWAMA R package
The package is hosted on github, which allows installation and update to be very easy. First, make sure you have the mvtnorm and data.table packages installed:
install.packages("devtools")
And also, you need the latest version of seqminer:
library(devtools)
devtools::install_github("zhanxw/seqminer")
Then you could use:
install_github("dajiangliu/rareGWAMA")
With library(rareGWAMA), your are ready to go!
Quick tutorial
Single variant tests
1.The very basic test is using:
res <- rareGWAMA.single(score.stat.file=study, imp.qual.file =imp.qual, tabix.range="1:11000-58000", alternative="two.sided", col.impqual=5, impQual.lb=0, impQualWeight=FALSE, weight="Npq+impQ",gc=FALSE, rmMultiAllelicSite=TRUE);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- tabix.range: The tabix range, which must be in quote and provided as a string like this: "1:11000-58000";
- alternative: The alternative hypothesis. Default is two.sided;
- col.impqual: The column number for the imputation quality score;
- impQual.lb: The lower bound for the imputation quality. Variants with imputaiton quality less than impQual.lb will be labelled as missing;
- impQualWeight: Using imputation quality as weight;
- rmMultiAllelicSite: Default is TRUE. Multi-allelic sites will be removed from the analyses if set TRUE, and a variable posMulti will be output; The variant site with multiple alleles can be analyzed using rareGWAMA.single.multiAllele function;
2.The out put should be as follows:head(res$res.formatted))
POS REF ALT AF STAT PVALUE BETA SD N DIRECTION EFFECTIVE_N
[1,] "9:100000040" "G" "A" "2.28e-05" "1.01e+00" "3.16e-01" " 7.29e-01" " 0.72624" "41634" "XXXXXXXXXXXXXXXXXXXXXXXXXXXX++XX" "7925"
[2,] "9:100000056" "T" "TA" "7.07e-01" "1.62e+00" "2.04e-01" "-9.09e-03" " 0.00715" "47219" "XXXXXXXXXXXXXXXX-XXXXXXXXXXX--XX" "46951"
[3,] "9:100000102" "G" "C" "6.91e-04" "8.00e-01" "3.71e-01" "-7.78e-02" " 0.08691" "95869" "+++---+-------++XX-++--X++----XX" "36378"
[4,] "9:100000172" "C" "T" "2.26e-05" "2.25e-01" "6.35e-01" "-1.99e-01" " 0.41982" "125656" "XXXXXXXXXXXXXXXX-XXXXXXXXXXX--X+" "20434"
[5,] "9:100000177" "C" "G" "2.16e-04" "8.38e-01" "3.60e-01" "-1.04e-01" " 0.11335" "179891" "-+---X+XX-+--+-+-X-+-+-X++-+--X+" "89179"
[6,] "9:100000187" "A" "T" "1.13e-04" "6.78e-01" "4.10e-01" "-2.35e-01" " 0.28514" "54235" "+-++-XX+--++--+-XX+++++X----XXXX" "21285"
- For demo data, please see
?rareGWAMA.single.
Conditional single variant tests
1.The command should be like:
res <- rareGWAMA.cond.single(score.stat.file=study, imp.qual.file=imp.qual, vcf.ref.file="{$your_path}/ALL.chr9.phase3.genotypes.vcf.gz", candidateVar="9:97018619", knownVar="9:100000172", alternative="two.sided", col.impqual=5, impQual.lb=0, impQualWeight=FALSE, weight="Npq+impQ", gc=FALSE, rmMultiAllelicSite=TRUE);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- vcf.ref.file: the file names of the reference panel file, e.g. could be downloaded from 1000 Genomes Project.
- candidateVar: The tabix range;
- knownVar: The known variant;
- alternative: The alternative hypothesis. Default is two.sided;
- col.impqual: The column number for the imputation quality score;
- impQual.lb: The lower bound for the imputation quality. Variants with imputaiton quality less than impQual.lb will be labelled as missing;
- impQualWeight: Using imputation quality as weight;
- rmMultiAllelicSite: Default is TRUE. Multi-allelic sites will be removed from the analyses if set TRUE, and a variable posMulti will be output; The variant site with multiple alleles can be analyzed using rareGWAMA.single.multiAllele function;
2.The out put should be as follows:head(res$res.formatted))
POS REF ALT AF STAT PVALUE BETA SD N numStability
[1,] "9:97018619" "T" "C" "8.82e-05" "0.093" "0.76" "0.131" "0.428" "54235" "0"
- For demo data, please see
?rareGWAMA.cond.single.
Gene based tests
For more details, please see the SHOWCASE
1.The command should be like:
res.gene <- rareGWAMA.gene(score.stat.file, imp.qual.file=imp.qual.file, vcf.ref.file, refFileFormat="vcf.vbi", anno=anno, annoType=c('Nonsynonymous','Stop_Gain',"Essential_Splice_site"), rvtest='VT', ref.ancestry=ref.ancestry, trans.ethnic=TRUE, study.ancestry=study.ancestry, maf.cutoff=0.01);
please find more details in the input and arguments part for the arguments:
- score.stat.file: The file names of score statistic files, which could be a vector object;
- imp.qual.file: The file names of imputation quality, which could be a vector object;
- vcf.ref.file: The file names of the reference panel file, e.g. could be downloaded from 1000 Genomes Project.
- anno: Annotation file or list;
- annoType: The annotation types you want to use;
- rvtest: The method you want to use (i.e. 'VT', 'BURDEN', 'SKAT');
- ref.ancestry: The ancestry information for each sample;
- study.ancestry: The ancestry information for each study;
- maf.cutoff: The cutoff for the MAF, could be 0.01, 0.05, 0.001, whatever you want;
2.The out put should be as follows:
GENE RANGE STAT P-VALUE MAF_CUTOFF NUM_VAR TOTAL_MAF POS_VAR N POS_SINGLE_MINP BETA_SINGLE_MINP SD_SINGLE_MINP
MATN2 8:97931370-98033638 2.19 0.353 0.009472856879857 3 0.0123 8:97961468_G/A,8:98018087_G/A,8:98021213_G/A 204783 8:98021213_G/A 0.01813449913
STK3 8:98767360-98767360 0.0894 0.765 0.00582326555223991 1 0.00582 8:98767360_C/T 235921 8:98767360_C/T 0.00505283659608632 0.0168969355386225
VPS13B 8:99121478-99853608 0.305 0.752 0.00630135839199519 2 0.0117 8:99778930_G/A,8:99820031_A/G 283684 8:99778930_G/A 0.0114882148633614 0.016
COX6C 8:99892015-99892015 0.349 0.555 0.00244114819976761 1 0.00244 8:99892015_G/A 231499 8:99892015_G/A -0.0159448806371386 0.0269803627402012
Input files and arguments
👉 All the score statistics files and imputation quality files should be tabix indexed.
Say if you have such files: study1.gz(score statistics files), study2.gz, study1.R2.gz(imputation quality files), study2.R2.gzin your folder, and already added tabix to your $PATH environment variable, you could use:
for file in study*;do g=`zcat <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>f</mi><mi>i</mi><mi>l</mi><mi>e</mi><mi mathvariant="normal">∣</mi><mi>h</mi><mi>e</mi><mi>a</mi><mi>d</mi><mo>−</mo><mn>200</mn><mi mathvariant="normal">∣</mi><mi>g</mi><mi>r</mi><mi>e</mi><mi>p</mi><mo>−</mo><mi>n</mi><mi>C</mi><mi>H</mi><mi>R</mi><mi>O</mi><mi>M</mi><mi mathvariant="normal">∣</mi><mi>c</mi><mi>u</mi><mi>t</mi><mo>−</mo><mi>f</mi><mn>1</mn><mo>−</mo><mi>d</mi><mi mathvariant="normal">"</mi><mo>:</mo><mi mathvariant="normal">"</mi><mi mathvariant="normal">‘</mi><mo separator="true">;</mo><mi>t</mi><mi>a</mi><mi>b</mi><mi>i</mi><mi>x</mi><mo>−</mo><mi>f</mi><mo>−</mo><mi>S</mi></mrow><annotation encoding="application/x-tex">file | head -200 | grep -n CHROM | cut -f1 -d":"`; tabix -f -S </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span class="mord mathnormal">i</span><span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span class="mord mathnormal">e</span><span class="mord">∣</span><span class="mord mathnormal">h</span><span class="mord mathnormal">e</span><span class="mord mathnormal">a</span><span class="mord mathnormal">d</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">200∣</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mord mathnormal">re</span><span class="mord mathnormal">p</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-right:0.07153em;">C</span><span class="mord mathnormal" style="margin-right:0.08125em;">H</span><span class="mord mathnormal" style="margin-right:0.10903em;">ROM</span><span class="mord">∣</span><span class="mord mathnormal">c</span><span class="mord mathnormal">u</span><span class="mord mathnormal">t</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6944em;"></span><span class="mord mathnormal">d</span><span class="mord">"</span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">:</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord">"‘</span><span class="mpunct">;</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord mathnormal">t</span><span class="mord mathnormal">abi</span><span class="mord mathnormal">x</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord mathnormal" style="margin-right:0.05764em;">S</span></span></span></span>g -s 1 -b 2 -e 2 $file; done
to tabix each file.
Score statistics files:
If you use 1 RVTESTS, the output is ready to go:
CHROM POS REF ALT N_INFORMATIVE AF INFORMATIVE_ALT_AC CALL_RATE HWE_PVALUE N_REF N_HET N_ALT U_STAT SQRT_V_STAT ALT_EFFSIZE PVALUE
1 10177 A AC 2352 0.5 2352 1 0 0 2352 0 1.67496 2.51553 0.264695 0.505508
1 10235 T TA 2352 0 0 1 1 2352 0 0 -0.108472 0.207841 -2.51104 0.601742
1 10352 T TA 2352 0.5 2352 1 0 0 2352 0 0.665562 2.61389 0.0974122 0.799013
1 10539 C A 2352 0 0 1 1 2352 0 0 -0.00020902 0.0626305 -0.0532862 0.997337
Imputation quality files:
An example file has the following format:
CHROM POS REF ALT Rsq
1 10177 A AC 0.00581
1 10235 T TA 0.00396
1 10352 T TA 0.00608
1 10539 C A 0.00154
1 10616 CCGCCGTTGCAAAGGCGCGCCG C 0.02085
1 10642 G A 0.00013reference allele
There are five columns, which are Chromosome #, Position, Reference allele, Alternative allele and R square quality. The higher of R square quality(the 5th column) value, the better of the genotype imputation quality.
VCF reference files:
The file names of the reference panel file.
You could download the files from 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502.
Also, you could subset the ranges you want by using tabix, with command looks like:
tabix -h ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768
For more details, please see: How do I get a sub-section of a VCF file?;
Annotation files:
chrom pos ref alt af anno gene
112207 1 32247432 G A 0.03425920 Nonsynonymous FAM167B
112208 1 32247598 A G 0.07993410 Exon FAM167B
112209 1 32247709 G T 0.00645166 Intron FAM167B
112210 1 32248405 G C 0.00290383 Nonsynonymous FAM167B
140157 1 41479019 C T 0.09314330 Utr3 EDN2
140158 1 41479369 C T 0.06422470 Utr3 EDN2
Ancestry files:
1.ref.ancestry.
You should have a original file (i.e. ref.ancestry.ori) as:
head(ref.ancestry.ori)
fid ancestry study
1 samp1 HIM HCHS
2 samp2 HIM HCHS
3 samp3 HIM HCHS
4 samp4 HIM HCHS
5 samp5 HIM HCHS
6 samp6 HIM HCHS
Then, you use:ref.ancestry <- cbind(ref.ancestry.ori[,1], paste(ref.ancestry.ori[,2], ref.ancestry.ori[,3], sep=","))
So, the final format should be a matrix:
[,1] [,2]
[1,] "samp1" "HIM,HCHS"
[2,] "samp2" "HIM,HCHS"
[3,] "samp3" "HIM,HCHS"
[4,] "samp4" "HIM,HCHS"
[5,] "samp5" "HIM,HCHS"
[6,] "samp6" "HIM,HCHS"
2.study.ancestry.
Just a vector, as
[1] "AACAC" "AMISH" "ARIC" "BAGS" "CFS" "CHS"
Feedback/Contact
Questions and requests can be sent to Github issue page (link) or Dajiang Liu (dajiang.liu@outlook.com) and Fang Chen(fchen1@hmc.psu.edu)
References
1: Xiaowei Zhan, Youna Hu, Bingshan Li, Goncalo R. Abecasis, and Dajiang J. Liu
RVTESTS: An Efficient and Comprehensive Tool for Rare Variant Association Analysis Using Sequence Data
Bioinformatics 2016 32: 1423-1426. doi:10.1093/bioinformatics/btw079 (PDF)