Population Structure and Relatedness Inference using the GENESIS Package
Contents
- 1 Overview
- 2 Data
- 3 Principal Components Analysis in Related Samples (PC-AiR)
- 4 Relatedness Estimation Adjusted for Principal Components (PC-Relate)
- 5 References
- Appendix
Overview
GENESIS provides statistical methodology for analyzing genetic data from samples with population structure and/or familial relatedness. This vignette describes how to use GENESIS to infer population structure and to estimate relatedness measures such as kinship coefficients, identity by descent (IBD) sharing probabilities, and inbreeding coefficients. GENESIS uses PC-AiR for population structure inference that is robust to known or cryptic relatedness, and it uses PC-Relate for accurate relatedness estimation in the presence of population structure, admixture, and departures from Hardy-Weinberg equilibrium.
Data
Reading in Genotype Data
The functions in the GENESIS package can read genotype data from a GenotypeData class object as created by the GWASTools package. Through the use of GWASTools, a GenotypeData class object can easily be created from:
- an R matrix of SNP genotype data
- a GDS file
- PLINK files
Example R code for creating a GenotypeData object is presented below. Much more detail can be found in the GWASTools package reference manual.
GENESIS can also work with genotype data from sequencing, starting with a VCF file. For examples using this format, see the vignette “Analyzing Sequence Data using the GENESIS Package”.
R Matrix
library(GWASTools)
geno <- MatrixGenotypeReader(genotype = genotype, snpID = snpID,
                             chromosome = chromosome, position = position,
                             scanID = scanID)
genoData <- GenotypeData(geno)
- genotype is a matrix of genotype values coded as 0 / 1 / 2, where rows index SNPs and columns index samples
- snpID is an integer vector of unique SNP IDs
- chromosome is an integer vector specifying the chromosome of each SNP
- position is an integer vector specifying the position of each SNP
- scanID is a vector of unique individual IDs
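As a minimal sketch of the inputs described above, the following builds a small simulated data set and passes it to MatrixGenotypeReader. The dimensions, IDs, positions, and genotype values are invented for illustration only. The GWASTools calls run only if that package is installed.

```r
# Toy inputs for MatrixGenotypeReader. All values are simulated for
# illustration and do not come from any real data set.
set.seed(1)
n_snp  <- 5L
n_scan <- 4L

# Rows index SNPs, columns index samples; genotypes are coded 0 / 1 / 2.
genotype <- matrix(sample(0:2, n_snp * n_scan, replace = TRUE),
                   nrow = n_snp, ncol = n_scan)

snpID      <- seq_len(n_snp)                    # unique integer SNP IDs
chromosome <- c(1L, 1L, 1L, 2L, 2L)             # integer chromosome codes
position   <- c(100L, 200L, 300L, 150L, 250L)   # integer base-pair positions
scanID     <- 101:104                           # unique individual IDs

# Consistency checks: one annotation entry per SNP, one ID per sample.
stopifnot(nrow(genotype) == length(snpID),
          nrow(genotype) == length(chromosome),
          nrow(genotype) == length(position),
          ncol(genotype) == length(scanID),
          !anyDuplicated(snpID),
          !anyDuplicated(scanID))

if (requireNamespace("GWASTools", quietly = TRUE)) {
  library(GWASTools)
  geno <- MatrixGenotypeReader(genotype = genotype, snpID = snpID,
                               chromosome = chromosome, position = position,
                               scanID = scanID)
  genoData <- GenotypeData(geno)
  # Report the dimensions seen by GWASTools.
  cat("SNPs:", nsnp(genoData), " samples:", nscan(genoData), "\n")
}
```

Checking the vector lengths against the matrix dimensions before constructing the reader makes a transposed matrix (samples in rows) easy to catch early.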
GDS files
geno <- GdsGenotypeReader(filename = "genotype.gds")
genoData <- GenotypeData(geno)
- filename is the file path to the GDS file
PLINK files
The SNPRelate package provides the snpgdsBED2GDS function to convert binary PLINK files into a GDS file.
library(SNPRelate)
snpgdsBED2GDS(bed.fn = "genotype.bed",
              bim.fn = "genotype.bim",
              fam.fn = "genotype.fam",
              out.gdsfn = "genotype.gds")
- bed.fn is the file path to the PLINK .bed file
- bim.fn is the file path to the PLINK .bim file
- fam.fn is the file path to the PLINK .fam file
- out.gdsfn is the file path for the output GDS file
Once the PLINK files have been converted to a GDS file, a GenotypeData object can be created as described above.
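The two steps (PLINK to GDS, then GDS to GenotypeData) can be chained as in the sketch below. The helper name plink_to_genodata is hypothetical and is not part of GENESIS, GWASTools, or SNPRelate. It assumes the three PLINK files share a common prefix, and the file names in the usage lines are placeholders.

```r
# Hypothetical convenience wrapper (not provided by any package):
# converts binary PLINK files to GDS and returns a GenotypeData object.
plink_to_genodata <- function(prefix, gds_out) {
  plink_files <- paste0(prefix, c(".bed", ".bim", ".fam"))

  # Fail early with a clear message if any input file is absent.
  missing_files <- plink_files[!file.exists(plink_files)]
  if (length(missing_files) > 0) {
    stop("PLINK file(s) not found: ",
         paste(missing_files, collapse = ", "))
  }

  # Step 1: PLINK -> GDS using SNPRelate.
  SNPRelate::snpgdsBED2GDS(bed.fn = plink_files[1],
                           bim.fn = plink_files[2],
                           fam.fn = plink_files[3],
                           out.gdsfn = gds_out)

  # Step 2: GDS -> GenotypeData using GWASTools.
  geno <- GWASTools::GdsGenotypeReader(filename = gds_out)
  GWASTools::GenotypeData(geno)
}

# Usage with placeholder file names; runs only if the files and
# both packages are available.
if (all(file.exists(paste0("genotype", c(".bed", ".bim", ".fam")))) &&
    requireNamespace("SNPRelate", quietly = TRUE) &&
    requireNamespace("GWASTools", quietly = TRUE)) {
  genoData <- plink_to_genodata("genotype", "genotype.gds")
}
```

Validating the input paths first keeps a mistyped prefix from surfacing as a less readable error inside the conversion step.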
HapMap Data
To demonstrate PC-AiR and PC-Relate analyses with the GENESIS package, we analyze SNP data from the Mexican Americans in Los Angeles, California (MXL) and African American individuals in the southwestern USA (ASW) population samples of HapMap 3. Mexican Americans and African Americans have a diverse ancestral background, and familial relatives are present in these data. Genotype data at a subset of 20K autosomal SNPs for 173 individuals are provided as a GDS file.
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
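As a hedged sketch, the example file can be opened as a GenotypeData object in the same way as any other GDS file, and its dimensions compared against the description above (173 individuals and a subset of roughly 20K autosomal SNPs). The block runs only if both GENESIS and GWASTools are installed. The variable names have_pkgs, n_samples, and n_snps are chosen here for illustration.

```r
# Run only when both packages are installed.
have_pkgs <- requireNamespace("GENESIS", quietly = TRUE) &&
  requireNamespace("GWASTools", quietly = TRUE)

if (have_pkgs) {
  library(GWASTools)

  # Locate the example GDS file shipped with GENESIS.
  gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds",
                         package = "GENESIS")

  # Create the GenotypeData object as described for GDS files.
  geno <- GdsGenotypeReader(filename = gdsfile)
  genoData <- GenotypeData(geno)

  # Number of samples and SNPs stored in the file.
  n_samples <- nscan(genoData)
  n_snps    <- nsnp(genoData)
  cat("samples:", n_samples, " SNPs:", n_snps, "\n")

  # Release the file handle when finished.
  close(genoData)
}
```

Closing the object releases the underlying GDS file handle, which matters if the same file is reopened later in the session.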
References
Appendix
- Conomos, M.P., Reiner, A.P., Weir, B.S., & Thornton, T.A. (2016). Model-free Estimation of Recent Genetic Relatedness. American Journal of Human Genetics, 98(1), 127-148.
- Conomos, M.P., Miller, M.B., & Thornton, T.A. (2015). Robust Inference of Population Structure for Ancestry Prediction and Correction of Stratification in the Presence of Relatedness. Genetic Epidemiology, 39(4), 276-293.
- Gogarten, S.M., Bhangale, T., Conomos, M.P., Laurie, C.A., McHugh, C.P., Painter, I., … & Laurie, C.C. (2012). GWASTools: an R/Bioconductor package for quality control and analysis of Genome-Wide Association Studies. Bioinformatics, 28(24), 3329-3331.
- Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., & Chen, W.M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.