Imputation and quality control steps for combining multiple genome-wide datasets

doi: 10.3389/fgene.2014.00370. eCollection 2014.

Shefali S Verma, Mariza de Andrade, Gerard Tromp, Helena Kuivaniemi, Elizabeth Pugh, Bahram Namjou-Khales, Shubhabrata Mukherjee, Gail P Jarvik, Leah C Kottyan, Amber Burt, Yuki Bradford, Gretta D Armstrong, Kimberly Derr, Dana C Crawford, Jonathan L Haines, Rongling Li, David Crosslin, Marylyn D Ritchie


PMID: 25566314
PMCID: PMC4263197
DOI: 10.3389/fgene.2014.00370


Shefali S Verma et al. Front Genet. 2014.

Abstract

The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R² (estimated correlation between the imputed and true genotypes), and the relationship between allelic R² and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
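The quantity that allelic R² estimates can be illustrated with a minimal Python sketch. It assumes that true genotypes are available for a set of masked SNPs, which is not how the imputation software derives its estimate (the software estimates the value without knowing the truth). The function name `squared_correlation` and the toy data are hypothetical and are not part of the eMERGE pipeline.

```python
from math import fsum


def squared_correlation(dosages, true_counts):
    """Squared Pearson correlation between imputed allele dosages (0..2)
    and true allele counts (0, 1 or 2) for one SNP across samples."""
    n = len(dosages)
    if n != len(true_counts) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_d = fsum(dosages) / n
    mean_t = fsum(true_counts) / n
    cov = fsum((d - mean_d) * (t - mean_t) for d, t in zip(dosages, true_counts))
    var_d = fsum((d - mean_d) ** 2 for d in dosages)
    var_t = fsum((t - mean_t) ** 2 for t in true_counts)
    if var_d == 0 or var_t == 0:
        # Monomorphic in either vector: the correlation is undefined.
        return float("nan")
    return cov * cov / (var_d * var_t)


# Toy example (made-up values): dosages close to the true counts give R² near 1.
r2 = squared_correlation([0.1, 0.9, 1.8, 1.0], [0, 1, 2, 1])
```

Values near 1 indicate that the imputed dosages track the true genotypes closely; rare variants tend to yield lower values, which is why the abstract also examines R² against minor allele frequency.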

Keywords: eMERGE; electronic health records; genome-wide association; imputation.


Figures

Figure 1

Chromosome segmentation strategy for genome-wide imputation with BEAGLE. Each chromosome was divided into SNPlets which included 30,000 SNPs with a buffer of 700 SNPs at each end.
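The segmentation in this caption can be sketched as index arithmetic. The sketch below assumes that each SNPlet has a core of 30,000 SNPs and that the 700-SNP buffer is added on each side of that core (the caption could also be read as the buffer being part of the 30,000). The function name `snplet_windows` is hypothetical, and indices are 0-based with exclusive ends.

```python
def snplet_windows(n_snps, core=30_000, buffer=700):
    """Split the SNP indices 0..n_snps-1 of one chromosome into windows.

    Returns a list of (window_start, window_end, core_start, core_end)
    tuples with exclusive end indices. Buffers are clipped at the
    chromosome ends, so the first and last windows are shorter.
    """
    windows = []
    for core_start in range(0, n_snps, core):
        core_end = min(core_start + core, n_snps)
        window_start = max(core_start - buffer, 0)
        window_end = min(core_end + buffer, n_snps)
        windows.append((window_start, window_end, core_start, core_end))
    return windows


# Hypothetical chromosome with 65,000 genotyped SNPs.
windows = snplet_windows(65_000)
```

After imputation, only results for the core range of each window would be kept, so that every SNP is reported once and the buffer SNPs serve only as flanking context.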

Figure 2

Chromosome segmentation strategy for imputation with IMPUTE2. Each chromosome was divided into 6 Mb segments with 250 kbp overlap between them.
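A position-based version of the same idea is sketched below. It assumes that the 250 kbp overlap is implemented as a flank added to both sides of each 6 Mb core segment, which is one possible reading of the caption. The function name `chromosome_segments` and the 13 Mb chromosome length are hypothetical; coordinates are 1-based and inclusive.

```python
def chromosome_segments(chrom_length_bp, segment_bp=6_000_000, flank_bp=250_000):
    """Tile a chromosome into core segments and pad each with flanks.

    Returns a list of dicts with 'core' and 'padded' (start, end)
    intervals in base pairs. Padded intervals are clipped to the
    chromosome, so neighbouring padded intervals overlap.
    """
    segments = []
    for start in range(1, chrom_length_bp + 1, segment_bp):
        end = min(start + segment_bp - 1, chrom_length_bp)
        padded = (max(start - flank_bp, 1), min(end + flank_bp, chrom_length_bp))
        segments.append({"core": (start, end), "padded": padded})
    return segments


# Hypothetical 13 Mb chromosome.
segments = chromosome_segments(13_000_000)
```

Each core interval is what would be reported for its segment, while the padded interval supplies the flanking haplotype context used during imputation.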

Figure 3

Workflow and performance metrics for imputation with BEAGLE and IMPUTE2.

Figure 4

Frequency distribution of the “info” quality metric (A,B) and relationship between the “info” score and minor allele frequency (MAF) (C,D). The secondary axis indicates the count of SNPs in each MAF bin (0.01 intervals).
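The binning behind panels C and D can be sketched as follows, assuming each SNP is available as a `(maf, info)` pair with MAF between 0 and 0.5. The function name `summarize_info_by_maf` and the example values are hypothetical and do not come from the eMERGE data.

```python
from collections import defaultdict


def summarize_info_by_maf(snps, bin_width=0.01):
    """Group (maf, info) pairs into MAF bins of width bin_width.

    Returns {bin_index: (snp_count, mean_info)}, where bin i covers
    MAF values from i * bin_width up to (i + 1) * bin_width.
    MAF = 0.5 is placed in the last bin.
    """
    n_bins = round(0.5 / bin_width)
    totals = defaultdict(lambda: [0, 0.0])  # bin -> [count, sum of info]
    for maf, info in snps:
        if not 0.0 <= maf <= 0.5:
            raise ValueError(f"MAF out of range: {maf}")
        # The small epsilon guards against floating-point division
        # landing just below an exact bin boundary.
        idx = min(int(maf / bin_width + 1e-9), n_bins - 1)
        totals[idx][0] += 1
        totals[idx][1] += info
    return {idx: (count, total / count) for idx, (count, total) in totals.items()}


# Made-up SNPs: two rare, one intermediate, two common.
summary = summarize_info_by_maf(
    [(0.004, 0.4), (0.006, 0.6), (0.123, 0.9), (0.499, 0.95), (0.5, 0.97)]
)
```

The count per bin corresponds to the secondary axis described in the caption, and the mean per bin is one simple way to summarise how the score changes with MAF.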

Figure 5

Best practices for analyzing imputed data.

Figure 6

Summary of principal component (PC) analysis for adult DNA samples. (A) PC1 and PC2 colored by self-reported race (AA, African American; EA, European American; HA, Hispanic; Others; -9, missing), (B) PC1 and PC2 colored by site, (C) Variance explained by the first 10 PCs.

Figure 7

Summary of principal component (PC) analysis for pediatric DNA samples. (A) PC1 and PC2 colored by self-reported race (AA, African American; EA, European American; HA, Hispanic; Others), (B) PC1 and PC2 colored by site, (C) Variance explained by the first 10 PCs.

