IMPUTE2
IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009:
B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]
IMPUTE2 also includes features that were introduced in other publications, which you can find here.
The figure below shows the most common scenario in which imputation is used: unobserved genotypes (red question marks) in a set of study individuals are imputed (or predicted) using a set of reference haplotypes and genotypes from a SNP chip.
Getting Started
IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. Most people use just a couple of the program's basic functions, but we have also built up a collection of specialized and powerful options. If you are new to IMPUTE2, or indeed to phasing and imputation in general, we suggest that you start by learning the basics.
You should begin by downloading the program from here. You will need to choose the link that matches your computing platform and then follow the instructions for opening the download package.
Once you have done this, you will be ready to try some example analyses on the test data that are provided with the download. The section on Examples shows how to use the most common IMPUTE2 functions. We suggest that you work through these examples and try to understand what the elements of each command are doing. If you don't understand something or would like to know if the program can perform a function that isn't listed, you can read our FAQ or submit a question to our mail list.
When you have learned the basic functionality of the program, you can use several features of this website to prepare your own analysis:
- Learn about best practices for imputation.
- Download reference data that you can use to impute genotypes in your study.
- Look through a complete list of program options.
What's New?
New release (23 December 2014)
We have just released IMPUTE v2.3.2. This version is a very minor update that adds columns to the info files reporting the two alleles at each imputed variant.
New release (16 June 2014)
We have just released IMPUTE v2.3.1. This version fixes a bug in panel-merging functionality that caused variants seen in one of two reference panels to be imputed with a fixed allele (non-ref allele in Panel 0, ref allele in Panel 1). If you have used these options, then we would recommend re-running your imputation.
In addition, we have released a new version of the 1000 Genomes Phase 1 haplotypes. These are an updated version of the haplotypes released on 9 Dec 2013. It seems we had not completely resolved the strand flip issue with the previous release (see below), and a further 199 SNPs needed to be corrected in this new release.
The new haplotypes are available here.
New release (9 Dec 2013)
We have released a new version of the 1000 Genomes Phase 1 haplotypes. These are an updated version of the haplotypes released on 16 Sept 2013. There was a small problem with the strand of the Illumina OMNI data we used as the scaffold. 730 SNPs across the genome were not aligned to the + strand of the human genome reference. This does not affect the phasing of the haplotypes, but does affect downstream imputation, especially if these SNPs were genotyped directly in the study being imputed. The new haplotypes were not re-phased. We just switched the strand of the 730 affected SNPs.
The new haplotypes are available here.
New release (16 Sept 2013)
We have released a new version of the 1000 Genomes Phase 1 haplotypes. The haplotypes were phased using a new version of SHAPEIT2 that can handle genotype likelihoods and genotypes available from microarrays on the same samples. Using a set of validation genotypes at SNPs and biallelic indels, we have been able to show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants.
The new haplotypes are available here.
New software release (04 Jan 2013)
We have just released IMPUTE v2.3.0, which includes a number of new features and minor bug fixes. One valuable new function is a simple and robust approach for merging reference panels; for example, it is easy to combine 1,000 Genomes haplotypes with population-specific sequence data to capture the strength of both reference sets. We have also written detailed documentation for the concordance tables printed at the end of most IMPUTE2 runs.
Paper on "pre-phasing" study genotypes for faster imputation
We recently published an article called "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing" in Nature Genetics. This paper describes a strategy ("pre-phasing") for efficient genotype imputation with large reference panels. By reducing the computational burden of imputation, pre-phasing makes imputation-based studies feasible for groups with limited computing power, and it also makes it easier to re-impute existing GWAS datasets as more informative reference panels become available. You can learn more about pre-phasing with IMPUTE2 here.
Latest 1,000 Genomes Phase I reference panel
In March 2012, the 1,000 Genomes Project released a powerful reference panel known as "Phase I version 3". In August 2012, we modified this panel by excluding variants with only one copy of the minor allele (singletons) across all 1,092 individuals. Singleton variants are difficult to impute, yet they make up ~20% of all variants in the reference panel; removing them makes imputation faster without hurting the power for association mapping. You can download either the original reference panel or the modified version (which is labeled "macGT1" for "minor allele count greater than one") here.
Paper on imputation strategies for ancestrally diverse reference panels
We published an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. This paper describes our strategy for achieving high accuracy with ancestrally diverse reference panels, especially at low-frequency variants and in admixed study cohorts: we supply a cosmopolitan set of reference haplotypes to IMPUTE2, which can automatically find the most useful ones for each study individual with the help of the tuning parameter -k_hap. You can read more about the results that support this strategy in the article, and we provide practical suggestions for applying it here.
Pre-phasing with SHAPEIT
IMPUTE2's pre-phasing approach now works with phased haplotypes from SHAPEIT, a highly accurate phasing algorithm that can handle mixtures of unrelateds, duos, and trios. Details are available here. We highly recommend using SHAPEIT to infer the haplotypes underlying your study genotypes, then passing these to IMPUTE2 for imputation as shown in the second step of this example.
Download IMPUTE2
IMPUTE2 is freely available for academic use. To see rules for non-academic use, please read the LICENCE file, which is included with each software download.
Pre-compiled IMPUTE2 binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please send a message to our mail list.
The latest software release is v2.3.2. We support only the most recent version.
Platform | File |
---|---|
Linux (x86_64) Static Executable | impute_v2.3.2_x86_64_static.tgz |
Linux (x86_64) Dynamic Executable | impute_v2.3.2_x86_64_dynamic.tgz |
Mac OSX Intel | impute_v2.3.2_MacOSX_Intel.tgz |
Windows MS-DOS (Intel) | impute_v2.3.1_Windows.tgz (coming soon) |
Solaris 5.10 | impute_v2.3.2_Solaris5.10.tar.gz (coming soon) |
To unpack the files on a Linux computer, use a command like this:
tar -zxvf impute_v2.X.Y_i386.tgz
(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called impute2, a LICENCE file, and an Example/ directory that contains example data files. We show how to perform various kinds of analyses with the example files here.
Download Reference Data
IMPUTE2 can use publicly available reference datasets, such as haplotypes from major sequencing projects, as well as customized reference panels, such as SNP genotypes from a fine-mapping study. If you would like to download a public dataset, just click the relevant link below, which will take you to a page with background information and download options for that dataset.
Link to download page | NCBI build | Haplotype release date | Release status |
---|---|---|---|
1000 Genomes Phase 3 | b37 | October 2014 | |
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) | b37 | June 2014 | |
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) | b37 | Dec 2013 | |
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) | b37 | Sep 2013 | |
1000 Genomes Phase I integrated variant set | b37 | Mar 2012 | Includes chrX; updated 24 Aug 2012 |
1000 Genomes Phase I (interim) | b37 | Jun 2011 | Includes chrX; updated 19 Apr 2012 |
1000 Genomes (2010 interim) | b37 | Dec 2010 | |
1000 Genomes Pilot + HapMap 3 | b36 | Jun 2010 / Feb 2009 | |
1000 Genomes Pilot | b36 | Jun 2010 | |
HapMap 3 (release #2) | b36 | Feb 2009 | Includes chrX |
HapMap 2 (release #24) | b36 | Oct 2008 | |
HapMap 2 (release #22) | b36 | Jan 2008 | |
HapMap 2 (release #21) | b35 | Jul 2006 | |
Using Multi-Population Reference Panels
Overview
Human genetic variation resources, like those produced by HapMap 3 and the 1,000 Genomes Project, capture a broad cross-section of human genetic diversity: detailed variation data have now been collected from a variety of sampling locations in Africa, Asia, Europe, and the Americas. Large sequencing projects are actively expanding these datasets to include additional populations and deeper sampling within populations. These public databases provide powerful reference panels for genotype imputation studies.
In this context, one important question is how to choose a reference panel that will produce high imputation accuracy in a population of interest. The answer is seldom obvious because human populations have experienced complex demographic histories with many migration and mixture events. Consequently, it can be hard to decide which reference haplotypes should be used in a particular study.
We have proposed a simple and universal solution to this problem: we provide all available reference haplotypes to IMPUTE2, then let the software choose a "custom" reference panel for each individual to be imputed. There are several advantages to this approach:
- Investigators do not need to waste time deciding which haplotypes to include in the reference panel. Good results can be obtained in any study population by tuning a single software parameter (-k_hap) with a simple rule of thumb; see below for more details.
- This strategy works in a variety of human populations. Our group and others have used this approach to successfully impute populations ranging from homogeneous isolates to recent and complex admixtures.
- IMPUTE2 is often more accurate with an ancestrally inclusive reference panel than with a smaller panel chosen by intuition. This is because individuals from "diverged" populations may still share genomic segments of recent common ancestry, and IMPUTE2 can use this haplotype sharing to improve accuracy. At the same time, the software can ignore haplotypes that are not helpful.
The benefits of using inclusive reference panels are greatest at low-frequency variants (MAF < 5%), since these variants may be poorly represented in a reference panel from the population of interest (due to sampling effects) but well-represented in a panel from a different population (e.g., due to genetic drift).
- IMPUTE2 can efficiently process large reference panels. You might worry that using all available reference haplotypes would greatly increase the computational burden of imputation, but IMPUTE2 uses an approximation that limits the cost of adding reference haplotypes while maintaining (or improving) accuracy.
Practical suggestions
There are a few program settings that you should be aware of when using IMPUTE2 with an ancestrally diverse reference panel:
- -k_hap: This parameter determines how many of the reference haplotypes will be used in the "custom" reference panel for each study individual. The default value is 500, which is a good starting point for modern reference datasets.
As a rule of thumb, you should set -k_hap to the number of reference haplotypes that you expect to be useful for your study population. For example, suppose you were imputing a Spanish dataset from a reference panel containing 400 Western European haplotypes and 400 African American haplotypes. In this case, you could achieve high accuracy by leaving -k_hap at the default value of 500 since, in any part of the genome, the expected number of reference haplotypes with European ancestry is roughly 400 + 0.2 * (400) = 480. (This calculation assumes that, on average, African American haplotypes have 20% European ancestry.)
Imputation accuracy is not very sensitive to -k_hap, which is why this rule of thumb usually provides good results without requiring detailed parameter tuning. If you want advice on the best value for your dataset, please send a message to our mail list.
- -Ne: This parameter controls the effective population size in the population-genetic model used by IMPUTE2. Different human populations have different effective sizes (as estimated from genetic diversity levels), so it is not obvious how to choose a single -Ne value when using a multi-population reference panel.
Fortunately, we have found that IMPUTE2 achieves high accuracy across a wide range of -Ne values, with slightly higher accuracy at large values.
We therefore recommend a universal -Ne value of 20000, regardless of the study population being imputed or the composition of the reference panel. This will become the default value in our next software release (v2.1.3), but for now you should set it manually.
- -int: This command-line option specifies the boundaries of the region to be imputed on the current chromosome, using two numbers. For example, "-int 1 5e6" tells IMPUTE2 to analyze physical positions 1-5,000,000.
The imputation interval should not be too large because this weakens IMPUTE2's approximation for choosing custom reference panels, which is based on an assumption of limited recombination in the region being analyzed. In theory, it might be desirable to tailor the interval size to the population being imputed (e.g., to use shorter intervals in African populations), but in practice, we have found that the exact size of the interval has little effect on imputation accuracy as long as the interval is relatively small (say, < 10 Mb).
We therefore recommend that the size of the analysis interval be chosen for computational convenience, without regard to the ancestry of the study or reference datasets.
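The -k_hap rule of thumb described above is simple arithmetic, and it can help to write it down explicitly when a reference panel mixes several groups. The sketch below is only an illustration: the function name and the per-panel ancestry fractions are assumptions supplied by the user, not values that IMPUTE2 reads or reports.

```python
def expected_useful_haplotypes(panels):
    """Sum, over reference groups, of haplotype count times the assumed
    fraction of ancestry that is relevant to the study population."""
    return sum(n_haps * fraction for n_haps, fraction in panels)


# Spanish study example from the text: (number of haplotypes, assumed fraction)
panels = [
    (400, 1.0),  # Western European haplotypes
    (400, 0.2),  # African American haplotypes, assuming 20% European ancestry
]

expected = expected_useful_haplotypes(panels)
default_k_hap = 500  # default -k_hap value stated in the text

print("Expected useful haplotypes:", expected)
print("Default -k_hap is sufficient:", expected <= default_k_hap)
```

Because accuracy is not very sensitive to -k_hap, a rough estimate of the ancestry fractions is usually enough for this calculation.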
How does it work?
As explained above, we believe that the best way to use IMPUTE2 with modern reference panels is to provide all available haplotypes to the program and let it choose which ones to use. Here, we explain how this approach works.
IMPUTE2 does not use population labels or other genome-wide measures of relatedness between individuals, either for the reference haplotypes or the individuals being imputed. Instead, it looks for reference haplotypes that share high sequence identity with the haplotypes of a particular study individual. These haplotypes constitute a "custom" reference panel that can be used to impute missing genotypes in the individual of interest.
This process is largely insensitive to the ancestral composition of the reference panel: as long as the panel contains haplotypes that share segments of recent common ancestry with individuals in a study, IMPUTE2 can find the shared segments and use them to impute missing alleles. Consequently, the reference panel does not need to be restricted to haplotypes that "match" the ancestry of the study individuals; it can also include other kinds of haplotypes:
- Recently admixed haplotypes: If two or more distinct populations have mixed within the past few hundred years, the resulting admixed population may contain some haplotype segments that are closely related to a population of interest and other segments that are highly diverged. IMPUTE2 can identify the useful segments while ignoring the diverged segments, thereby achieving accurate imputation.
- Moderately diverged haplotypes: Even if a set of reference haplotypes comes from a different population than the one you want to impute, it may still provide segments of recent ancestry that can help the imputation. The prevalence of such segments is a complicated function of reference panel size and population history, but in our experience there is often a surprising amount of ancestry sharing between genetically distinct populations.
- Highly diverged haplotypes: Reference haplotypes that are highly diverged from your study population are unlikely to be useful for imputation, but such haplotypes are easily identified and ignored by IMPUTE2. In other words, highly diverged reference haplotypes neither help nor hurt imputation accuracy. This is important because the distinction between "moderately" and "highly" diverged populations is not always clear; since it does not hurt to include unhelpful reference haplotypes, we can err on the side of including too many in order to capture more of the moderately diverged ones that improve imputation accuracy.
Expert users will note that the model underlying IMPUTE2 is formally designed to represent genetic variation in a single population. This might imply that the method would have trouble using reference panels that include populations with different linkage disequilibrium patterns, nucleotide diversity levels, and allele frequency spectra. However, we have found that IMPUTE2 is extremely adaptable: it can find segments of shared ancestry in multi-population reference panels despite its simple model of human populations, and it is largely robust to changes in its model parameters. Imputation accuracy might theoretically be improved by more detailed modeling of population relationships (for example, the population labels that IMPUTE2 ignores might sometimes be informative), but we believe that our approach captures most of the potential accuracy in an efficient way.
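To make the idea of a "custom" reference panel concrete, here is a toy sketch of selecting the reference haplotypes that share the highest sequence identity with one study haplotype. It is a simplified illustration and not IMPUTE2's implementation: the function names are hypothetical, the similarity measure (a plain count of mismatching alleles) is an assumption made for the example, and the real program performs its selection inside its own statistical model.

```python
def mismatches(hap_a, hap_b):
    """Count positions at which two equally long 0/1 haplotypes differ."""
    return sum(a != b for a, b in zip(hap_a, hap_b))


def choose_custom_panel(study_hap, reference_haps, k_hap):
    """Return indices of the k_hap reference haplotypes closest to study_hap.

    No population labels are used; only allele sharing in this region matters.
    """
    ranked = sorted(
        range(len(reference_haps)),
        key=lambda i: mismatches(study_hap, reference_haps[i]),
    )
    return ranked[:k_hap]


study = [0, 1, 1, 0, 1]
references = [
    [0, 1, 1, 0, 1],  # identical in this region
    [1, 0, 0, 1, 0],  # differs everywhere: highly diverged
    [0, 1, 1, 0, 0],  # one mismatch
    [1, 1, 1, 0, 0],  # two mismatches
]

chosen = choose_custom_panel(study, references, k_hap=2)
print("Chosen reference haplotypes:", chosen)
```

In this toy setting the highly diverged haplotype is never chosen, which mirrors the point that including such haplotypes in the panel neither helps nor hurts.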
Published results
We published our work supporting these ideas in an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. Please cite this paper and the original IMPUTE2 paper when using IMPUTE2 with multi-population reference panels like those from the 1,000 Genomes Project.
Examples
This section provides some example commands that illustrate typical applications of IMPUTE2. All of the data files used in these commands are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable). Detailed explanations are provided at each link below.
Run type | Description |
---|---|
Imputation with one phased reference panel | Basic scenario in which most people will use IMPUTE2. |
Imputation with one phased reference panel (pre-phasing) | As above, but with pre-phasing functionality to speed up the analysis. |
Imputation with one phased reference panel (chromosome X) | Basic imputation scenario applied to human chromosome X, which requires special program options. |
Imputation with one phased reference panel (plus variant filtering) | Basic imputation scenario with flexible filtering of reference panel variants. |
Imputation with one unphased reference panel | Basic imputation scenario adapted to unphased reference genotypes. |
Imputation with two phased reference panels | Extended functionality for imputing from multiple reference panels defined on different sets of variants. |
Imputation with two phased reference panels (merge reference panels) | Merge reference panels defined on different sets of variants and use combined panel for imputation. |
Imputation with one phased and one unphased reference panel | Specialized method for combining reference panels of different types. |
Imputation with one phased and one unphased reference panel, with additional options | As above, but illustrating a variety of options that can be used to customize the behavior of IMPUTE2. |
Phasing | Methodology for inferring haplotypes from unphased genotypes. |
Phasing with a reference panel | Phasing analysis aided by reference haplotypes. |
How to use example commands
All of the data files in the example commands below are included in the Example/ directory that comes with the IMPUTE2 software download. You should run the command from the main download directory, which is the one that contains the impute2 executable. For example, if you just downloaded a software package named impute_v2.X.Y_i386.tgz and unpacked it according to the directions here, you can reach the appropriate directory by typing "cd impute_v2.X.Y_i386/" on the command line.
Once you have found the right directory, you should be able to run the example command by entering it into a Unix-style terminal window. Depending on the settings of your computer, this may be as simple as highlighting the command text in your web browser, using the browser's Copy command, and then using the Paste command in your terminal window. (You may then need to hit Enter to start the run.)
Note that most lines in the example command end with the '\' character. This is not actually part of the command; it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split the command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window, but it would be equivalent to put all of the arguments on a single line, separated by spaces.
You do not have to run IMPUTE2 exactly as in the example. Some of the arguments shown here are optional, and there are many other options that could be added to modify the behavior of the program. For a full list of available options, see here.
Most of the examples below include the string "-int 20.4e6 20.5e6", which tells the program to produce results for a 100 kb region (positions 20,400,000-20,500,000) on a single chromosome. IMPUTE2 assumes there is only one chromosome per input file, and that all input files in a single run come from the same chromosome. Applying the program to a much larger region (say, a whole chromosome or the whole genome) requires running many such jobs with different values of the -int parameter, usually in parallel on a computing cluster. For more details about how to do this, see here.
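As a rough sketch of how a larger region might be split into many -int jobs, the Python snippet below generates non-overlapping intervals and builds one command per interval. The region bounds, the 5 Mb chunk size, the helper names, and the output naming scheme are all assumptions made for illustration; the input file names are simply borrowed from the example data, which covers only a small part of chromosome 22. The snippet prints the commands rather than running them, so they can be handed to whatever job scheduler you use.

```python
def make_intervals(region_start, region_end, chunk_size):
    """Split [region_start, region_end] into consecutive, non-overlapping chunks."""
    intervals = []
    lower = region_start
    while lower <= region_end:
        upper = min(lower + chunk_size - 1, region_end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


def build_command(lower, upper, out_prefix):
    """Assemble an impute2 argument list for one interval (options as in the examples)."""
    return [
        "./impute2",
        "-m", "./Example/example.chr22.map",
        "-h", "./Example/example.chr22.1kG.haps",
        "-l", "./Example/example.chr22.1kG.legend",
        "-g", "./Example/example.chr22.study.gens",
        "-strand_g", "./Example/example.chr22.study.strand",
        "-int", str(lower), str(upper),
        "-Ne", "20000",
        # Hypothetical naming scheme: one output file per interval.
        "-o", f"{out_prefix}.{lower}_{upper}.impute2",
    ]


# Hypothetical 12 Mb region split into chunks of at most 5 Mb.
intervals = make_intervals(1, 12_000_000, 5_000_000)
commands = [build_command(lo, hi, "./Example/example.chr22") for lo, hi in intervals]

for cmd in commands:
    print(" ".join(cmd))
```

Keeping the chunk size well under 10 Mb follows the practical suggestion given earlier for the -int option.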
Imputation with one phased reference panel
This is the most common genotype imputation scenario: we want to impute untyped SNPs in a study dataset from a panel of reference haplotypes.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.one.phased.impute2
Comments
- Here we have used the -strand_g option to provide a strand file to the program. This file tells IMPUTE2 how to align the allele coding between the study genotypes (-g file) and the reference haplotypes (-h and -l files). You must always align the allele codings across your input datasets, either before running IMPUTE2 or during a run with the options described here.
- This command invokes the standard MCMC algorithm used by IMPUTE2, which usually provides accurate results in a reasonable amount of time. Another way to run this kind of analysis is to use our pre-phasing approach, which decreases the running time by orders of magnitude at the cost of a small drop in imputation accuracy. To see how to run this example with pre-phasing, click here.
Imputation with one phased reference panel (pre-phasing)
This is the most common genotype imputation scenario: we want to use a panel of reference haplotypes to impute SNPs that were not typed in a study. Here, we show how to perform this task via pre-phasing, which is an approach that speeds up the imputation process by splitting it into two steps: (i) statistically phase the study genotypes; (ii) impute from the reference panel into the estimated study haplotypes.
The following commands show how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
Step 1: Pre-phasing
./impute2 \
-prephase_g \
-m ./Example/example.chr22.map \
-g ./Example/example.chr22.study.gens \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.prephasing.impute2
Step 2: Imputation into pre-phased haplotypes
./impute2 \
-use_prephased_g \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-known_haps_g ./Example/example.chr22.prephasing.impute2_haps \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.one.phased.impute2 \
-phase
Comments
- Pre-phasing is a useful technique for speeding up an imputation run, but it is even more useful if you want to impute a single study dataset from different reference panels (e.g., successive updates to the reference haplotypes released by the 1,000 Genomes Project). In that situation, you can perform the pre-phasing step just once and save the estimated haplotypes; you can then use the same study haplotypes to perform the imputation step with each new reference panel.
- If you are using IMPUTE2 for both the pre-phasing and subsequent imputation, it is important to use the same values of the -int parameter in both steps.
- The -prephase_g flag activates a couple of features that are necessary for pre-phasing. First, it tells the program to estimate and print phased haplotypes at SNPs included in the -g file; the haplotypes will be written to a file named "[-o]_haps", where [-o] is the name supplied for the main output file. These haplotypes will include SNPs in the buffer regions that flank the main region specified via -int. Extending the haplotypes into the buffer regions helps prevent edge effects in downstream imputation runs.
- It is possible to include a reference panel in the pre-phasing step, and this may improve the phasing quality. See here for an example of this kind of analysis (note that the linked example is missing the -prephase_g flag). To expedite the pre-phasing in this scenario, the program will not impute reference-only variants when -prephase_g is active, although you can override this behavior with the -os option.
- You can use the -strand_g option in either the pre-phasing or downstream imputation step, but you should not use it in both. Strand alignment is not usually necessary when you just want to phase a dataset, but it is important when that dataset will be combined with a reference panel in a downstream analysis, as in this case.
- Note that the file supplied to the -known_haps_g argument in the imputation step is the estimated haplotypes file from the pre-phasing step ("[-o]_haps"). Also note that the -use_prephased_g flag must be provided when imputing into pre-phased haplotypes.
- The -phase option in Step 2 above produces a file containing the haplotypes at the imputed and genotyped sites. In this example, the file would be called example.chr22.one.phased.impute2_haps.
- Pre-phasing based imputation on chromosome X is also possible. The only things you need to do differently are to supply a sample file for your data using the -sample_g option (so IMPUTE2 knows which individuals are male and which are female), use the -chrX flag, and ensure that your male haploid genotypes are encoded according to the described file format. If you use SHAPEIT2 for the phasing step, that software also has options to phase chromosome X. Below is an example of how to carry out chromosome X pre-phasing based imputation:
Step 1: Chromosome X Pre-phasing
./impute2 \
-prephase_g \
-chrX \
-m ./Example/chrX/example.chrX.map \
-g ./Example/chrX/example.chrX.study.gen \
-sample_g ./Example/chrX/example.chrX.study.sample \
-int 10.3e6 10.7e6 \
-Ne 20000 \
-o ./Example/chrX/example.chrX.prephasing.impute2
Step 2: Imputation into pre-phased chromosome X haplotypes
./impute2 \
-use_prephased_g \
-chrX \
-m ./Example/chrX/example.chrX.map \
-h ./Example/chrX/example.chrX.reference.hap \
-l ./Example/chrX/example.chrX.reference.legend \
-known_haps_g ./Example/chrX/example.chrX.prephasing.impute2_haps \
-int 10.3e6 10.7e6 \
-Ne 20000 \
-o ./Example/chrX/example.chrX.one.phased.impute2 \
-phase
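The two-step pre-phasing examples above share two conventions that are easy to get wrong when scripting: both steps should use the same -int values, and the file passed to -known_haps_g in Step 2 is the Step 1 output name with "_haps" appended. The sketch below builds both autosomal commands from a single set of inputs so that these stay consistent. The function name is hypothetical, and the snippet only prints the commands; running them (for example with subprocess) is left to the reader.

```python
import shlex


def prephasing_commands(interval, prephase_out, impute_out):
    """Build the Step 1 and Step 2 argument lists from shared settings."""
    lower, upper = interval
    # Options that must be identical in both steps.
    shared = [
        "-m", "./Example/example.chr22.map",
        "-int", lower, upper,
        "-Ne", "20000",
    ]
    step1 = (
        ["./impute2", "-prephase_g", "-g", "./Example/example.chr22.study.gens"]
        + shared
        + ["-o", prephase_out]
    )
    # Step 1 writes its phased haplotypes to "[-o]_haps".
    haps_file = prephase_out + "_haps"
    step2 = (
        [
            "./impute2", "-use_prephased_g",
            "-h", "./Example/example.chr22.1kG.haps",
            "-l", "./Example/example.chr22.1kG.legend",
            "-known_haps_g", haps_file,
            # Strand file is given in one step only (here, the imputation step).
            "-strand_g", "./Example/example.chr22.study.strand",
        ]
        + shared
        + ["-o", impute_out, "-phase"]
    )
    return step1, step2


step1, step2 = prephasing_commands(
    ("20.4e6", "20.5e6"),
    "./Example/example.chr22.prephasing.impute2",
    "./Example/example.chr22.one.phased.impute2",
)
print(shlex.join(step1))
print(shlex.join(step2))
```

If you re-impute from a newer reference panel later, only the Step 2 command needs to be rebuilt; the saved haplotypes from Step 1 can be reused.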
Imputation with one phased reference panel (chromosome X)
This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis on chromosome X, which requires special treatment due to the hemizygosity of males. (This example and the files in our download packages focus on the non-pseudoautosomal part of chromosome X.)
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-chrX \
-m ./Example/chrX/example.chrX.map \
-h ./Example/chrX/example.chrX.reference.hap \
-l ./Example/chrX/example.chrX.reference.legend \
-g ./Example/chrX/example.chrX.study.gen \
-sample_g ./Example/chrX/example.chrX.study.sample \
-int 10.3e6 10.7e6 \
-Ne 20000 \
-o ./Example/chrX/example.chrX.one.phased.impute2
Comments
- The -chrX flag is essential because it tells IMPUTE2 to expect the special file formatting conventions used for chromosome X data.
- Whenever you analyze data on chromosome X, you must also provide a -sample_g file so that the program knows which individuals are males and which are females. You can learn about the specific requirements of this file here.
- There is no need to use a different -Ne value on chromosome X than you would on the autosomes; the -chrX flag tells IMPUTE2 to automatically reduce the value by 25%, which changes the parameters of the haplotype copying model.
- Like the input files, the IMPUTE2 output files from chromosome X analyses should be interpreted according to these conventions.
File formats for chromosome X
Among human chromosomes, chromosome X is unique in that it is diploid (two copies) in females but hemizygous (one copy) in males. To deal with chromosome X data, IMPUTE2 requires that you use the -chrX flag and make some small changes to the input file formats.
- Genotypes file (-g): As in a standard -g file, each study individual should have three columns (genotype probabilities) per SNP. For females, these have the standard interpretation that columns 1, 2, and 3 represent P(G=0), P(G=1), and P(G=2), respectively, where G=1 is the heterozygous state. Males have only two possible genotypes on chromosome X, and we encode these in columns 1 and 3; column 2, which corresponds to P(G=1), should always be zero in this setting, and non-zero values in this column will automatically be truncated to zero for males when the -chrX flag is active.
- Sample file (-sample_g): In order for the input genotype convention explained above to work, IMPUTE2 needs to know which study individuals are males and which are females. This is accomplished by adding an extra column named 'sex' to the -sample_g file, which is required when using the -chrX flag. This column should be coded as type 'D' (discrete covariate), where males are indicated by '1's and females are indicated by '2's. Here is an example snippet where the first individual is female and the second and third individuals are male:
ID_1 ID_2 missing sex
0 0 0 D
INDIV1 INDIV1 0.0 2
INDIV2 INDIV2 0.0 1
INDIV3 INDIV3 0.0 1
- Reference haplotypes file (-h): It does not usually matter which reference individuals are male or female when their genotypes have already been phased. However, it may sometimes be convenient to create a -h file with two columns per individual, so IMPUTE2 allows the presence of dummy columns made of '-' characters to represent the non-existent second haplotypes of males on chromosome X. For example, here is a small haplotypes file with 5 SNPs (one per row) typed in a female (columns 1-2) and two males (columns 3-4 and 5-6):
0 1 1 - 0 -
0 0 1 - 1 -
1 0 0 - 1 -
1 1 0 - 1 -
0 0 1 - 0 -
The dummy columns are optional; the following would be an equally valid format for the same file:
0 1 1 0
0 0 1 1
1 0 0 1
1 1 0 1
0 0 1 0
- Output files (-o): The main output file will follow the same convention as the genotypes file described above: each individual has three entries per SNP, but the middle entry is set to zero for males. When IMPUTE2 produces haplotype output files for chromosome X, both males and females will have two columns per individual, although the second column for each male will be filled with dummy values of '-'.
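The chromosome X conventions above can be illustrated with a short Python sketch. This is not IMPUTE2 code: the function names (`read_sexes`, `apply_chrx_male_convention`, `strip_dummy_columns`) and the example -g row are hypothetical, and the sketch assumes whitespace-delimited files with five leading SNP columns in the -g file and two header lines in the -sample_g file.

```python
def read_sexes(sample_text):
    """Return the 'sex' code (1 = male, 2 = female) for each row of a -sample_g file."""
    lines = [ln.split() for ln in sample_text.strip().splitlines()]
    header, types, rows = lines[0], lines[1], lines[2:]
    sex_col = header.index("sex")
    if types[sex_col] != "D":
        raise ValueError("the 'sex' column should be declared as type 'D'")
    return [int(row[sex_col]) for row in rows]


def apply_chrx_male_convention(gen_line, sexes, n_header_cols=5):
    """Set the middle probability, P(G=1), to zero for every male in one -g row."""
    fields = gen_line.split()
    head, probs = fields[:n_header_cols], fields[n_header_cols:]
    if len(probs) != 3 * len(sexes):
        raise ValueError("expected three probabilities per individual")
    for i, sex in enumerate(sexes):
        if sex == 1:  # male: only columns 1 and 3 are meaningful
            probs[3 * i + 1] = "0"
    return " ".join(head + probs)


def strip_dummy_columns(hap_line):
    """Drop the optional '-' placeholder columns from one row of a chrX -h file."""
    return " ".join(tok for tok in hap_line.split() if tok != "-")


sample_text = """ID_1 ID_2 missing sex
0 0 0 D
INDIV1 INDIV1 0.0 2
INDIV2 INDIV2 0.0 1
INDIV3 INDIV3 0.0 1"""

sexes = read_sexes(sample_text)
# Hypothetical -g row: one female followed by two males.
gen_line = "snp1 rs1 10300100 A G 0 1 0 0.9 0.1 0 0 0 1"
fixed = apply_chrx_male_convention(gen_line, sexes)
print(sexes)
print(fixed)
print(strip_dummy_columns("0 1 1 - 0 -"))
```

The sketch only mirrors the truncation described above; it does not renormalize the remaining probabilities, and whether IMPUTE2 does so internally is not stated here.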
Imputation with one phased reference panel (plus variant filtering)
This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis after flexibly removing a subset of sites from the reference panel.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-filt_rules_l 'eur.maf<0.01' 'afr.maf<=0.05' 'TYPE==LOWCOV' \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.annot.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.one.phased.impute2
Comments
- The main novelty here is the use of the -filt_rules_l option. This option works by defining "filtering rules" that combine annotation categories (here, eur.maf, afr.maf, and TYPE) with comparison operators (<, <=, and ==) and values (0.01, 0.05, and LOWCOV). Each annotation string is present on the first line of the -l file and is followed by a column of numeric or character values (one for each site in the reference panel) that determine whether a given site should be filtered from the reference set. In this example, the filtering rules tell IMPUTE2 to ignore reference variants with minor allele frequency less than 1% in a European panel OR less than or equal to 5% in an African panel OR annotated as LOWCOV. (Filtering rules are always applied in 'OR' fashion.)
- You can make your own filtering rules by adding numeric or character annotation columns to a reference legend (-l) file, or you can use the annotations that we provide in some of our reference panel download packages. For example, we have included continent-level minor allele frequencies in the legend files for the 1,000 Genomes Phase 1 integrated variant reference panel.
- USAGE GUIDELINES FOR FILTERING RULES: Our main motivation in creating the -filt_rules_l option was to provide a fast and easy way of reducing the computational burden of large, sequence-based reference panels. A principled way to do this is to remove the reference SNPs that are expected to provide the least power in an imputation-based association analysis. We suggest that the rarest SNPs in a dataset fall into this category, both because there is generally less power to detect these under many study designs and because such SNPs are often harder to impute, which further diminishes the real power for detection. So, one simple approach is to use a minor allele frequency filtering rule (e.g., 'eur.maf<0.01') for MAF annotations from a population like the one being studied.
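To see which sites a set of rules would remove before launching a run, the 'OR' logic can be mimicked outside the program. The Python sketch below is an illustration only, not IMPUTE2's implementation: the function names and the miniature legend (including the 'SNP' value in the TYPE column) are hypothetical, and only the three operators shown in the example command (<, <=, ==) are handled.

```python
import operator
import re

# Only the operators that appear in the example command are supported here.
_OPS = {"<=": operator.le, "==": operator.eq, "<": operator.lt}
_RULE = re.compile(r"^(?P<name>[^<=]+)(?P<op><=|==|<)(?P<value>.+)$")


def parse_rule(rule):
    """Split a rule such as 'eur.maf<0.01' into (annotation, operator, value)."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError(f"cannot parse filtering rule: {rule!r}")
    return match.group("name"), match.group("op"), match.group("value")


def read_legend(legend_text):
    """Read a legend file into a list of dicts keyed by the header strings."""
    lines = legend_text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def site_is_filtered(site, rules):
    """Return True if ANY rule matches the site (rules combine in 'OR' fashion)."""
    for rule in rules:
        name, op, value = parse_rule(rule)
        field = site[name]
        try:
            hit = _OPS[op](float(field), float(value))
        except ValueError:
            # Character annotations: only an equality test is meaningful.
            hit = op == "==" and field == value
        if hit:
            return True
    return False


# Hypothetical annotated legend with four sites.
legend_text = """rsID position a0 a1 eur.maf afr.maf TYPE
rs1 20400100 A G 0.200 0.300 SNP
rs2 20400200 C T 0.005 0.300 SNP
rs3 20400300 G A 0.200 0.050 SNP
rs4 20400400 T C 0.200 0.300 LOWCOV"""

rules = ["eur.maf<0.01", "afr.maf<=0.05", "TYPE==LOWCOV"]
kept = [s["rsID"] for s in read_legend(legend_text) if not site_is_filtered(s, rules)]
print(kept)
```

In this toy legend, rs2 fails the European frequency rule, rs3 fails the African frequency rule, and rs4 is removed by its LOWCOV annotation, so only rs1 would remain in the reference set.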
Imputation with one unphased reference panel
It is not necessary for the reference panel to be phased: IMPUTE2 can do the phasing internally while accounting for the phase uncertainty. To use an unphased reference panel, simply replace the -h and -l files with a -g_ref file.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-m ./Example/example.chr22.map \
-g_ref ./Example/example.chr22.reference.gens \
-strand_g_ref ./Example/example.chr22.reference.strand \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.one.unphased.impute2
Comments
- As with any imputation analysis, it is important that all of your input files be aligned to the same allele coding at shared SNPs. In this example, we assume that both the -g_ref and -g files include SNPs that are not aligned to the '+' strand of the human genome reference sequence, so we use the -strand_g_ref and -strand_g options to bring them into alignment.
- This procedure is not recommended for unphased reference panels that have high SNP density, such as those that result from resequencing studies of population samples. In that situation, there may be statistical convergence issues that could decrease the imputation quality. If you need advice on how to use that kind of reference dataset, please send a message to our mail list.
Imputation with two phased reference panels
It is sometimes helpful to use multiple reference panels to impute genotypes in a single study. For example, we previously recommended combining reference haplotypes from the 1,000 Genomes Pilot Project and HapMap 3: the first set provided extensive coverage of polymorphisms in the genome, while the second set provided greater sample size at a subset of SNPs. We no longer recommend that you use this hybrid reference panel because the 1,000 Genomes Project has generated even richer reference sets (which you can download here), but some investigators may have additional reference data that could be used in this way.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
./Example/example.chr22.hm3.haps \
-l ./Example/example.chr22.1kG.legend \
./Example/example.chr22.hm3.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.two.phased.impute2
Comments
- This is a somewhat complicated scenario, and some restrictions are necessary to make sure the statistical machinery will produce good results. Ideally, one reference panel should contain a subset of the SNPs typed in the other reference panel, and the study dataset should contain a subset of the SNPs typed in both reference panels. If your dataset deviates substantially from these conditions, you may obtain sub-optimal imputation accuracy. Please send a message to our mail list if you want advice on whether this scheme will work with your data.
- Assuming the conditions described above are nearly satisfied, the reference panel with a larger number of SNPs should always come first on the command line. In this example, there are more SNPs in the 1,000 Genomes ("1kG") panel than in the HapMap 3 ("hm3") panel, so the 1,000 Genomes files are listed first after the -h and -l arguments.
- Here we have used the -strand_g option to provide a strand file to the program. This file tells IMPUTE2 how to align the allele coding between the study genotypes (-g file) and the reference haplotypes (-h and -l files; assumed to be aligned to the '+' strand of the human genome reference sequence). You must always align the allele codings across your input datasets, either before running IMPUTE2 or during a run with the options described here.
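Before running a two-panel analysis, it can be useful to check how closely your files satisfy the nesting conditions described in the comments above and which panel should be listed first. The following Python sketch is one hedged way to do this by comparing base pair positions; the function names and the miniature legend contents are hypothetical, and in practice the study positions would be read from the third column of the -g file.

```python
def legend_positions(legend_text):
    """Return the set of base pair positions listed in a legend (-l) file."""
    rows = legend_text.strip().splitlines()[1:]  # skip the header line
    return {int(row.split()[1]) for row in rows}


def fraction_contained(inner, outer):
    """Fraction of the positions in `inner` that are also present in `outer`."""
    return len(inner & outer) / len(inner) if inner else 1.0


def order_panels(panels):
    """Sort (name, positions) pairs so that the panel with more SNPs comes first."""
    return sorted(panels, key=lambda panel: len(panel[1]), reverse=True)


# Hypothetical miniature legend files for the two reference panels.
kg_legend = """rsID position a0 a1
rs1 20400100 A G
rs2 20400200 C T
rs3 20400300 G A
rs4 20400400 T C"""
hm3_legend = """rsID position a0 a1
rs2 20400200 C T
rs4 20400400 T C"""
study_positions = {20400200}

first, second = order_panels([("hm3", legend_positions(hm3_legend)),
                              ("1kG", legend_positions(kg_legend))])
nested = fraction_contained(second[1], first[1])
study_in_both = fraction_contained(study_positions, first[1] & second[1])
print("list first after -h and -l:", first[0])
print("share of second panel found in first panel:", nested)
print("share of study SNPs found in both panels:", study_in_both)
```

Values well below 1.0 for either fraction would suggest that the dataset deviates from the ideal configuration; the sketch does not define how large a deviation is acceptable.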
Imputation with two phased reference panels (merge reference panels)
Many investigators have access to multiple reference panels that could inform their imputation analyses. For example, they might want to supplement the 1,000 Genomes haplotypes (which can be downloaded here) with dedicated sequencing data from a study population.
If you have two panels that have been phased and put into IMPUTE2's reference format (legend/haplotype file pairs), you can ask the program to merge them internally and impute your study genotypes by entering the following command, which uses example data that come with the program download:
./impute2 \
-merge_ref_panels \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
./Example/example.chr22.hm3.haps \
-l ./Example/example.chr22.1kG.legend \
./Example/example.chr22.hm3.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.two.phased.impute2
Comments
- For details on how the reference panel merging works, please read the documentation.
- This approach also works with pre-phased study haplotypes. To use pre-phased study data in this example, you would replace the -g file with a -known_haps_g file and add the -use_prephased_g flag to your IMPUTE2 command.
- If you want to print the merged, phased panel in IMPUTE2 reference format (one -l file and one -h file), you should add the -merge_ref_panels_output_ref flag.
- If you want to print the merged, unphased panel in IMPUTE2 genotype format (one -g file), you should add the -merge_ref_panels_output_gen flag.
- If you simply want to merge two reference panels without imputing missing genotypes in a study dataset, you should add the -merge_ref_panels_output_ref or -merge_ref_panels_output_gen flag and omit the study genotypes (-g or -known_haps_g file) from your IMPUTE2 command.
Imputation with one phased and one unphased reference panel
Sometimes it is useful to combine a phased reference panel with an unphased reference panel when imputing genotypes in a study. For example, Howie et al. (2009) considered a hybrid reference panel that included phased haplotypes from HapMap and unphased genotypes from population controls typed on multiple SNP chips (they referred to this configuration as "Scenario B"). By using the genetic information in both panels simultaneously, IMPUTE2 can achieve a better combination of accuracy and coverage than it would with either panel alone.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-g_ref ./Example/example.chr22.reference.gens \
-strand_g_ref ./Example/example.chr22.reference.strand \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.one.phased.one.unphased.impute2
Comments
- This is a somewhat complicated scenario, and some restrictions are necessary to make sure the statistical machinery will produce good results. Ideally, the study data (-g file) should contain a subset of the SNPs in the unphased reference panel (-g_ref file), which should in turn contain a subset of the SNPs in the phased reference panel (-h and -l files). If your dataset deviates substantially from these conditions, you may obtain sub-optimal imputation accuracy. Please send a message to our mail list if you want advice on whether this scheme will work with your dataset.
- Here we have used the -strand_g and -strand_g_ref options to provide strand files to the program. These files tell IMPUTE2 how to align the allele coding of the study genotypes (-g file) and the unphased reference genotypes (-g_ref file) with the coding of the phased reference haplotypes (-h and -l files; assumed to be aligned to the '+' strand of the human genome reference sequence). You must always align the allele codings across your input datasets, either before running IMPUTE2 or during a run with the options described here.
- Additional options must be invoked if you want to include the -g_ref panel in your association tests (e.g., as part of your control set). This process requires a fair amount of imputation expertise, and we prefer to advise people about it on an individual basis. If you are interested in using this approach, please send a message to our mail list.
Imputation with one phased and one unphased reference panel, with additional options
Here we perform the same basic analysis as in the previous example (one phased and one unphased reference panel), but we use a number of additional options to modify the behavior of IMPUTE2.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-g_ref ./Example/example.chr22.reference.gens \
-strand_g_ref ./Example/example.chr22.reference.strand \
-exclude_snps_g_ref ./Example/example.chr22.reference.snp.exclusions \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-align_by_maf_g \
-sample_g ./Example/example.study.samples \
-exclude_samples_g ./Example/example.study.sample.exclusions \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-k 100 \
-burnin 5 \
-iter 20 \
-pgs \
-no_sample_qc_info \
-o_gz \
-o ./Example/example.chr22.complicated.impute2
Comments
- These comments will focus on the specialized options used in the example above; for comments on this general imputation scenario, see here.
- The -exclude_snps_g_ref option specifies a few SNPs to remove from the -g_ref file, using different types of SNP IDs. These might be SNPs that failed QC testing, for example.
- The -align_by_maf_g option tells the program to use minor allele frequencies to align the allele coding of A/T and C/G SNPs between the -g file and the -l file. However, the -strand_g option takes precedence over -align_by_maf_g, and in this case all of the genotyped SNPs have explicit alignments in the strand file, so the -align_by_maf_g flag has no effect.
- This run includes both a -sample_g file and an -exclude_samples_g file. The sample file tells IMPUTE2 which samples in the -g file are which, and the exclusions file tells it the IDs of samples that should be removed from the analysis. These might be individuals who showed systematic data quality problems on a genome-wide SNP chip, for example.
- Here we have increased -k from its default value of 80 to 100. This will increase the imputation accuracy, but it will also increase IMPUTE2's running time. In this example we have tried to offset the increased running time by decreasing the -burnin value from 10 (default) to 5 and the -iter value from 30 (default) to 20.
- The -pgs flag tells the program to "predict genotyped SNPs"; that is, to replace the original study genotypes with LD-based imputed genotypes in the output file.
- The -no_sample_qc_info flag suppresses the output file that shows quality control metrics for each individual in the -g file.
- The -o_gz flag specifies that the main output file should be compressed by the gzip algorithm; this is useful if you are running jobs that produce large output files.
Phasing
Although IMPUTE2 was originally designed to impute missing genotypes, it can also be used for a classical phasing analysis in which we want to infer the haplotypes underlying a set of observed genotypes. This functionality is activated via the -phase option.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-phase \
-m ./Example/example.chr22.map \
-g ./Example/example.chr22.study.gens \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.phasing.impute2
Comments
- The -o file is always reserved for imputation output, so the phased haplotypes in this example get printed to a file named ./Example/example.chr22.phasing.impute2_haps, where the _haps suffix is added automatically. The format of this output file is explained here.
- No strand alignment is needed in this example since we are using only one data panel. However, it may be important to align the strand at this stage if you intend to use the phased haplotypes for downstream imputation, i.e. in a pre-phasing analysis.
- In our experience this phasing procedure works well for SNP chip data, but it may have statistical convergence issues in datasets with high marker density, such as those that result from resequencing studies of population samples. If you would like to phase that kind of dataset, please send a message to our mail list for suggestions about how to improve the quality of inference.
We have not yet posted instructions for how to reattach phased haplotypes across successive chunks along a chromosome. If you want to try this approach to phasing a whole chromosome, please send a message to our mail list.
Phasing with a reference panel
Although IMPUTE2 was originally designed to impute missing genotypes, it can also be used for a classical phasing analysis in which we want to infer the haplotypes underlying a set of observed genotypes. This functionality is activated via the -phase option.
Here, we extend a basic phasing analysis to incorporate a phased reference panel. Population-based phasing methods work by pooling linkage disequilibrium information across individuals, so adding a panel of high-quality haplotypes can improve phasing accuracy.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-phase \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.phasing.impute2
Comments
- The -o file is always reserved for imputation output, so the phased haplotypes in this example get printed to a file named ./Example/example.chr22.phasing.impute2_haps, where the _haps suffix is added automatically. The format of this output file is explained here.
- The reference panel in this example includes SNPs that are not present in the -g file. IMPUTE2 can simultaneously impute the untyped SNPs and phase the typed SNPs in that file, but it will not phase the untyped SNPs; the main output file (./Example/example.chr22.phasing.impute2) will include estimated genotypes for all study + reference SNPs, but the phased haplotype output file (./Example/example.chr22.phasing.impute2_haps) will include only the SNPs from the -g file. We decided not to have the program produce haplotypes at reference-panel-only SNPs because the computation needed to provide good estimates is much greater than that needed to phase just the input genotypes or to impute the untyped SNPs without phasing them. If you really want to try phasing the untyped SNPs as well, please send a message to our mail list.
- If you don't care about imputing the reference-panel-only SNPs into your study data (i.e., you just want to phase the original genotypes), you can substantially speed up the inference by adding "-os 2" to the command line. This tells the program to "output SNPs of type 2", which are ones with input data in both the reference and study panels. By implicitly telling the program not to output other kinds of SNPs (e.g., those typed only in the reference panel), you allow it to avoid wasting calculations that won't contribute to the final output.
- Here we have used the -strand_g option to provide a strand file to the program. This file tells IMPUTE2 how to align the allele coding between the study genotypes (-g file) and the reference haplotypes (-h and -l files). You must always align the allele codings across your input datasets, either before running IMPUTE2 or during a run with the options described here.
- In our experience this phasing procedure works well for SNP chip data, but it may have statistical convergence issues in datasets with high marker density, such as those that result from resequencing studies of population samples. If you would like to phase that kind of dataset, please send a message to our mail list for suggestions about how to improve the quality of inference.
We have not yet posted instructions for how to reattach phased haplotypes across successive chunks along a chromosome. If you want to try this approach to phasing a whole chromosome, please send a message to our mail list.
Program Options
These links explain the command-line arguments that can be used to control IMPUTE2.
Option type | Description |
---|---|
Required arguments | The program will not run if these are not supplied. |
Input file options | A list of possible input files, with formatting requirements. |
Output file options | Naming conventions and options for controlling format of output files. |
Basic options | Options for controlling how the program processes input data. |
Strand alignment options | Options for aligning allele coding across data files. |
Filtering options | Options for controlling the filters that get applied to input data. |
MCMC options | Options for controlling the MCMC algorithm. |
Pre-phasing options | Options that facilitate pre-phasing and subsequent imputation. |
Panel merging options | Options for merging a pair of reference panels. |
Chromosome X options | Options for analyzing chromosome X data. |
Expert options | Options to be used by experts only. |
Required arguments
This table shows the input arguments that you must supply in order for IMPUTE2 to run. These are just the minimum requirements; the program will not do anything useful unless you also supply other input options and/or data files.
Flag | Default | Description |
---|---|---|
-g REQUIRED unless -known_haps_g provided | none | File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO. If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the -known_haps_g option. |
-m REQUIRED | none | Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). All of our reference panel download packages come with appropriate recombination map files. |
-int REQUIRED | none | Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the section on analyzing whole chromosomes. IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the -allow_large_regions flag. |
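As an illustration of splitting a long region into chunks for the -int argument, the Python sketch below generates consecutive, non-overlapping intervals. The helper name `make_intervals` and the 5 Mb chunk size are assumptions chosen for the example (any size up to the 7 Mb limit mentioned above would pass the check), and the region boundaries are hypothetical.

```python
MAX_REGION_BP = 7_000_000  # larger regions need -allow_large_regions


def make_intervals(start, end, chunk=5_000_000):
    """Split [start, end] into consecutive intervals of at most `chunk` base pairs."""
    if chunk > MAX_REGION_BP:
        raise ValueError("chunks above 7 Mb would require -allow_large_regions")
    intervals = []
    lower = start
    while lower <= end:
        upper = min(lower + chunk - 1, end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


# Hypothetical 15 Mb region.
intervals = make_intervals(20_000_001, 35_000_000)
for lower, upper in intervals:
    print(f"-int {lower} {upper}")
```

Each printed line can be pasted into a separate IMPUTE2 command in place of the -int argument shown in the examples.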
Input file options
This table explains the formatting requirements for input data files that can be supplied to IMPUTE2. Some of these files allow more than one ID per SNP, but the program identifies SNPs internally by their base pair positions (which means that duplicate SNPs at a single position can cause problems). In all of these files, it is important that SNPs appear in base pair position order, from lowest to highest. It is also crucial that all SNP positions come from the same genome assembly (e.g., NCBI Build 37) so the program can combine information across input files.
Flag | Default | Description |
---|---|---|
-g REQUIRED unless -known_haps_g provided | none | File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO. If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the -known_haps_g option. |
-m REQUIRED | none | Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). All of our reference panel download packages come with appropriate recombination map files. |
-h <file 1> <file 2> | none | File of known haplotypes, with one row per SNP and one column per haplotype. All alleles must be coded as 0 or 1, and each -h file must be provided with a corresponding legend file. We provide formatted haplotypes from the HapMap Project and the 1,000 Genomes Project in our reference panel download packages. In IMPUTE2, it is possible to specify two -h files. In this case, the file with more SNPs should be provided first (in the <file 1> position) and the file with fewer SNPs should be provided second (in the <file 2> position), with a single space separating the file names. |
-l <file 1> <file 2> | none | Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). We provide legend files for data from the HapMap Project and the 1,000 Genomes Project in our reference panel download packages. When using two -h files with IMPUTE2, you must supply the corresponding legend files in the same order (i.e., the file with more SNPs comes first). |
-g_ref | none | File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file). |
-known_haps_g | none | File containing known haplotypes for the study cohort. The format is the same as the output format from IMPUTE2's -phase option: five header columns (as in the -g file) followed by two columns (haplotypes) per individual. Allowed values in the haplotype columns are 0, 1, and ?. If your study dataset is fully phased, you can replace the -g file with a -known_haps_g file. This will cause IMPUTE2 to perform haploid imputation, although it will still report diploid imputation probabilities in the main output file. If any genotypes are missing, they can be marked as '? ?' (two question marks separated by one space) in the input file. (The program does not allow just one allele from a diploid genotype to be missing.) If the reference panels are also phased, IMPUTE2 will perform a single, fast imputation step rather than its standard MCMC module; this is how the program imputes into pre-phased GWAS haplotypes. The -known_haps_g file can also be used to specify study genotypes that are "partially" phased, in the sense that some genotypes are phased relative to a fixed reference point while others are not. We anticipate that this will be most useful when trying to phase resequencing data onto a scaffold of known haplotypes. To mark a known genotype as unphased, place an asterisk immediately after each allele, with no space between the allele (0/1) and the asterisk (*); e.g., "0* 1*" for a heterozygous genotype of unknown phase. |
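The allele tokens allowed in a -known_haps_g file (0, 1, ?, and the asterisk-marked forms) can be checked with a small parser before a run. The Python sketch below is an illustration under stated assumptions, not IMPUTE2's own parser: the function names are hypothetical, and the rule that both alleles of a genotype must carry the asterisk is one reading of "place an asterisk immediately after each allele".

```python
def parse_allele(token):
    """Return (allele, unphased) for one token; allele is None when missing ('?')."""
    if token == "?":
        return None, False
    unphased = token.endswith("*")
    allele = token[:-1] if unphased else token
    if allele not in ("0", "1"):
        raise ValueError(f"unexpected allele token: {token!r}")
    return int(allele), unphased


def parse_genotype(token_a, token_b):
    """Classify one individual's pair of tokens as 'missing', 'phased', or 'unphased'."""
    (a, a_unphased), (b, b_unphased) = parse_allele(token_a), parse_allele(token_b)
    if (a is None) != (b is None):
        # The program does not allow just one allele of a genotype to be missing.
        raise ValueError("only one allele is missing")
    if a is None:
        return "missing", None
    if a_unphased != b_unphased:
        # Assumption: the asterisk is placed after both alleles or neither.
        raise ValueError("asterisk present on only one allele")
    return ("unphased" if a_unphased else "phased"), (a, b)


print(parse_genotype("1", "0"))
print(parse_genotype("0*", "1*"))
print(parse_genotype("?", "?"))
```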
Output file options
The options in this table control the format and naming conventions of output files printed by IMPUTE2.
Flag | Default | Description |
---|---|---|
-o | ./test.impute2 | Name of main output file. Follows the same format as the -g file. |
-i | [-o]_info | Name of SNP-wise information file with one line per SNP and a single header line at the beginning. This file always contains the following columns (header tags shown in parentheses): 1. SNP identifier from -g file (snp_id) 2. rsID (rs_id) 3. base pair position (position) 4. expected frequency of allele coded '1' in the -o file (exp_freq_a1) 5. measure of the observed statistical information associated with the allele frequency estimate (info) [details] 6. average certainty of best-guess genotypes (certainty) 7. internal "type" assigned to SNP (type) Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX. IMPUTE2 assigns every SNP an internal "type" which reflects the combination of input datasets that include data for that SNP; here, X gives the type, which takes values in {0,1,2}. You can learn how the program determines SNP types here. For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes and the best-guess imputed genotypes, where the input genotypes at that SNP have been masked internally and then imputed as if the SNP were of type X; similarly, r2_typeX is the squared correlation between input and masked/imputed genotypes at a SNP. The info_typeX column is the same information metric used in column 5, but here it is applied to genotypes that have been imputed from pseudo-type X SNPs in the leave-one-out masking experiment. These columns are useful for post-hoc quality control; we will soon explain how we use them in our section on Best Practices for Imputation. |
-r | [-o]_summary | Name of log file that records a summary of the screen output. |
-w | [-o]_warnings | Name of file that records warnings generated by IMPUTE2. |
-os ... | 0 1 2 3 | "Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling is discussed in the Overview). By default, all imputed and genotyped SNPs are included in the output, i.e., "-os 0 1 2 3". |
-o_gz | | Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large. |
-outdp | 3 | Specifies the number of decimal places to use for reporting genotype probabilities in the main output file. |
-no_snp_qc_info | | Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in the -i file. |
-no_sample_qc_info | | Suppresses printing of per-sample quality control metrics file. The default is to print a file named "[-i]_by_sample". |
-phase | | IMPUTE2 always implicitly phases the study genotypes (-g file), and this flag tells the program to print the best-guess haplotypes that result from the phasing process. In addition to the standard imputation output file, the program also prints a separate haplotype file named "[-o]_haps". This file contains the same five header columns as the standard output, along with two columns (haplotypes) per individual, in the same order they appear in the main output. In addition to this "best-guess" haplotype file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties occur in a file named "[-o]_haps_confidence". In this file, homozygotes are represented by * characters and heterozygotes are represented by numbers between 0.5 and 1.0; this is the estimated probability that the phasing between the current heterozygote and the previous heterozygote (upstream) is correct. By convention, the first heterozygous SNP in each individual for a given analysis region is assigned a phasing certainty of 1.0. As illustrated by our example commands, it is possible to use the -phase option to produce haplotypes without the use of a reference panel; i.e., to perform a classical phasing analysis. |
-pgs | | "Predict Genotyped SNPs": Tells the program to replace the input genotypes from the -g file with imputed genotypes in the -o file (applies to Type 2 SNPs only). |
-pgs_miss | | Unlike -pgs, which replaces all input genotypes with imputed genotypes, this option tells the program to replace only the missing genotypes at typed SNPs. That is, any input genotype whose maximum probability exceeds the -call_thresh will simply be reprinted in the -o file, whereas input genotypes that fall below the calling threshold will be imputed in the output. WARNING: This is an appealing option that will "fill in" sporadically missing genotypes in your input data. However, it is possible that this could cause subtle problems in downstream association testing. We therefore suggest that you use caution when applying this option. |
Details about 'info' metric
IMPUTE2 reports an information metric in the fifth column of its -i file. This metric is similar to the r-squared metrics reported by other programs like MaCH and Beagle. Although each of these metrics is defined differently, they tend to be correlated.
Our metric typically takes values between 0 and 1, where values near 1 indicate that a SNP has been imputed with high certainty. The metric can occasionally take negative values when the imputation is very uncertain, and we automatically assign a value of -1 when the metric is undefined (e.g., because it wasn't calculated).
Investigators often use the info metric to remove poorly imputed SNPs from their association testing results. There is no universal cutoff value for post-imputation SNP filtering; various groups have used cutoffs of 0.3 and 0.5, for example, but the right threshold for your analysis may differ. One way to assess different info thresholds is to see whether they produce sensible Q-Q plots, although we emphasize that Q-Q plots can look bad for many reasons besides your post-imputation filtering scheme.
We define our info metric and compare it against other metrics in a review paper that we recently published. If you have questions, please read that material first, then send a message to our mail list if anything is still unclear.
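As a rough illustration of post-imputation filtering, the sketch below keeps only the SNPs whose info value meets a chosen cutoff. It assumes a whitespace-delimited -i file with one header line and the info metric in the fifth column, as described above. The function name `filter_by_info`, the header labels, the example rows, and the 0.3 cutoff are all illustrative assumptions, not part of IMPUTE2.

```python
def filter_by_info(lines, threshold=0.3, info_col=4):
    """Yield rows (lists of strings) whose info value is >= threshold.

    Assumes the first line is a header and that the info metric sits in
    the fifth whitespace-separated column (index 4). Rows with an info
    value of -1 (undefined) are dropped because -1 is below any
    non-negative threshold.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header line
    for line in rows:
        fields = line.split()
        if len(fields) <= info_col:
            continue  # skip malformed or empty lines
        if float(fields[info_col]) >= threshold:
            yield fields


# Made-up rows for illustration only (hypothetical header labels).
example = [
    "snp_id rs_id position exp_freq_a1 info",
    "--- rs1 100 0.12 0.95",
    "--- rs2 200 0.03 0.21",
    "--- rs3 300 0.40 -1",
]
kept = [row[1] for row in filter_by_info(example)]
print(kept)
```

Because no single cutoff suits every study, the threshold is a parameter here rather than a constant; rerunning the filter at several values is one way to compare the resulting Q-Q plots.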
Basic options
These options control some basic processing that the program does to prepare input data for inference.
Flag | Default | Description |
---|---|---|
-int REQUIRED | none | Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the section on analyzing whole chromosomes. IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the -allow_large_regions flag. |
-buffer | 250 kb | Length of buffer region (in kb) to include on each side of the analysis interval specified by the -int option. SNPs in the buffer regions inform the inference but do not appear in output files (unless you activate the -include_buffer_in_output flag). Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval. Larger buffers may improve accuracy for low-frequency variants (since such variants tend to reside on long haplotype backgrounds) at the cost of longer running times. |
-allow_large_regions | | Allows the analysis of regions larger than 7 Mb. If this flag is not activated and the analysis interval plus buffer region exceeds 7 Mb, the program will quit with an error. The rationale for this flag is described here. |
-include_buffer_in_output | | Tells the program to include SNPs from the -buffer region in all output files. The main reason for using this option is to preserve the buffer information for downstream imputation, e.g. when pre-phasing a GWAS dataset. |
-Ne | 20000 | "Effective size" of the population (commonly denoted as _Ne_ in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates that IMPUTE2 uses to guide its model of linkage disequilibrium patterns. When most imputation runs were conducted with reference panels from HapMap Phase 2, we suggested values of 11418 for imputation from HapMap CEU, 17469 for YRI, and 14269 for CHB+JPT. Modern imputation analyses typically involve reference panels with greater ancestral diversity, which can make it hard to determine the "ideal" -Ne value for a particular study. Fortunately, we have found that imputation accuracy is highly robust to different -Ne values; within each of several human populations, we have obtained nearly identical accuracy levels for values between 10000 and 25000. We suggest setting -Ne to 20000 in the majority of modern imputation analyses. |
-call_thresh | 0.9 | Threshold for calling genotypes in the -g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing. NOTE: This threshold applies only to input genotypes. If you want to apply a calling threshold to IMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing software SNPTEST for better suggestions. |
-nind | # of indiv in -g file | Number of individuals from the -g file to include in the analysis. For example, to impute only the first five individuals, set -nind 5. This option is useful for debugging and test runs. |
-verbose | | Prints detailed output about the progress of imputation. By default, IMPUTE2 prints only the number of the current MCMC iteration when performing imputation, but this flag tells it to print more detailed updates. |
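
The -call_thresh rule described in the table can be sketched in a few lines. This is only an illustration of the stated rule (use the most probable genotype if its probability exceeds the threshold, otherwise treat the genotype as missing); the function name `call_genotype` and the tuple representation of the three genotype probabilities are assumptions made for the example, not IMPUTE2 internals.

```python
def call_genotype(probs, call_thresh=0.9):
    """Return the index (0, 1, or 2) of the called genotype, or None.

    `probs` holds the three input genotype probabilities for one
    individual at one SNP. The genotype with the maximum probability is
    called only if that probability exceeds `call_thresh`; otherwise
    the genotype is treated as missing (None).
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] > call_thresh else None


print(call_genotype((0.95, 0.05, 0.0)))  # confident call
print(call_genotype((0.60, 0.40, 0.0)))  # treated as missing
```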
Strand alignment options
In any imputation analysis, it is absolutely essential that all panels have their allele codings aligned to a fixed reference (usually the human genome reference sequence). The options in this table are meant to help align the allele codings in your input data files, but you should not assume that the program will do all the work for you. If you do not know exactly how your data were processed or what these options are doing, you should try to locate the original strand information or send a message to our mail list for assistance.
NOTE: IMPUTE2 will automatically align the strand between panels whenever it can do so unambiguously; e.g., flipping A/C in Panel 2 to match G/T in the reference. The options below pertain to variants where this is not possible, e.g. because an A/T SNP cannot be aligned by label alone.
NOTE: We currently assume that all phased reference files have already been aligned to the '+' strand of the human genome reference sequence, which is true of the files that we distribute; hence, the options here pertain only to study genotype files (like the -g and -known_haps_g files) and unphased reference files (i.e., a -g_ref file).
Flag | Default | Description |
---|---|---|
-strand_g | none | File showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file; the columns should be separated by a single space. The ordering of the SNPs in this file does not matter (by contrast to the -g file, which must be sorted by SNP position), and it is okay if some SNPs in the strand file are not present in the genotype file (e.g., due to filtering). We provide model strand files in the Example/ directory that comes with the software download. |
-strand_g_ref | none | Same as -strand_g, but applies to the -g_ref file. |
-align_by_maf_g | | Activates the program's internal strand alignment procedure for the -g file (AKA Panel 2; for details about the panel nomenclature used here, see the overview). The strand is aligned to the alleles in reference Panel 0, if present, otherwise to reference Panel 1. This option pertains only to A/T and C/G SNPs, which it aligns such that Panel 2 and the alignment reference (Panel 0 or 1) have the same minor allele. NOTE: This flag can be used in conjunction with the -strand_g option. In that case, the information from the strand file takes precedence, i.e., the program will not try to align the strand of SNPs that have explicit strand info already. This is useful if you have strand information for some SNPs but not others. NOTE: You should take care when using this option. In particular, it can get the alignment wrong at A/T and C/G SNPs with minor allele frequencies near 50%, which can hurt the inference by distorting the local haplotype patterns. The best way to get the correct alignment at these kinds of SNPs is to track down the original assay and determine which strand was measured. This flag replaces -fix_strand_g as of IMPUTE v2.2. |
-align_by_maf_g_ref | | Similar to -align_by_maf_g, but applies to the -g_ref file (Panel 1). In this case the strand is aligned to the alleles in Panel 0, so the flag does not work if Panel 0 was not provided (i.e., if you did not supply -l and -h files). NOTE: Just as -align_by_maf_g can be used in conjunction with -strand_g, this flag can be used in conjunction with the -strand_g_ref option. As before, the strand file takes precedence over aligning the strand by MAF. NOTE: As with -align_by_maf_g, you should be careful about using this option to align A/T and C/G SNPs with minor allele frequencies near 50%. This flag replaces -fix_strand_g_ref as of IMPUTE v2.2. |
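
To make the distinction between unambiguous and ambiguous strand alignment concrete, here is a minimal sketch. It is not IMPUTE2's internal procedure; the function names and the returned labels are hypothetical. The first helper reproduces the A/C versus G/T example from the note above, and the second mimics the idea behind aligning A/T and C/G SNPs by minor allele.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a0, a1):
    """True for A/T and C/G SNPs, whose alleles are each other's complement."""
    return COMPLEMENT[a0] == a1


def flip(a0, a1):
    """Return the allele pair as read from the opposite strand."""
    return COMPLEMENT[a0], COMPLEMENT[a1]


def align_to_reference(study, ref):
    """Classify a study allele pair against a reference allele pair.

    Returns 'ambiguous' for A/T and C/G SNPs (labels alone cannot decide),
    'same' if the labels already agree, 'flip' if they agree after taking
    complements, and 'mismatch' otherwise.
    """
    if is_strand_ambiguous(*study):
        return "ambiguous"
    if set(study) == set(ref):
        return "same"
    if set(flip(*study)) == set(ref):
        return "flip"
    return "mismatch"


def align_ambiguous_by_maf(study_freq_a0, ref_freq_a0):
    """Guess the strand of an ambiguous SNP from allele frequencies.

    Both arguments are the frequency of the identically labeled first
    allele. If that allele is minor in one panel and major in the other,
    the guess is 'flip'. The guess is unreliable near 50% frequency.
    """
    study_minor_is_a0 = study_freq_a0 < 0.5
    ref_minor_is_a0 = ref_freq_a0 < 0.5
    return "same" if study_minor_is_a0 == ref_minor_is_a0 else "flip"


print(align_to_reference(("A", "C"), ("G", "T")))
print(align_to_reference(("A", "T"), ("A", "T")))
print(align_ambiguous_by_maf(0.05, 0.95))
```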
Filtering options
The options in this table affect the way that the program filters the input data. Some of the options provide direct control over which samples and SNPs get included in the analysis, while others set rules for how the program should behave when faced with certain filtering choices. These options are designed to make filtering more flexible, so that it is easy to apply any desired set of filters to a single underlying genotype file.
Some of these options apply to the dataset as a whole while others apply only to specific panels. The flag name for each panel-specific option ends in the command-line symbol for the file on which it operates; e.g., to exclude SNPs from the -g file you should use -exclude_snps_g, and to exclude SNPs from the -g_ref file you should use -exclude_snps_g_ref.
Flag | Default | Description |
---|---|---|
-filt_rules_l ... | none | This option provides flexible variant filtering in the reference panel via "filter rules", which are based on annotation columns in a -l file. Each column should be labeled by a contiguous string (no whitespace) describing its contents. For example, the Example/ directory in the software download packages includes a file named example.chr22.1kG.annot.legend that contains columns named eur.maf, afr.maf, and TYPE. To filter variants based on the numeric annotation values in the -l file, you should combine a column string with a cutoff value and one of these six comparison operators: < <= > >= == != . For example, writing -filt_rules_l 'eur.maf<0.05' on the command line would tell the program to remove any variants with eur.maf values less than 0.05 from the reference panel. You can include an arbitrary number of filtering strings after the -filt_rules_l option, in which case the filtering conditions will be applied in 'or' fashion: if any condition is true, the variant will be removed. It is very important that you enclose each filtering string in single quotes, as shown above. Otherwise, the command-line environment may interpret symbols like < and > as Linux redirection operators. There should be no white space within the single quotes. You can develop annotations yourself and add them to the -l file, or you can use the annotations that we provide in some of our reference download packages. For example, we have included continent-level minor allele frequencies in the legend files for the 1,000 Genomes Phase 1 integrated variant reference panel. For an illustration of using -filt_rules_l in practice, see this example command. |
-exclude_snps_g | none | List of SNPs to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g file), their rsIDs (second column of -g file), or their base pair positions (third column of -g file). Excluded SNPs will be treated as if they had not been present in the genotypes file, and they will not be shown in the output unless you use the -impute_excluded option. |
-exclude_snps_g_ref | none | Same as -exclude_snps_g, but applies to the -g_ref file. |
-impute_excluded | | Specifies that SNPs excluded from the study dataset via the -exclude_snps_g option should be imputed and included in the output file. When this flag is not activated, excluded SNPs are simply ignored. |
-include_snps | none | List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the reference SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g_ref file), their rsIDs (second column of -g_ref file or first column of -l file), or their base pair positions (third column of -g_ref file or second column of -l file). This option does not have any effect on SNPs in the -g file. |
-sample_g | none | File of sample IDs for the individuals in the -g file; should follow the format described here. Only the first two columns are necessary, but they must be present and labeled "ID_1" and "ID_2". NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the -exclude_samples_g option, or if you are analyzing chromosome X data via the -chrX option. |
-sample_g_ref | none | Same as -sample_g, but applies to the -g_ref file. |
-exclude_samples_g | none | List of samples to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The samples can be identified by the IDs in either of the first two columns of the -sample_g file, which is REQUIRED if you want to use this option. Excluded samples will be treated as if they had not been present in the genotypes file, and the program will re-print the original sample list, minus the excluded samples, to a file named "[-o]_samples", where -o is the name of the main output file. NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option. |
-exclude_samples_g_ref | none | Same as -exclude_samples_g, but applies to the -g_ref file. One difference is that the program will not print a filtered list of -g_ref samples like the one that gets printed with -exclude_samples_g. |
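
The 'or' semantics of -filt_rules_l can be illustrated with a short sketch. This is not the program's parser; `parse_rule`, `should_remove`, and the dictionary representation of a legend row are assumptions made for the example. The rule strings follow the column-operator-cutoff form described in the table.

```python
import operator
import re

# Two-character operators are listed first so that '<=' is not read as '<'.
OPS = {
    "<=": operator.le, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
}
RULE = re.compile(r"^(\S+?)(<=|>=|==|!=|<|>)(\S+)$")


def parse_rule(text):
    """Split a rule such as 'eur.maf<0.05' into (column, comparison, cutoff)."""
    match = RULE.match(text)
    if match is None:
        raise ValueError(f"cannot parse filter rule: {text!r}")
    column, op, cutoff = match.groups()
    return column, OPS[op], float(cutoff)


def should_remove(variant, rules):
    """Return True if ANY rule is satisfied ('or' fashion)."""
    return any(op(float(variant[col]), cutoff) for col, op, cutoff in rules)


rules = [parse_rule("eur.maf<0.05"), parse_rule("afr.maf<0.05")]
# Hypothetical legend rows, keyed by annotation column name.
common = {"eur.maf": "0.20", "afr.maf": "0.30"}
rare_in_africa = {"eur.maf": "0.10", "afr.maf": "0.01"}
print(should_remove(common, rules), should_remove(rare_in_africa, rules))
```

Note that with two rules a variant is removed as soon as either frequency falls below its cutoff, which is stricter than requiring both conditions to hold.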
MCMC options
IMPUTE2 uses an MCMC algorithm to integrate over the space of possible phase reconstructions for observed genotypes. The options in this table control the algorithm.
Flag | Default | Description |
---|---|---|
-iter | 30 | Total number of MCMC iterations to perform, including burn-in. Increasing the number of iterations may improve accuracy slightly, although increasing -k generally leads to greater improvements for a fixed computational cost. |
-burnin | 10 | Number of MCMC iterations to discard as burn-in. The algorithm samples new haplotypes for unphased individuals during each of the first [-burnin] iterations, but these iterations do not contribute to the final imputation probabilities. We have found that 10 burn-in iterations is enough to ensure good results in a variety of different datasets. |
-k | 80 | Number of haplotypes (in the reference or study data) to use as templates when phasing observed genotypes. Increasing this value will lead to higher accuracy at the cost of longer running times, which scale quadratically with -k. The default value should be sufficient for most analyses. |
-k_hap | 500 | Number of reference haplotypes to use as templates when imputing missing genotypes. As a rule of thumb, you should set -k_hap to the number of reference haplotypes that you expect to be useful for your study population. If this value is less than the total number of haplotypes in your reference panel, IMPUTE2 will choose a "custom" set of -k_hap haplotypes each time it imputes missing alleles in a study haplotype. If all of your reference haplotypes have similar ancestry to the subjects in your study, each haplotype is potentially useful for imputation, so the best accuracy can be achieved by setting -k_hap to the total number of reference haplotypes. Using smaller values will decrease the running time linearly while incurring a slight loss of accuracy. Conversely, we now recommend running IMPUTE2 with large reference panels containing haplotypes of diverse ancestry. (For more details, see here.) In this context, our rule of thumb suggests setting -k_hap to be smaller than the total size of the reference panel. Imputation accuracy is robust to different values of -k_hap within a sensible range, so it should usually be sufficient to choose a value by intuition. When in doubt, we suggest that you err on the side of making -k_hap too large, since we often find that diverse reference panels contain more useful haplotypes than one might expect. As of software version 2.3.0, -k_hap can accept two values when you are imputing from two reference panels, for example '-k_hap 500 200'. In this context, the first value is the number of haplotypes to be chosen from Panel 0 and the second value is the number to be chosen from Panel 1. This flexibility can be useful when merging reference panels. |
Pre-phasing options
You can greatly speed up your imputation through a process called "pre-phasing". The idea of this approach is to first phase your GWAS genotypes, then use the estimated GWAS haplotypes to impute untyped variants from a reference panel. The options in this table activate the corresponding functionality in IMPUTE2. You can see how these options are applied in this example command.
Flag | Default | Description |
---|---|---|
-prephase_g | | Tells IMPUTE2 to phase the genotypes in the -g file. The estimated haplotypes are printed to a dedicated output file named "[-o]_haps", where [-o] is the name supplied for the main output file. To avoid edge effects in downstream imputation, IMPUTE2 will extend the estimated haplotypes into the buffer regions that flank the main region specified via -int. |
-use_prephased_g | | Tells IMPUTE2 to perform imputation with pre-phased GWAS haplotypes, which must be supplied via a -known_haps_g file. This file will often be produced by a pre-phasing run that used -prephase_g on the same imputation interval (-int), although it may also come from a different phasing algorithm like SHAPEIT, which can print haplotypes in -known_haps_g format. We now recommend using SHAPEIT for pre-phasing and IMPUTE2 for downstream imputation. |
Panel merging options
These options allow IMPUTE2 to efficiently combine two reference panels typed on partially overlapping sets of variants.
Flag | Default | Description |
---|---|---|
-merge_ref_panels | | Tells the program to combine information across two reference panels using the approach described here. |
-merge_ref_panels_output_ref | none | Activates -merge_ref_panels and tells the program to store the merged panel in two output files: a legend file named .legend and a haplotype file named .hap. |
-merge_ref_panels_output_gen | none | Activates -merge_ref_panels and tells the program to store the merged panel in .gen format in an output file named .gen. |
NOTE: If you want IMPUTE2 to print a merged reference panel with buffer regions included, you should use one of the last two options together with the -include_buffer_in_output flag.
NOTE: You can see an example run that uses -merge_ref_panels here.
Chromosome X options
These options facilitate the analysis of genotype data from human chromosome X.
Flag | Default | Description |
---|---|---|
-chrX | | Specifies that this is an analysis of chromosome X data. This flag changes the model parameters by automatically reducing the -Ne value by 25%, and it allows the -g file to include a mixture of diploid females and hemizygous males. When using the -chrX option, it is essential to provide a -sample_g file with a column named 'sex', since this tells the program which individuals are males and which are females. More details on the file formats for chromosome X analysis are available here, and you can see an example run here. |
-Xpar | | Specifies that the current dataset comes from a pseudoautosomal region (PAR) of chromosome X, where both males and females are diploid. When used together with -chrX, this flag will reduce -Ne by 25% but otherwise run the analysis in the same way as on the autosomes. |
Expert options
The options in this table are meant for experts only. Don't use them unless you know what you are doing!
Flag | Default | Description |
---|---|---|
-seed | random | Initial seed for random number generator. The seed is set using the system clock unless it is manually overridden with this option. |
-no_warn | | Turns warnings off, so that the -w file does not get printed. |
-fill_holes | | Turns on the "hole-filling" function, which allows SNPs that are typed in the -g file but not in the lowest reference panel to contribute to the inference. |
-no_remove | | Prevents the program from discarding SNPs whose alleles cannot be aligned across panels. Such SNPs will be retained in the output, but they will not be used for inference. |
Best Practices for Imputation
IMPUTE2 includes a rich collection of functions for analyzing genetic datasets, but it is most commonly used to perform genotype imputation in genome-wide association studies. To help investigators perform this kind of analysis, we have condensed the information on this website into a list of current best practices.
Pre-imputation filtering of study genotypes
Before you perform an imputation run with your study genotypes, you should filter the data to remove low-quality variants and individuals, as these can degrade the accuracy of the final results. Standard GWAS quality control filters are usually sufficient to prepare a dataset for imputation. It may also help to add an imputation-based QC step to the filtering process; we will describe this approach in the near future.
Variant position matching across input files
When you provide IMPUTE2 with reference and study data, the program determines which variants are shared across datasets by looking at their positions on the chromosome (as opposed, say, to their rsIDs). If two or more variants have the same position (perhaps because one is a SNP and one is an overlapping INDEL), then these variants are matched across panels based on their allele labels.
It is important to note that genomic coordinates change every couple of years as the human genome reference sequence is updated, so a given SNP may have different positions in different datasets. In order to obtain high-quality results from IMPUTE2, you must make sure that the variant positions in your input files are mapped to the same coordinate system, or "assembly".
Genomic assemblies are typically identified by their NCBI build number (e.g., "b36" or "b37") or their UCSC version (e.g., "hg18" or "hg19"). Our reference data download section shows the assembly to which each reference panel is mapped. If your study genotypes come from a different assembly than your reference panel, you should map the positions in your data to the reference coordinate system by using a tool like the liftOver program from UCSC. If you need help with this step, please send a message to our mail list.
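The position-based matching described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the program's actual matching code: variants are represented as hypothetical (position, allele0, allele1) tuples, both datasets are assumed to be on the same assembly and strand, and allele labels are consulted only when several reference variants share a position.

```python
from collections import defaultdict


def match_variants(study, reference):
    """Pair study variants with reference variants.

    Each variant is a (position, allele0, allele1) tuple. A study variant
    is paired with the reference variant at the same position; if several
    reference variants share that position, the one with the same set of
    allele labels is chosen. Returns a list of (study_index, ref_index).
    """
    ref_by_pos = defaultdict(list)
    for j, (pos, _a0, _a1) in enumerate(reference):
        ref_by_pos[pos].append(j)

    matches = []
    for i, (pos, a0, a1) in enumerate(study):
        candidates = ref_by_pos.get(pos, [])
        if len(candidates) == 1:
            matches.append((i, candidates[0]))
            continue
        for j in candidates:
            if {a0, a1} == set(reference[j][1:]):
                matches.append((i, j))
                break
    return matches


# Made-up variants: position 200 holds both a SNP and an overlapping INDEL.
study = [(100, "A", "G"), (200, "A", "AT"), (300, "C", "T")]
reference = [(100, "A", "G"), (200, "A", "C"), (200, "A", "AT"), (400, "G", "T")]
print(match_variants(study, reference))
```

The sketch also shows why a mismatched assembly is harmful: a variant whose coordinate has shifted simply finds no partner (like position 300 here) or, worse, pairs with an unrelated variant at the old coordinate.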
Strand alignment between study and reference data
It is absolutely essential to align your study genotypes to the same strand convention as the reference panel from which you are imputing. Variants that are aligned to different strands may have different alleles (e.g., A/G in one dataset and T/C in another) or the same alleles at disparate frequencies (e.g., A/T in two datasets, where the 'A' allele occurs at 5% frequency in one dataset and 95% frequency in the other), and either of these scenarios can decrease imputation quality.
Most publicly available reference panels are aligned to the '+' strand of the human genome reference sequence, so the goal is to align your genotypes to the same convention. The best way to do this is to obtain assay information from the vendor who provided your genotypes; once you have this information, you can align your genotypes either manually or with the options described here. If you cannot recover the strand alignment from the original assay, you can use other options that tell IMPUTE2 to make educated guesses.
Choosing a reference panel
Historically, most GWAS investigators have tried to choose reference panels that match the ancestry of their study samples. We have developed a different approach: first supply IMPUTE2 with a worldwide reference panel, then let the program decide which haplotypes to use for imputation. This strategy can increase accuracy at low-frequency variants, and it avoids difficult choices about which haplotypes to include in the reference set. We currently recommend this approach for imputing genotypes in any human population. You can read our paper on this strategy here, learn about practical ways of applying it here, and download state-of-the-art reference haplotypes here.
If you have collected a custom reference panel for your study population (say, exome-wide or genome-wide sequencing data), you can combine it with the 1,000 Genomes data to maximize accuracy and genomic coverage at the same time. To learn how IMPUTE2 does this, see here.
Genome-wide imputation
It can be complicated and computationally demanding to impute thousands of individuals across the entire genome. We provide a few mechanisms to help with this process:
- IMPUTE2 includes command-line parameters that can be used to split the genome into discrete chunks for parallel analysis on a computing cluster. These parameters allow flexible partitioning of the genome with minimal manipulation of input files. See here for suggestions on how to use this functionality.
- IMPUTE2 is an efficient imputation method, but it still requires substantial computing time to process the whole genome in a large number of individuals. We have recently developed an approach called "pre-phasing" that greatly reduces the computational burden of imputation while sacrificing only a little accuracy; you can read more about the approach here. We now recommend this as the standard way of performing genome-wide imputation, although we still prefer the original IMPUTE2 MCMC algorithm for maximizing accuracy in smaller regions.
- Sequence-based reference panels contain large numbers of rare and low-frequency variants, which can drive up the computational cost of imputation. When computing power is limited, it may be desirable to remove some of these variants (e.g., those with very low frequencies in the population of interest) before running imputation. To facilitate this process, we have added the -filt_rules_l option, which can flexibly remove reference variants based on command-line input to an IMPUTE2 run. You can see an example application of this approach and some guidelines for using it here.
Post-imputation filtering
It is standard practice to perform additional filtering once a batch of imputation runs has completed, mainly to remove poorly imputed variants that might behave badly in association tests. We are currently preparing some recommendations for this process; we will post them on the website as soon as they are ready.
Association testing
We distribute a program called SNPTEST that contains a powerful suite of statistical tests for association between phenotypes and imputed genotypes. You can download the software and read more about its functions at the SNPTEST website.
Follow-up imputation of putative associations
Once you have performed genome-wide imputation and association testing, you may want to take a closer look at regions with interesting associations. To get the best possible results, we recommend re-imputing this subset of regions with more intensive program settings:
- In contrast to the pre-phasing approach that we recommend for genome-wide imputation, we suggest using the standard IMPUTE2 MCMC algorithm for follow-up imputation. This method takes longer to run in each region, but it should lead to slightly higher accuracy (especially at low-frequency variants) and remain computationally feasible when run on a limited portion of the genome.
- If time permits, the overall accuracy may be improved by increasing the value of the -k parameter.
- If time permits, the accuracy at low-frequency variants may be improved by increasing the size of the -buffer region, say from the default value of 250 kb to 1000 kb (1 Mb).
Once you have re-imputed each region of interest, you should perform the association tests again to obtain a high-resolution estimate of the association landscape.
Pre-Phasing GWAS
Improvements in sequencing and genotyping technologies have rapidly increased the amount of reference data that can be used to impute untyped SNPs in association studies. Larger reference panels improve the power and resolution of imputation-based association mapping, but they also increase the computational burden of imputation. To help offset this cost, we have developed an extension of the IMPUTE2 methodology.
The basic idea is to "pre-phase" your study genotypes to produce best-guess haplotypes, then impute into these estimated haplotypes in a separate program run. By contrast, the original IMPUTE2 method integrates over the unknown phase of your study data during the course of an imputation analysis. Pre-phasing leads to a small loss of accuracy since the estimation uncertainty in the study haplotypes is ignored, but this allows for very fast imputation. This speedup is especially important because modern reference collections (such as those from the 1,000 Genomes Project) are frequently updated and expanded, so that many investigators would benefit from "re-imputing" their datasets following each reference panel update. The pre-phasing step needs to be performed just once per study dataset, so re-imputing is computationally cheap.
For these reasons, we now recommend pre-phasing as the standard approach for genotype imputation in genome-wide association studies, with the original IMPUTE2 algorithm reserved for maximizing accuracy in more targeted analyses. Pre-phasing is implemented through three program options: -prephase_g, -use_prephased_g, and -known_haps_g. The best way to learn how to use this approach is by example.
We recommend performing the pre-phasing step with an accurate phasing method called SHAPEIT2 (details here and here), then imputing into the estimated GWAS haplotypes with IMPUTE2.
If you use this functionality in your study, please remember to cite our article about pre-phasing in GWAS and the original IMPUTE2 article.
Analyzing Whole Chromosomes
In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the real computing time and limiting the amount of memory needed for each run.
We therefore recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.
The -int parameter provides an easy way to break a chromosome into smaller chunks for analysis by IMPUTE2. For example, if we wanted to split a chromosome into 5-Mb regions for analysis, we could specify "-int 1 5000000" for the first run of the algorithm, "-int 5000001 10000000" for the second run, and so on, all without changing the input files. IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)
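The interval arithmetic described above can be sketched in a few lines of Python. The chromosome length used here is a made-up value for illustration; the function simply lists adjacent, non-overlapping intervals that could be passed to -int.

```python
def chunk_intervals(chrom_length, chunk_size=5_000_000):
    """Return adjacent, non-overlapping (start, end) pairs covering 1..chrom_length."""
    intervals = []
    start = 1
    while start <= chrom_length:
        # The last chunk is truncated at the end of the chromosome.
        end = min(start + chunk_size - 1, chrom_length)
        intervals.append((start, end))
        start = end + 1
    return intervals


# Hypothetical 12-Mb chromosome, split into 5-Mb chunks.
for start, end in chunk_intervals(12_000_000):
    print(f"-int {start} {end}")
```

Because the intervals are adjacent and the buffer is handled internally, no overlap needs to be built into the intervals themselves.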
Once you have split a chromosome into multiple chunks and imputed them separately, the IMPUTE2 output format makes it easy to synthesize your results into a single whole-chromosome file. On Linux-based systems, you can simply type a command like this:
cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2
Here, "chr16_chunkX.impute2" is an output file for one chunk of chromosome 16, and "chr16_chunkAll.impute2" is a combined output file that contains results for the entire chromosome. (Note that chr16 would typically need to be split into more than three chunks to satisfy the approximation used by IMPUTE2.)
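With many chunks, a shell wildcard would sort "chunk10" before "chunk2", which puts the combined file out of positional order. The Python sketch below is one way to concatenate chunks in numeric order. It assumes file names follow the hypothetical "chr16_chunkX.impute2" pattern used above, with X an integer.

```python
import os
import re
import shutil


def chunk_number(path):
    """Extract the integer after 'chunk' in a name like chr16_chunk12.impute2."""
    match = re.search(r"chunk(\d+)", os.path.basename(path))
    if match is None:
        raise ValueError(f"no chunk number in {path!r}")
    return int(match.group(1))


def concatenate_chunks(chunk_paths, combined_path):
    """Append chunk files to combined_path in numeric chunk order."""
    ordered = sorted(chunk_paths, key=chunk_number)
    with open(combined_path, "wb") as out:
        for path in ordered:
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return ordered
```

The function returns the order it used, so you can check it against the intervals you imputed.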
Merging Reference Panels
Problem statement
Modern genotyping and sequencing technologies are generating a variety of reference datasets that can be used for genotype imputation in association studies. Combining reference panels from different populations can often improve imputation accuracy (e.g., see Howie et al. 2011), but it is not clear how best to merge panels that are genotyped at different sets of variants.
Howie et al. 2009 proposed a solution for the special case where one reference panel contains a subset of the variants in another reference panel. We previously released a combined 1,000 Genomes + HapMap 3 panel that takes advantage of this framework, and it was also used in the WTCCC2 studies.
Many association studies are now using the latest 1,000 Genomes data to drive their genotype imputation, but they may also have sequenced additional individuals from the population being studied. It makes sense to combine these resources in order to use all available reference information, but in this case each reference panel will contain many variants that are not found in the other; that is, the "hierarchical" variant framework of Howie et al. 2009 no longer applies.
With this in mind, we have devised a new strategy for combining reference panels created by different sequencing or genotyping studies.
Our approach
There are many possible ways to merge two reference panels. We are exploring several of these options, but we decided to start with the simple approach depicted in the figure below. The top panel of this figure shows two reference panels and a GWAS cohort; you can think of the rows as individuals and the columns as positions along the genome. Each vertical line represents a genotyped variant in a given panel, and each reference panel includes variants that are not found in the other.
We impute the untyped variants in this figure in three steps:
- Impute the variants that are specific to Panel 0 (red) into Panel 1 (blue). Variants shown in grey do not inform the imputation.
- Impute the variants that are specific to Panel 1 (blue) into Panel 0 (red). Variants shown in grey do not inform the imputation.
- Now that we have imputed the two reference panels up to the union of their variants, treat the imputed haplotypes as known (i.e., take the best-guess haplotypes) and impute the GWAS cohort in the usual way.
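The bookkeeping behind these three steps can be sketched with set operations. The Python snippet below only works out which variants would be imputed into which panel; it does not perform any imputation, and the positions are invented for illustration.

```python
def merge_plan(panel0_variants, panel1_variants):
    """List the variants handled in each merging step (bookkeeping only)."""
    panel0 = set(panel0_variants)
    panel1 = set(panel1_variants)
    return {
        # Step 1: variants specific to Panel 0 are imputed into Panel 1.
        "impute_into_panel1": sorted(panel0 - panel1),
        # Step 2: variants specific to Panel 1 are imputed into Panel 0.
        "impute_into_panel0": sorted(panel1 - panel0),
        # Step 3: the study cohort is imputed at the union of both panels.
        "merged_panel": sorted(panel0 | panel1),
    }


# Hypothetical variant positions in each panel.
plan = merge_plan([100, 200, 300], [200, 300, 400])
print(plan)
```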
This process can be performed with IMPUTE2 (version 2.3 and later) in a streamlined way: all you have to do is add the -merge_ref_panels flag to the command line. You can see a working example command here.
Practical considerations
Using pre-phased study data
The -merge_ref_panels flag works with both unphased study genotypes (-g file) and pre-phased study haplotypes (-known_haps_g file).
Parameter settings
For finer control of the merging step, you can supply two values to -k_hap on the command line, for example '-k_hap 500 200'. This setting tells IMPUTE2 to use 500 haplotypes from Panel 0 and 200 haplotypes from Panel 1. These values should reflect the number of haplotypes in each panel that you expect to be useful for imputation in the study population, which could be less than the total number if either panel is multi-ethnic.
Reference panel ordering
The order in which you supply the reference panels on the command line should not affect the accuracy of imputation from the merged panel: inside the program, the calculations are completely symmetric. One practical limitation is that only the first legend file in an IMPUTE2 command is allowed to have more than four columns. The 1,000 Genomes legend files we distribute typically have more than four columns, so if you are using these files it makes sense to provide the 1,000 Genomes panel before your other panel on the command line.
Printing the merged panel
By default, IMPUTE2 does not print the merged reference panel (the outcome of Steps 1 and 2 above); the merging is done internally, and the output shows only the imputed genotypes for the study cohort. If you want the program to output the merged panel, you can replace -merge_ref_panels with one of two options:
- -merge_ref_panels_output_ref: This option tells the software to merge the two reference panels and print the results in IMPUTE2 reference file format: one legend file and one haplotypes file. See the link for more information.
- -merge_ref_panels_output_gen: This option tells the software to merge the two reference panels and print the results in IMPUTE2 .gen file format. Phase information is ignored when creating this file, which can be useful if you want to re-phase the merged reference panel. See the link for more information.
If you want to merge two reference panels without imputing into a study dataset (i.e., to skip Step 3 above), you should use one of these two options and omit the study data (-g file or -known_haps_g file) from your IMPUTE2 command.
Normally, these options print the merged reference panel within the region specified by the -int argument. If you want to include the buffer regions in the output, you should add the -include_buffer_in_output flag to your command line statement.
Publication and citation
Our approach for merging reference panels has not yet been published outside this website. We have tested the method on realistic datasets, and it has performed well in all of our analyses. We are actively working to document our work on this approach and to compare it with other strategies; we aim to report the results of these experiments and the details of our methodology as soon as possible.
In the meantime, we are happy to answer thoughtful questions and to hear about your experiences with this new functionality. If you would like to send comments, please do so through our mail list.
Imputation Concordance Tables
What is a concordance table?
Every run of IMPUTE2 produces a concordance table, except under certain settings that are not commonly used. A concordance table shows the results of an internal cross-validation that the program performs automatically. For this analysis, IMPUTE2 masks the genotypes of one variant at a time in the study data (Panel 2), then imputes the masked genotypes with information from the reference data and nearby study variants. The imputed genotypes are then compared with the original genotypes to evaluate the quality of the imputation. The results are summarized in a table like the one below:
If you are interested in the results of this experiment at a given variant, you can find this information in the _info file printed by IMPUTE2. The concord_typeX column shows the concordance between input genotypes and best-guess imputed genotypes at each variant, while the r2_typeX column gives the squared correlation between input genotypes and expected genotypes (or "dosages") from imputation. Note that the cross-validation cannot be performed at variants that were not provided in a Panel 2 input file (-g or -known_haps_g), so reference-only variants are assigned values of -1 in the _info file. To learn more about the format of this output file, see here.
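To make the two per-variant metrics concrete, here is a minimal Python sketch. It assumes that the expected genotype (dosage) is P(G=1) + 2·P(G=2) and that r2 is the squared Pearson correlation. The exact formulas and tie-breaking rules used inside IMPUTE2 are not documented here, so treat this as an approximation of the idea rather than a reimplementation.

```python
def best_guess(probs):
    """Genotype (0, 1, or 2) with the highest posterior probability."""
    return max(range(3), key=lambda g: probs[g])


def dosage(probs):
    """Expected genotype under the posterior: P(G=1) + 2 * P(G=2)."""
    return probs[1] + 2.0 * probs[2]


def concordance(input_genotypes, imputed_probs):
    """Fraction of input genotypes that match the best-guess imputed genotype."""
    matches = sum(g == best_guess(p)
                  for g, p in zip(input_genotypes, imputed_probs))
    return matches / len(input_genotypes)


def squared_correlation(input_genotypes, imputed_probs):
    """Squared Pearson correlation between input genotypes and dosages."""
    x = [float(g) for g in input_genotypes]
    y = [dosage(p) for p in imputed_probs]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either side has no variation
    return sxy * sxy / (sxx * syy)
```

Concordance rewards only exact matches of the best guess, whereas r2 also gives credit for uncertain calls whose dosage tracks the true genotype.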
How are concordance tables made?
Only variants with input data from a -g or -known_haps_g file are masked and imputed in this analysis. When a -known_haps_g file is provided, all input genotypes are treated as being true. When a -g file is provided, we make hard genotype calls by applying a threshold (default = 0.9) to the maximum value in each input probability triple. For example, a genotype with P(G=0,1,2) = (0.03, 0.95, 0.02) would be called as a '1' (heterozygous), while a genotype with P(G=0,1,2) = (0.1, 0.7, 0.2) would be left uncalled and omitted from the concordance calculations.
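The thresholding rule can be written as a short Python function. Whether the comparison against the threshold is inclusive is an assumption made here; the two examples from the paragraph above give the same result either way.

```python
def hard_call(probs, threshold=0.9):
    """Return 0, 1, or 2 if the largest probability reaches the threshold, else None."""
    best = max(range(3), key=lambda g: probs[g])
    # Assumption: a probability exactly equal to the threshold counts as a call.
    return best if probs[best] >= threshold else None


print(hard_call((0.03, 0.95, 0.02)))  # called as heterozygous
print(hard_call((0.1, 0.7, 0.2)))     # left uncalled
```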
The genotype probabilities from imputation are used somewhat differently. In the first three columns of the table, we assign each imputed genotype to a bin (Interval) based on its maximum posterior probability. Then, for each bin we report the number of imputed genotypes that passed the calling threshold in the input data (#Genotypes). We then convert the imputed probabilities to 'best-guess' genotypes: for each posterior probability triple, we select the genotype with the highest value, regardless of magnitude. Finally, we compare the input genotype calls with the best-guess imputed genotypes and report the concordance (%Concordance) within each bin.
In the last three columns of the table, we again bin the imputed genotypes based on their maximum posterior probabilities, but this time the binning is cumulative: the bin at the bottom of the table includes only genotypes that were confidently imputed (max prob >= 0.9), while each bin above includes all genotypes that pass a more lenient certainty threshold. These thresholds are shown in the fourth column (Interval). The fifth column (%Called) shows the percentage of imputed genotypes that pass a given probability threshold, where the denominator is the total number of imputed genotypes for which hard calls are available in the input data. The sixth column (%Concordance) shows the percentage of imputed genotypes in a given bin that match the masked input genotypes.
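The two halves of the table can be sketched as follows. This Python example assumes ten equal-width bins on the maximum posterior probability (so the last bin starts at 0.9, as described above) and takes input calls that have already been thresholded, with None marking uncalled genotypes. It illustrates the bookkeeping and is not the program's own code.

```python
def concordance_table(input_calls, imputed_probs, n_bins=10):
    """Return (binned_rows, cumulative_rows) in the style of a concordance table."""
    counts = [0] * n_bins
    matches = [0] * n_bins
    for call, probs in zip(input_calls, imputed_probs):
        if call is None:
            continue  # uncalled input genotypes are left out
        guess = max(range(3), key=lambda g: probs[g])
        # Bin by maximum posterior probability; 1.0 falls in the last bin.
        b = min(int(probs[guess] * n_bins), n_bins - 1)
        counts[b] += 1
        matches[b] += int(guess == call)

    total = sum(counts)
    binned, cumulative = [], []
    for b in range(n_bins):
        low, high = b / n_bins, (b + 1) / n_bins
        pct = 100 * matches[b] / counts[b] if counts[b] else None
        binned.append((low, high, counts[b], pct))      # Interval, #Genotypes, %Concordance

        cum_count, cum_match = sum(counts[b:]), sum(matches[b:])
        pct_called = 100 * cum_count / total if total else None
        pct_conc = 100 * cum_match / cum_count if cum_count else None
        cumulative.append((low, pct_called, pct_conc))  # Interval, %Called, %Concordance
    return binned, cumulative
```

In this layout, the first cumulative row includes every called genotype, so its %Concordance is the overall concordance of the cross-validation.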
What can I do with a concordance table?
We can learn a couple of things from this kind of analysis. First, the results can alert us to problems in the imputation: if the concordance between imputed and input genotypes is abnormally low, it may indicate that something went wrong in the analysis or input files. A useful summary statistic is the number in the upper right-hand corner of the table, which gives the overall concordance from the cross-validation. This number should typically be around 95%; it may be lower in certain populations or regions of the genome, but if it is much lower then you may need to double-check the analysis. If you are worried about your results, please send a message with details of your analysis (including a _summary output file from IMPUTE2) to our mail list.
Concordance tables can also be used to predict the general quality of imputed genotypes at SNPs where we do not know the true genotypes. SNPs on GWAS microarrays tend to be easier to impute than untyped SNPs of the same frequency, so the cross-validation results may be somewhat optimistic, but they are often useful for relative comparisons (say, between different parameter settings of IMPUTE2).
Finally, the per-variant results of the cross-validation in the output _info file can help identify poorly genotyped SNPs and strand flips. For example, an input SNP that has a low concord_typeX value (implying that the imputed genotypes do not agree with the original genotypes) and a high info_typeX value (implying that the imputation is confident) might be worth investigating or removing from subsequent imputation runs.
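A filter along these lines might look like the Python sketch below. The cutoffs (0.8 and 0.9) are arbitrary illustrative values, not recommendations, and the "rs_id" column name is an assumption about the _info header that you should check against your own file. Rows are expected as dictionaries keyed by column name, for example from csv.DictReader with delimiter=" " if your file is space-delimited with a header line.

```python
def flag_suspect_variants(info_rows, max_concord=0.8, min_info=0.9, panel_type=0):
    """Return IDs of typed variants with low concordance but confident imputation."""
    concord_key = f"concord_type{panel_type}"
    info_key = f"info_type{panel_type}"
    flagged = []
    for row in info_rows:
        concord = float(row[concord_key])
        info = float(row[info_key])
        # Values of -1 mark reference-only variants with no cross-validation.
        if concord < 0 or info < 0:
            continue
        if concord <= max_concord and info >= min_info:
            flagged.append(row["rs_id"])  # assumed column name
    return flagged


# Hypothetical rows, for illustration only.
rows = [
    {"rs_id": "rsA", "info_type0": "0.98", "concord_type0": "0.55"},
    {"rs_id": "rsB", "info_type0": "0.97", "concord_type0": "0.99"},
    {"rs_id": "rsC", "info_type0": "-1", "concord_type0": "-1"},
]
print(flag_suspect_variants(rows))
```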
Multiple reference panels
If you provide two reference panels to IMPUTE2, the program will perform the cross-validation in two different ways. First it will use only a single reference panel (Panel 0) to mimic Type 0 SNPs, and then it will use both reference panels together (Panels 0 and 1) to mimic Type 1 SNPs. In this case, IMPUTE2 will print two concordance tables, one for each type of reference SNP. Note that the same masked study genotypes are used to evaluate accuracy in both cases; the only difference is how much reference data we allow the program to see when imputing the masked genotypes.
Where can I find the concordance table?
The concordance table is printed at the end of an IMPUTE2 run. One copy is printed to STDOUT, and another copy is printed in the _summary output file.
Scripts
The following scripts are designed to help with various parts of an IMPUTE2 analysis. We provide them in the hope that they will be useful, but we do not offer software support for them, and we cannot guarantee that they will work on your data due to inconsistencies in file formats, assumptions, etc. If you want to use one of these scripts, we suggest that you first read through the code to understand how it works.
All of these scripts are released under the GNU General Public License. Each script will print a list of command line options if you run it with no arguments.
Script name | Function |
---|---|
vcf2impute_legend_haps.pl | Convert a phased VCF file into reference panel format: one legend file and one haplotypes file. |
vcf2impute_gen.pl | Convert a phased or unphased VCF file into genotype file format (.gen). |
FAQ
Our FAQ has moved to this Google document.
References
[1] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913 [Free Access PDF] [Supplementary Material] [News and Views Article]
[2] B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]
[3] J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11: 499-511 [Restricted Access PDF] [Supplementary Material]
[4] B. Howie, J. Marchini, and M. Stephens (2011) Genotype imputation with thousands of genomes. G3: Genes, Genomics, Genetics 1(6): 457-470 [Open Access Article] [Supplementary Material]
[5] B. Howie, C. Fuchsberger, M. Stephens, J. Marchini, and G. R. Abecasis (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 44(8): 955-959 [Restricted Access PDF]
Contributors
The following people developed the methodology and software for IMPUTE2:
Mail List
If you have a question about IMPUTE2, please send a message to our mailing list:
http://www.jiscmail.ac.uk/OXSTATGEN
You will need to subscribe to the mailing list to post a question. The list has low but steady traffic, so you may want to redirect the messages to a dedicated e-mail folder if you don't want them all landing in your inbox.
IMPORTANT: If you are having a problem with the software, please include the following details in your e-mail; otherwise, we may not be able to diagnose the problem.
- The version number of IMPUTE2 and the type of computer you are using to run it (e.g., "IMPUTE v2.2.2 on Mac OSX 10.6").
- Any log files and/or screen output from the program; e.g., the "_summary" output file.
- For difficult problems like memory access errors (e.g., "segmentation faults"), we may need you to send data files that show the problem. These files should ideally be small, and we can provide suggestions if you are not allowed to share your actual data.