Mapping trait loci by use of inferred ancestral recombination graphs

Mark J Minichiello et al. Am J Hum Genet. 2006 Nov.

Abstract

Large-scale association studies are being undertaken with the hope of uncovering the genetic determinants of complex disease. We describe a computationally efficient method for inferring genealogies from population genotype data and show how these genealogies can be used to fine map disease loci and interpret association signals. These genealogies take the form of the ancestral recombination graph (ARG). The ARG defines a genealogical tree for each locus, and, as one moves along the chromosome, the topologies of consecutive trees shift according to the impact of historical recombination events. There are two stages to our analysis. First, we infer plausible ARGs, using a heuristic algorithm, which can handle unphased and missing data and is fast enough to be applied to large-scale studies. Second, we test the genealogical tree at each locus for a clustering of the disease cases beneath a branch, suggesting that a causative mutation occurred on that branch. Since the true ARG is unknown, we average this analysis over an ensemble of inferred ARGs. We have characterized the performance of our method across a wide range of simulated disease models. Compared with simpler tests, our method gives increased accuracy in positioning untyped causative loci and can also be used to estimate the frequencies of untyped causative alleles. We have applied our method to Ueda et al.'s association study of CTLA4 and Graves disease, showing how it can be used to dissect the association signal, giving potentially interesting results of allelic heterogeneity and interaction. Similar approaches analyzing an ensemble of ARGs inferred using our method may be applicable to many other problems of inference from population genotype data.
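The second stage of the analysis, testing a marginal tree for a clustering of cases beneath a branch, can be illustrated with a minimal sketch. It assumes a rooted tree stored as nested tuples, a dictionary marking each leaf as case or control, and a plain Pearson chi-square on the resulting 2×2 table. The function names and the choice of statistic are assumptions made for illustration; the abstract does not specify the statistic or how significance is assessed.

```python
def leaf_sets(tree):
    """Return one frozenset of leaf labels per branch of a rooted nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            leaves = frozenset([node])
        found.append(leaves)
        return leaves

    walk(tree)
    return found[:-1]  # drop the root: no branch sits above it


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def best_branch(tree, is_case):
    """Place a hypothetical mutation on each branch in turn; return (statistic, carriers)."""
    total_cases = sum(is_case.values())
    total_controls = len(is_case) - total_cases
    best = (0.0, frozenset())
    for carriers in leaf_sets(tree):
        case_carriers = sum(is_case[leaf] for leaf in carriers)
        control_carriers = len(carriers) - case_carriers
        stat = chi_square_2x2(
            case_carriers,
            control_carriers,
            total_cases - case_carriers,
            total_controls - control_carriers,
        )
        if stat > best[0]:  # ties keep the branch found first
            best = (stat, carriers)
    return best


# Toy marginal tree: c1..c3 are case chromosomes, u1..u3 are controls.
tree = ((("c1", "c2"), "c3"), ("u1", ("u2", "u3")))
is_case = {"c1": True, "c2": True, "c3": True, "u1": False, "u2": False, "u3": False}
statistic, carriers = best_branch(tree, is_case)
```

In this toy tree the branch above the three case chromosomes separates cases from controls perfectly. Averaging over an ensemble of inferred ARGs, as the abstract describes, would repeat this calculation on the marginal tree at the same position in each ARG.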

Figures

Figure 1.

The ARG. A, Example ARG for four chromosome sequences. The sequences label the leaves of the ARG and are written as strings of 0s and 1s (coding SNP alleles). Moving backward in time (up the ARG), one first encounters a mutation. A mutation is denoted by a black dot and a number specifying its marker position. The second event is a recombination between markers 2 and 3. As one works backward in time, this corresponds to splitting a lineage into two, with the alleles at positions 1 and 2 following the left lineage and the allele at position 3 following the right lineage. After this is a coalescence, merging two lineages into one, and so on, to the grand common ancestor. B, Marginal tree for the SNPs at positions 1 and 2. C, Marginal tree for the SNP at position 3. To test a marginal tree for disease association, mutations are dropped onto each of the branches in turn, defining hypothetical allelic states of the leaves, which can then be tested for statistical association with the phenotype. The black dot labeled “2” best segregates the cases (D) from the controls (U) and would be identified as the most likely causative mutation event. D–G, Logic behind the ARG inference algorithm. D, The two sequences have a shared tract over the region [formula image]. E, To coalesce over the tract region, we must add a recombination breakpoint to the right of it, that is, between positions 2 and 3. This results in two parent sequences. F, We let undefined material (denoted by ·) coalesce with anything. We can now coalesce the left recombination parent and the other sequence. G, We can add a mutation.
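The steps in panels D–F can be traced on a small example. The sketch below is an illustration under stated assumptions, not the published inference algorithm: sequences are strings of `0`, `1`, and `·`, the two example sequences are invented rather than taken from the figure, and the helper names (`longest_shared_tract`, `recombine`, `coalesce`) are hypothetical. In particular, picking the longest shared tract is an assumption; the caption does not say how tracts are chosen.

```python
UNDEF = "·"  # undefined material, written as in the caption


def alleles_match(x, y):
    """Undefined material matches anything; defined alleles must be equal."""
    return x == UNDEF or y == UNDEF or x == y


def longest_shared_tract(a, b):
    """Return (start, end) of the longest run of matching positions, end exclusive."""
    best = (0, 0)
    start = 0
    for i in range(len(a) + 1):
        if i == len(a) or not alleles_match(a[i], b[i]):
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i + 1
    return best


def recombine(seq, breakpoint):
    """Going backward in time, split seq into a left and a right parent."""
    left = seq[:breakpoint] + UNDEF * (len(seq) - breakpoint)
    right = UNDEF * breakpoint + seq[breakpoint:]
    return left, right


def coalesce(a, b):
    """Merge two sequences that match everywhere; undefined sites take the other allele."""
    if not all(alleles_match(x, y) for x, y in zip(a, b)):
        raise ValueError("sequences differ at a defined position")
    return "".join(y if x == UNDEF else x for x, y in zip(a, b))


first, second = "110", "111"                      # invented example sequences
tract = longest_shared_tract(first, second)       # markers 1 and 2 (indices 0 and 1)
left_parent, right_parent = recombine(second, tract[1])  # break between markers 2 and 3
ancestor = coalesce(first, left_parent)           # left parent merges with the other sequence
```

Here the breakpoint index 2 corresponds to the caption's "between positions 2 and 3", since the code counts positions from zero. The remaining lineage `··1` differs from the ancestor only at marker 3, which a later mutation event (panel G) would account for.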

Figure 2.

Analysis of a suite of case-control studies with disease parameters GRR(Aa)=2, GRR(AA)=3, _q_=0.04, and n_cc=2,000, sampled from the constant population with the full ascertainment tSNP set. A, Association structure for a simulated case-control study. [symbol] denotes the position of the (untyped) causative SNP. B, Probability of there being a significant association within an interval around the causative SNP. C and D, Markerwise P values at the marker closest to the causative SNP for each of the 50 studies. E, Cumulative distribution of distances between the association peak and the causative SNP. F, Distribution of estimated allele frequency.

Figure 3.

Marginal tree correlation versus _r_² LD for part of ENCODE region 7p15.2 from the phase I HapMap. Tree correlation is measured as the proportion of the _n_-3 nonequivalent, nonunary bipartitions of the leaves of each tree (defined by cutting branches) that are shared between trees at different positions.
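The tree-correlation measure in this caption can be sketched directly. The code below assumes binary rooted trees stored as nested tuples over the same set of comparable leaf labels; the function names are hypothetical. Each bipartition is represented by the side that excludes a fixed reference leaf, so that the two branches below the root, which define the same cut, are counted once.

```python
def leaf_set(tree):
    """Return the frozenset of leaf labels in a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the nonequivalent, nonunary bipartitions obtained by cutting branches."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # fixed leaf used to pick a canonical side
    splits = set()

    def walk(node):
        if not isinstance(node, tuple):
            return  # cutting a leaf branch gives a unary split
        for child in node:
            walk(child)
        side = leaf_set(node)
        other = all_leaves - side
        if len(side) >= 2 and len(other) >= 2:
            splits.add(other if reference in side else side)

    walk(tree)
    return splits


def tree_correlation(tree_a, tree_b):
    """Proportion of the n-3 bipartitions shared by two trees on the same leaves."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must have the same leaves")
    if len(leaves) <= 3:
        raise ValueError("need more than three leaves")
    shared = bipartitions(tree_a) & bipartitions(tree_b)
    return len(shared) / (len(leaves) - 3)


tree_a = ((("a", "b"), "c"), ("d", "e"))
tree_b = ((("a", "c"), "b"), ("d", "e"))
correlation = tree_correlation(tree_a, tree_b)
```

With five leaves each binary tree has two such bipartitions. The two example trees share only the split separating `d` and `e` from the rest, so half of the bipartitions are shared.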

Figure 4.

Power, localization, and interpretation for a range of disease models. Each point on the _X_-axis corresponds to a suite of 50 studies. Each of the disease parameters is varied between suites, whereas the other parameters are held at “default” values of GRR(Aa)=2, GRR(AA)=2×GRR(Aa)-1, _q_=0.04, and n_cc=2,000. All studies are sampled from the constant population with the full ascertainment tSNP set. A, Probability of an experimentwise significant signal within 100 kb of the causative SNP (calculated as the proportion of studies in each suite that meet this criterion). B, Probability that the association peak is within 100 kb of the causative SNP. C, Estimated causative-allele frequency versus true frequency _q_.
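To make the default disease parameters concrete, the sketch below computes the genotype frequencies expected among cases. It assumes Hardy-Weinberg proportions in the population and reads each GRR as the risk of that genotype relative to the aa genotype; both are assumptions for illustration, and the function names are hypothetical. The default GRR(AA)=2×GRR(Aa)-1 is taken from the caption.

```python
def genotype_frequencies(q, grr_het, grr_hom=None):
    """Return ((aa, Aa, AA) in cases, (aa, Aa, AA) in the population)."""
    if grr_hom is None:
        grr_hom = 2 * grr_het - 1  # default relation quoted in the caption
    # Hardy-Weinberg proportions for causative-allele frequency q (assumed)
    population = ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)
    relative_risks = (1.0, grr_het, grr_hom)
    weighted = [p * r for p, r in zip(population, relative_risks)]
    total = sum(weighted)  # mean risk relative to genotype aa
    cases = tuple(w / total for w in weighted)
    return cases, population


def allele_frequency(genotypes):
    """Frequency of allele A given (aa, Aa, AA) genotype frequencies."""
    _, het, hom = genotypes
    return hom + het / 2


cases, population = genotype_frequencies(q=0.04, grr_het=2)
case_freq = allele_frequency(cases)             # about 0.0756
population_freq = allele_frequency(population)  # 0.04, up to rounding
```

Under these assumptions the causative allele is enriched from 4% in the population to roughly 7.6% among cases, which gives a sense of the size of signal that the default model presents to the association test.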

Figure 5.

Localization for different data, populations, and tSNP models. A, Performance on a suite of case-control studies with GRR(Aa)=2, GRR(AA)=3, _q_=0.04, and n_cc=2,000, sampled from the constant population and with the full ascertainment tSNP set. Margarita is applied to this suite under three scenarios: when the data are phased, when they are unphased, and when they are phased but have 10% missing data. B, Performance on a suite of case-control studies sampled from the hot population (and with GRR(Aa)=2, GRR(AA)=3, _q_=0.04, and n_cc=2,000). Performance is compared using three different tSNP ascertainment schemes (described in the “Methods” section).

Figure 6.

Analysis of the CTLA4 data. A, Association structure of the region. B, Distribution of estimated causative-allele frequency, by use of marginal trees at CT60. C, Test for allelic heterogeneity, by calculation of the proportion of inferred marginal trees at each position for which a chromosome appears under the branch that best segregates the cases and controls. D, Association structure for a subset of the CTLA4 data—only those chromosomes with the protective CT60 allele.

Web Resources

    1. BARGEN, http://www.ebi.ac.uk/projects/BARGEN (for downloading the FREGENE simulator)
    2. Margarita, http://www.sanger.ac.uk/Software/analysis/margarita (for downloading the Java program Margarita, plus high-resolution versions of the figures from this article)
    3. Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for CTLA4 and Graves disease)
