An Arabidopsis example of association mapping in structured samples

Keyan Zhao et al. PLoS Genet. 2007.

Abstract

A potentially serious disadvantage of association mapping is the fact that marker-trait associations may arise from confounding population structure as well as from linkage to causative polymorphisms. Using genome-wide marker data, we have previously demonstrated that the problem can be severe in a global sample of 95 Arabidopsis thaliana accessions, and that established methods for controlling for population structure are generally insufficient. Here, we use the same sample together with a number of flowering-related phenotypes and data-perturbation simulations to evaluate a wider range of methods for controlling for population structure. We find that, in terms of reducing the false-positive rate while maintaining statistical power, a recently introduced mixed-model approach that takes genome-wide differences in relatedness into account via estimated pairwise kinship coefficients generally performs best. By combining the association results with results from linkage mapping in F2 crosses, we identify one previously known true positive and several promising new associations, but also demonstrate the existence of both false positives and false negatives. Our results illustrate the potential of genome-wide association scans as a tool for dissecting the genetics of natural variation, while at the same time highlighting the pitfalls. The importance of study design is clear; our study is severely under-powered both in terms of sample size and marker density. Our results also provide a striking demonstration of confounding by population structure. While statistical methods can be used to ameliorate this problem, they cannot always be effective and are certainly not a substitute for independent evidence, such as that obtained via crosses or transgenic experiments. Ultimately, association mapping is a powerful tool for identifying a list of candidates that is short enough to permit further genetic study.
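For orientation, mixed models of the kind described above are often written in the following general form. This is a sketch of one common formulation, not a formula taken from this abstract: the symbols $X$, $\beta$, $Q$, $v$, $u$, $e$, $K$, $\sigma_g^2$, and $\sigma_e^2$ are assumptions introduced here for illustration, and the paper's exact parameterization may differ.

```latex
% y: vector of phenotypes across accessions
% X\beta: fixed effects, including the marker being tested
% Qv: fixed effects for population-structure covariates (assumed notation)
% u: random polygenic effect whose covariance follows the kinship matrix K
% e: independent residual error
\begin{align}
  y &= X\beta + Qv + u + e, \\
  \operatorname{Var}(u) &= 2K\sigma_g^2, \qquad
  \operatorname{Var}(e) = I\sigma_e^2 .
\end{align}
```

Under this reading, the kinship matrix $K$ lets related accessions share part of their phenotypic similarity through $u$, so that similarity is not attributed to the marker under test.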

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Summary of the Data Illustrating Strong Positive Correlations between Phenotypes and between Phenotypes and Genome-Wide Relatedness

The panel on the left gives the basic phenotypes used (see Table 1), with colors indicating relative values within each phenotype (white denotes missing data). The tree on the right shows a hierarchical clustering (UPGMA) of accessions based on their relative kinship (as measured by pairwise haplotype sharing). Colors of the accession labels indicate geographic origins.

Figure 2. The Cumulative Distribution of _p_-Values in Genome-Wide Scans for All Phenotypes

The different curves correspond to different approaches for correcting for population structure (described in Table 2). Without correction for population structure, all distributions are strongly skewed towards significance. The results shown here are for associations between phenotype and individual SNPs; results for haplotypes were very similar (see Figure S2). cdf, cumulative distribution function.
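The diagnostic behind this figure can be sketched as follows: if there were no confounding and few true associations, genome-wide p-values would be roughly uniform, so their cumulative distribution would follow the diagonal. The Python below is a minimal illustration under that assumption; the function names `pvalue_cdf` and `max_excess_over_uniform` are hypothetical and do not come from the paper.

```python
from bisect import bisect_right


def pvalue_cdf(pvalues, grid):
    """Empirical CDF: fraction of p-values <= each grid point."""
    p = sorted(pvalues)
    n = len(p)
    return [bisect_right(p, g) / n for g in grid]


def max_excess_over_uniform(pvalues, steps=100):
    """Largest amount by which the empirical CDF rises above the diagonal.

    Values near 0 are consistent with uniform p-values; large positive
    values indicate a distribution skewed towards significance.
    """
    grid = [i / steps for i in range(steps + 1)]
    cdf = pvalue_cdf(pvalues, grid)
    return max(c - g for c, g in zip(cdf, grid))


if __name__ == "__main__":
    # Evenly spaced p-values stand in for a well-calibrated scan.
    calibrated = [(i + 0.5) / 1000 for i in range(1000)]
    # Cubing pushes the same values towards 0, mimicking inflation.
    inflated = [p ** 3 for p in calibrated]
    print(max_excess_over_uniform(calibrated))
    print(max_excess_over_uniform(inflated))
```

Comparing this summary across correction methods gives a single-number analogue of the curves: the closer to zero, the closer the scan is to the uniform expectation.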

Figure 3. Power of the Different Methods to Detect a QTN

The power of the different methods to detect a QTN using a 5% significance level (based on the observed distribution of _p_-values), as a function of the fraction of the phenotypic variation attributable to the QTN. (A) Power averaged across all simulated QTN. Methods that reduce confounding more effectively have better power. (B) Power calculated separately for QTN that showed strong correlation with population structure (“high PS,” arbitrarily defined as simulations where more than 50% of the phenotypic variation could be explained by Q; note that “P” performs better than “Q” largely because of this definition) and those that did not (“low PS”). All methods are relatively powerless to detect QTN that are strongly correlated with population structure. Results for haplotypes were similar (Figure S3).
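One way to read "a 5% significance level based on the observed distribution of p-values" is sketched below: take the p-value that cuts off the most significant 5% of the observed genome-wide scan as the threshold, then count how often a simulated QTN falls at or below it. This is an interpretation for illustration only; the caption does not spell out the procedure, and the names `empirical_threshold` and `power` are hypothetical.

```python
import math


def empirical_threshold(observed_pvalues, level=0.05):
    """p-value cutoff below which the given fraction of observed p-values fall."""
    p = sorted(observed_pvalues)
    # Small epsilon guards against floating-point error in level * n.
    k = math.floor(level * len(p) + 1e-9)
    if k < 1:
        raise ValueError("too few p-values for the requested level")
    return p[k - 1]


def power(qtn_pvalues, threshold):
    """Fraction of simulated QTN whose p-value is at or below the threshold."""
    hits = sum(1 for p in qtn_pvalues if p <= threshold)
    return hits / len(qtn_pvalues)


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions.
    observed = [i / 100 for i in range(1, 101)]
    cutoff = empirical_threshold(observed, level=0.05)
    print(cutoff, power([0.001, 0.04, 0.2, 0.6], cutoff))
```

Using an empirical cutoff rather than the nominal one means each method is judged against its own p-value distribution, so a method with inflated p-values gains no artificial advantage in power.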

Figure 4. Power to Detect Associations Using a Nominal 0.1% Significance Level

The QTN is assumed to lie within a sequenced haplotype, but power is nonetheless poor even for major QTL.

Figure 5. Comparison of Association and Linkage Mapping Results for the Phenotype JIC4W

The positions of candidate genes with higher marker density are highlighted in yellow. (A) The QTL LOD scores for two crosses: Col-0 × Lov-1 (pink curve) and Col-0 × Edi-0 (cyan curve). The dashed grey line marks a genome-wide permutation significance level of 5%. (B) Negative log _p_-values from a genome-wide scan using the Q + K model. (C) Negative log _p_-values from a genome-wide scan using a naive Kruskal-Wallis approach. (B and C) The dashed grey line marks a nominal significance level of 0.1%, and the colored stars denote associations that should correspond to QTL in the cross displayed using the same color (because the appropriate alleles are segregating; see text).
