Ascertainment bias in studies of human genome-wide polymorphism

Comparative Study

Genome Res. 2005 Nov;15(11):1496-502.

doi: 10.1101/gr.4107905.

Andrew G Clark et al. Genome Res. 2005 Nov.

Abstract

Large-scale SNP genotyping studies rely on an initial assessment of nucleotide variation to identify sites in the DNA sequence that harbor variation among individuals. This "SNP discovery" sample may be quite variable in size and composition, and it has been well established that properties of the SNPs that are found are influenced by the discovery sampling effort. The International HapMap project relied on nearly any piece of information available to identify SNPs (including BAC end sequences, shotgun reads, and differences between public and private sequences) and even made use of chimpanzee data to confirm human sequence differences. In addition, the ascertainment criteria shifted from using only SNPs that had been validated in population samples, to double-hit SNPs, to finally accepting SNPs that were singletons in small discovery samples. In contrast, Perlegen's primary discovery was a resequencing-by-hybridization effort using the 24 people of diverse origin in the Polymorphism Discovery Resource. Here we take these two data sets and contrast two basic summary statistics, heterozygosity and F_ST, as well as the site frequency spectra, for 500-kb windows spanning the genome. The magnitude of disparity between these samples in these measures of variability indicates that population genetic analysis on the raw genotype data is ill-advised. Given the knowledge of the discovery samples, we perform an ascertainment correction and show how the post-correction data are more consistent across these studies. However, discrepancies persist, suggesting that the heterogeneity in the SNP discovery process of the HapMap project resulted in a data set resistant to complete ascertainment correction. Ascertainment bias will likely erode the power of tests of association between SNPs and complex disorders, but the effect will likely be small, and perhaps more importantly, it is unlikely that the bias will introduce false-positive inferences.
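The two summary statistics named in the abstract can be illustrated for a single biallelic SNP. The following is a minimal sketch, assuming Nei's textbook definitions with equally weighted populations; the abstract does not state which estimator the authors used, and the function names are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1-p) for a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def heterozygosity_and_fst(p1, p2):
    """Return (H_S, H_T, F_ST) for one SNP typed in two populations.

    Assumption: populations are weighted equally and
    F_ST = (H_T - H_S) / H_T (Nei's definition), which may differ
    from the estimator used in the paper.
    """
    # H_S: mean within-population heterozygosity
    h_s = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    # H_T: heterozygosity of the pooled population
    p_bar = (p1 + p2) / 2.0
    h_t = expected_heterozygosity(p_bar)
    if h_t == 0.0:
        # Site is monomorphic in the pooled sample; F_ST is undefined, report 0.
        return h_s, h_t, 0.0
    return h_s, h_t, (h_t - h_s) / h_t


# Example: allele frequency 0.2 in one population and 0.6 in the other
h_s, h_t, fst = heterozygosity_and_fst(0.2, 0.6)
print(round(h_s, 3), round(h_t, 3), round(fst, 3))
```

A window-level value would then be obtained by combining such per-SNP quantities over all SNPs in a 500-kb window; how the paper averages within windows is not stated in the abstract.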

Figures

Figure 1.

Site frequency spectra for the fully resequenced NIEHS gene set, for the Perlegen sequencing-by-hybridization SNP ascertainment set, and for the set of SNPs that the International HapMap consortium genotyped, all contrasted to the neutral expectation (given estimates of the sample θ). Note the marked absence of rare SNPs and oversampling of SNPs of intermediate frequency in the HapMap sample.
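The neutral expectation referred to in the caption can be sketched as follows. Under the standard neutral model, the expected number of sites at which the derived allele appears i times in a sample of n chromosomes is θ/i. This is an illustrative sketch of the unfolded spectrum only; whether the figure uses a folded or unfolded spectrum is not stated here, and the function names are hypothetical.

```python
def neutral_sfs(n, theta=1.0):
    """Expected counts of sites with derived-allele count i = 1..n-1 (theta / i)."""
    return [theta / i for i in range(1, n)]


def neutral_sfs_proportions(n):
    """Expected proportion of segregating sites in each frequency class.

    theta cancels when normalizing, so only the sample size n matters.
    """
    counts = neutral_sfs(n)
    total = sum(counts)  # equals the harmonic number a_n = sum_{i<n} 1/i when theta = 1
    return [c / total for c in counts]


# Example: a sample of 10 chromosomes; singletons are the largest class
for i, prop in enumerate(neutral_sfs_proportions(10), start=1):
    print(i, round(prop, 3))
```

The excess of low-frequency classes in this expectation is what makes the shortage of rare SNPs in an ascertained panel stand out.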

Figure 2.

(Top) Distributions of uncorrected H_S (within-population heterozygosity) for the HapMap and the Perlegen data across 5682 windows of 500 kb spanning the entire human genome. Commensurate with the upward skew to the site frequency spectrum, the HapMap data have higher heterozygosity. (Bottom) After correction for ascertainment bias, the distributions of heterozygosity are more comparable; however, the ascertainment correction appears to have inflated the variance among windows in H_S.

Figure 3.

Scatterplot of uncorrected H_T for the HapMap data (_x_-axis) and the Perlegen data (_y_-axis). Each circle represents a 500-kb window, and the plot depicts the entire HapMap and Perlegen genome-wide samples.

Figure 4.

Distributions of F_ST between European and Chinese samples for ascertainment-corrected 500-kb windows of the HapMap data (top) and the Perlegen data (bottom).

Figure 5.

Scatterplot of F_ST between European and Chinese samples for ascertainment-corrected 500-kb windows of the HapMap data vs. the Perlegen data.

Figure 6.

Uncorrected (top) and ascertainment-corrected (bottom) site frequency spectra for the HapMap data (red) and the Perlegen data (blue dashed line). The HapMap data seriously underrepresented the rare SNPs compared with Perlegen, and the ascertainment correction produced frequency spectra that were more similar (bottom).
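To illustrate how an ascertainment correction can shift a site frequency spectrum toward rare alleles, here is a minimal sketch under one simplifying assumption: the d discovery chromosomes are a random subset of the n genotyped chromosomes, and a SNP is ascertained only if it is variable in that subset. This is a generic textbook-style scheme, not the procedure used in the paper, which the abstract does not describe; all names are hypothetical.

```python
from math import comb


def ascertainment_prob(i, n, d):
    """P(site is variable among d discovery chromosomes | i of n carry the allele).

    Assumes the discovery chromosomes are drawn at random, without
    replacement, from the n genotyped chromosomes (hypergeometric sampling).
    """
    # The site is missed if all d discovery chromosomes carry the same allele.
    return 1.0 - (comb(i, d) + comb(n - i, d)) / comb(n, d)


def correct_sfs(observed, d):
    """Reweight observed counts for allele counts i = 1..n-1 and renormalize."""
    n = len(observed) + 1
    if not 2 <= d <= n:
        raise ValueError("discovery size d must satisfy 2 <= d <= n")
    weighted = [c / ascertainment_prob(i, n, d)
                for i, c in enumerate(observed, start=1)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("observed spectrum is empty")
    return [w / total for w in weighted]


# Example: a flat observed spectrum in n = 4 chromosomes, discovery in d = 2
print([round(x, 3) for x in correct_sfs([10, 10, 10], d=2)])
```

Rare classes are the least likely to be seen in a small discovery sample, so dividing by the ascertainment probability gives them the largest upward weight. A heterogeneous discovery process, as described for HapMap, has no single d, which is one reason such a correction can remain incomplete.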

Web site references

    1. http://egp.gs.washington.edu; NIEHS resequencing study.
    2. http://www.hapmap.org; International HapMap Project.
    3. http://genome.perlegen.com/browser/download.html; Perlegen Sciences Web site.
    4. http://www.hapmap.org/downloads/encode1.html; HapMap .subjects
