The structure of common genetic variation in United States populations

Am J Hum Genet. 2007 Dec;81(6):1221-31.

doi: 10.1086/522239. Epub 2007 Oct 16.


Stephen L Guthery et al. Am J Hum Genet. 2007 Dec.

Abstract

The common-variant/common-disease model predicts that most risk alleles underlying complex health-related traits are common and, therefore, old and found in multiple populations, rather than being rare or population specific. Accordingly, there is widespread interest in assessing the population structure of common alleles. However, such assessments have been confounded by analysis of data sets with bias toward ascertainment of common alleles (e.g., HapMap and Perlegen) or in which a relatively small number of genes and/or populations were sampled. The aim of this study was to examine the structure of common variation ascertained in major U.S. populations, by resequencing the exons and flanking regions of 3,873 genes in 154 chromosomes from European, Latino/Hispanic, Asian, and African Americans generated by the Genaissance Resequencing Project. The frequency distributions of private and common single-nucleotide polymorphisms (SNPs) were measured, and the extent to which common SNPs were shared across populations was analyzed using several different estimators of population structure. Most SNPs that were common in one population were present in multiple populations, but SNPs common in one population were frequently not common in other populations. Moreover, SNPs that were common in two or more populations often differed significantly in frequency from one population to another, particularly in comparisons of African Americans versus other U.S. populations. These findings indicate that, even if the bulk of alleles underlying complex health-related traits are common SNPs, geographic ancestry might well be an important predictor of whether a person carries a risk allele.

Figures

Figure 1.

Summary statistics for data-reduction methods used to estimate population structure. A, Eigenvalues versus the number of principal components. B, Number of clusters plotted as a function of the pseudo F statistic obtained from the UPGMA algorithm. C, Number of clusters plotted as a function of Ln P(D) from STRUCTURE.

Figure 2.

Site-frequency distributions for SNP data from 3,873 genes. A, SNP site–frequency distribution for the total sample. Of a total 63,127 SNPs (black bars) in the data set, 39% (_n_=24,982) were singletons (red bar). B, Number and distribution of private SNPs in each population determined from the site-frequency distribution of the total sample. The majority of private SNPs were observed in seven or fewer chromosomes, illustrated by cumulative frequency (gray line). C and D, SNP site–frequency distribution for each population. E and F, SNP site–frequency distributions for African Americans versus non–African Americans.
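As a rough illustration of the quantities in this caption, the sketch below tallies a site-frequency distribution and identifies private SNPs from per-population allele counts. The function names, the input layout (one list of minor-allele counts per population, aligned by SNP), and the toy numbers are assumptions made for illustration; they are not taken from the paper's data or software.

```python
from collections import Counter


def site_frequency_spectrum(minor_allele_counts):
    """Map each minor-allele count to the number of SNPs observed at that count.

    Sites with a count of zero are not polymorphic in the sample and are skipped.
    """
    return Counter(c for c in minor_allele_counts if c > 0)


def private_snps(counts_by_population):
    """Return, for each population, the indices of SNPs seen only in that population.

    counts_by_population: dict mapping a population label to a list of
    minor-allele counts, one entry per SNP, aligned across populations.
    """
    names = list(counts_by_population)
    n_snps = len(counts_by_population[names[0]])
    private = {name: [] for name in names}
    for i in range(n_snps):
        carriers = [name for name in names if counts_by_population[name][i] > 0]
        if len(carriers) == 1:
            private[carriers[0]].append(i)
    return private


# Hypothetical toy data: four SNPs scored in three populations.
toy_counts = {
    "AfA": [1, 0, 3, 2],
    "EA": [0, 0, 1, 0],
    "AsA": [0, 1, 0, 0],
}
toy_totals = [sum(col) for col in zip(*toy_counts.values())]
toy_sfs = site_frequency_spectrum(toy_totals)
toy_private = private_snps(toy_counts)
```

In this toy example a singleton is a SNP whose total count is 1, so `toy_sfs[1]` gives the number of singletons, mirroring the red bar in panel A.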

Figure 3.

Distribution of common SNPs among Latino/Hispanic, African, Asian, and European Americans. A and B, The percentage of SNPs that are common (i.e., ⩾10%) in at least one population but are found in both populations (black bars) is high overall but varies from ∼74% to 96%. A modest percentage of common SNPs that are common in at least one population are absent in the other populations (gray bars). C and D, The percentage of common SNPs common in both populations (black bars) compared with SNPs common in only each population compared: African Americans (AfA) (blue bars), Asian Americans (AsA) (red bars), European Americans (EA) (green bars), and Latino/Hispanic Americans (HA) (orange bars). Overall, only a modest percentage (44%–72%) of SNPs common in at least one population are common in both populations. A substantial proportion of common SNPs in African Americans are common only in African Americans.
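The two percentages described in this caption can be sketched for a single pair of populations as follows. This is a minimal illustration that assumes per-population minor-allele frequencies are already available as aligned lists; the function name and the toy frequencies are hypothetical, and only the 10% threshold comes from the caption.

```python
def sharing_summary(maf_a, maf_b, threshold=0.10):
    """Summarize sharing of common SNPs between two populations.

    Considers only SNPs that are common (frequency >= threshold) in at least
    one of the two populations, then reports the fraction that are present in
    both populations and the fraction that are common in both.
    """
    common_any = [
        (a, b) for a, b in zip(maf_a, maf_b) if a >= threshold or b >= threshold
    ]
    if not common_any:
        raise ValueError("no SNP is common in either population")
    n = len(common_any)
    present_both = sum(1 for a, b in common_any if a > 0 and b > 0)
    common_both = sum(1 for a, b in common_any if a >= threshold and b >= threshold)
    return {
        "present_in_both": present_both / n,
        "common_in_both": common_both / n,
    }


# Hypothetical frequencies for five SNPs in two populations.
toy_summary = sharing_summary(
    [0.20, 0.15, 0.00, 0.05, 0.30],
    [0.25, 0.02, 0.12, 0.01, 0.00],
)
```

Separating "present in both" from "common in both" reflects the distinction the caption draws between panels A and B and panels C and D.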

Figure 4.

Contour plot of minor-SNP frequencies between pairs of populations. Plots compare frequencies of SNPs (_n_=38,145) excluding singletons. Each plot represents a scatterplot with minor-SNP frequency from a given population on each axis. Plots are divided into 3,600 grids (60×60 grids), and the number of data points within each grid is color coded. For example, purple represents 0.01 data points per grid, and red represents 100 data points per grid (see legend in the upper right-hand corner).
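The gridding step described here amounts to a two-dimensional histogram. The sketch below shows one way to bin paired frequencies into a square grid of cell counts. The function name, the assumed axis range of 0 to 1, and the toy values are illustrative assumptions; the caption does not state the axis range or how the counts were scaled for the color legend.

```python
def grid_counts(freq_x, freq_y, bins=60, max_freq=1.0):
    """Count paired frequencies in each cell of a bins x bins grid.

    grid[row][col] holds the number of SNPs whose y-axis frequency falls in
    bin `row` and whose x-axis frequency falls in bin `col`. Values equal to
    max_freq are placed in the last bin.
    """
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(freq_x, freq_y):
        col = min(int(x / max_freq * bins), bins - 1)
        row = min(int(y / max_freq * bins), bins - 1)
        grid[row][col] += 1
    return grid


# Hypothetical example with a coarse 4 x 4 grid for readability.
toy_grid = grid_counts([0.0, 1.0, 0.5], [0.0, 1.0, 0.25], bins=4)
```

With `bins=60` the same function yields the 3,600 cells mentioned in the caption; each cell count would then be mapped to a color.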

Figure 5.

Measures of SNP sharing among Latino/Hispanic (HA), African (AfA), Asian (AsA), and European (EA) Americans. For all figures, the _X_-axis represents overlapping bins (i.e., >0.05 represents all SNPs with MAF >0.05), and MAF is calculated across all 152 chromosomes. When two populations are compared, MAF is calculated separately for each population. A, Pairwise comparisons of the proportion of SNPs shared between populations. B, Mean differences of pairwise comparisons of MAF between SNPs. C, Spearman rank correlation coefficients among pairwise comparisons of MAF between SNPs. D, Pairwise F_ST estimates between SNPs. The solid black line in each figure represents the mean value, and the dotted lines indicate the CI of values estimated from 1,000 data sets in which individuals were randomly distributed into pairs of populations (see text for details). ns = nonsingletons.
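To make the permutation baseline in this caption concrete, the sketch below computes a simple F_ST for one biallelic SNP and builds a null distribution by reshuffling population labels. Several choices are assumptions, since the caption does not specify them: the estimator used here is the basic heterozygosity form (H_T − H_S)/H_T with equal population weights, which may differ from the paper's estimator, and labels are shuffled per chromosome rather than per individual for brevity. All names and toy data are hypothetical.

```python
import random


def fst_two_pops(p1, p2):
    """Heterozygosity-based F_ST for one biallelic SNP in two populations.

    Uses (H_T - H_S) / H_T with equal weights; this is an assumed, simple
    estimator, not necessarily the one used in the paper.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # SNP is monomorphic across both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def permutation_null(alleles_a, alleles_b, n_perm=1000, seed=0):
    """Null F_ST values from randomly reassigning chromosomes to two groups.

    alleles_a, alleles_b: non-empty lists of 0/1 alleles, one per chromosome.
    """
    rng = random.Random(seed)
    pooled = list(alleles_a) + list(alleles_b)
    n_a = len(alleles_a)
    n_b = len(pooled) - n_a
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        p1 = sum(pooled[:n_a]) / n_a
        p2 = sum(pooled[n_a:]) / n_b
        null.append(fst_two_pops(p1, p2))
    return null


# Hypothetical alleles at one SNP for two groups of ten chromosomes each.
toy_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
toy_observed = fst_two_pops(sum(toy_a) / len(toy_a), sum(toy_b) / len(toy_b))
toy_null = permutation_null(toy_a, toy_b)
```

Comparing `toy_observed` against the spread of `toy_null` plays the role of the dotted CI lines in the figure: an observed value outside that range suggests more differentiation than random grouping would produce.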

Figure 6.

Site-frequency distribution of synonymous (syn) (gray bars) and nonsynonymous (nonsyn) (black bars) SNPs for the total sample and for African Americans (AfA) versus non–African Americans.

Figure 7.

Estimation of population structure in GRP samples. AfA = African American; AsA = Asian American; EA = European American; HA = Latino/Hispanic American. A, Phylogenetic network based on genetic distances with the use of UPGMA. B, Plot of principal components (PCs) estimated from a genetic-distance matrix. C, Stacked bar chart with inferences from results of a model-based cluster analysis with the use of STRUCTURE 2.0. Each bar represents an individual, and each bar is divided according to the fraction of cluster membership. D, Triangle plot illustrating the percentage of African, Asian, and European American ancestry of each individual (indicated by colored shapes, as given in panel B) estimated from STRUCTURE 2.0.

