Clines, clusters, and the effect of study design on the inference of human population structure
Noah A Rosenberg et al. PLoS Genet. 2005 Dec.
Abstract
Previously, we observed that without using prior information about individual sampling locations, a clustering algorithm applied to multilocus genotypes from worldwide human populations produced genetic clusters largely coincident with major geographic regions. It has been argued, however, that the degree of clustering is diminished by use of samples with greater uniformity in geographic distribution, and that the clusters we identified were a consequence of uneven sampling along genetic clines. Expanding our earlier dataset from 377 to 993 markers, we systematically examine the influence of several study design variables--sample size, number of loci, number of clusters, assumptions about correlations in allele frequencies across populations, and the geographic dispersion of the sample--on the "clusteredness" of individuals. With all other variables held constant, geographic dispersion is seen to have comparatively little effect on the degree of clustering. Examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side. Thus, analysis of the 993-locus dataset corroborates our earlier results: if enough markers are used with a sufficiently large worldwide sample, individuals can be partitioned into genetic clusters that match major geographic subdivisions of the globe, with some individuals from intermediate geographic locations having mixed membership in the clusters that correspond to neighboring regions.
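The abstract names "clusteredness" without defining it. The sketch below shows one way such a score could be computed from estimated membership fractions. It assumes the statistic is the mean, over individuals, of the normalized distance between each membership vector and the uniform vector (1/K, ..., 1/K). The abstract does not state the paper's definition, so the function name, the normalization, and the example inputs are illustrative assumptions only.

```python
import math


def clusteredness(memberships):
    """Mean normalized distance of membership vectors from (1/K, ..., 1/K).

    `memberships` is a list of per-individual lists of K membership
    fractions that sum to 1. Under the assumed normalization, the score is
    0 when every individual is split evenly across all K clusters and 1
    when every individual belongs entirely to a single cluster.
    """
    if not memberships:
        raise ValueError("need at least one individual")
    total = 0.0
    for q in memberships:
        k = len(q)
        if k < 2:
            raise ValueError("need at least two clusters")
        squared = sum((q_k - 1.0 / k) ** 2 for q_k in q)
        total += math.sqrt(k / (k - 1) * squared)
    return total / len(memberships)


# Hypothetical membership fractions, not values from the paper.
fully_assigned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
evenly_mixed = [[1.0 / 3, 1.0 / 3, 1.0 / 3]]
two_way_mix = [[0.5, 0.5, 0.0]]
```

Under this assumed definition, a sample in which most individuals fall almost entirely into one cluster scores near 1, while a sample of individuals with evenly mixed membership scores near 0.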
Conflict of interest statement
Competing interests. The authors have declared that no competing interests exist.
Figures
Figure 1. Distribution of the Geographic Dispersion Statistic (A_n) for Sets of 100 Points Randomly Sampled from a Sphere, Randomly Sampled from the Land Area of the Earth (from among the Points Plotted in Figure 5 of [11]), and Randomly Sampled from the Reported Locations of Individuals in the Dataset
Each distribution is obtained by binning the values of A_n for 100,000 sets of points.
Figure 2. Inferred Population Structure Based on 1,048 Individuals and 993 Markers, Assuming Correlations among Allele Frequencies across Clusters
Each individual is represented by a thin line partitioned into K colored segments that represent the individual's estimated membership fractions in K clusters. Each plot, produced with DISTRUCT [23], is based on the highest-likelihood run of ten runs: the two runs that were used in further analysis, and the eight runs described under “Cluster Analysis using STRUCTURE.” As in [3], four of ten runs with K = 3 separated a cluster corresponding to East Asia instead of one corresponding to Europe, the Middle East, and Central/South Asia. Two of ten runs with K = 5 separated Surui instead of Oceania. The highest-likelihood run of the ten runs with K = 6, shown in the figure, had a different pattern from the other nine runs (not shown). These other runs, instead of subdividing native Americans into two clusters, subdivided a cluster roughly similar to the Kalash cluster seen in [3], except with a less pronounced separation of the Kalash population. The clusteredness scores for the plots shown with K = 2, 3, 4, 5, and 6 are 0.50, 0.76, 0.84, 0.86, and 0.87, respectively.
Figure 3. Mean Clusteredness versus Number of Loci
Each point shows the mean clusteredness of 2,000 runs with the specified sample size and allele frequency correlation model: two replicates for each of ten sets of loci for each of 100 sets of individuals (for 1,048 individuals, it is the mean of 20 runs, as only one set of individuals was used; for 1,048 individuals and 993 loci, it is the mean of two runs, as only one set of loci was used). Error bars denote standard deviations. The _x_-axis is plotted on a logarithmic scale.
Figure 4. Mean Clusteredness versus Geographic Dispersion as Measured by A_n
Each point shows the mean clusteredness of 20 runs with the specified number of loci and allele frequency correlation model: two replicates for each of ten sets of loci (for 993 loci, it is the mean of two runs, as only one set of loci was used). From left to right, the three groups of points in each plot respectively represent sets of 100, 250, and 500 individuals.
Figure 5. Inferred Population Structure Based on Two Different Sets of 100 Individuals, Using 993 Markers and the Correlated Allele Frequencies Model
The two sets of 100 individuals represent extremes of the distribution of A_n: the plots on the left are based on a more geographically random sample, and those on the right are based on a less random sample. Each plot is based on the higher-likelihood run among the two runs performed with the given combination of loci and individuals. In all plots, individuals and populations are in the same order as in Figure 2. Black vertical lines at the bottom of the figure separate populations from the different geographic regions described in [3], with the asterisk representing Oceania.
Figure 6. Genetic and Geographic Distance for Pairs of Populations
Red circles indicate comparisons between pairs of populations with majority representation in the same cluster in the K = 5 plot of Figure 2; blue triangles indicate pairs with one population from Eurasia and one from East Asia; brown squares indicate pairs with one population from Africa and the other from Eurasia; and green diamonds indicate pairs with one population from East Asia and the other from either Oceania or America. Comparisons involving one of Hazara, Kalash, and Uygur and other populations from Eurasia or East Asia are marked 1, 2, and 3, respectively. No comparisons are shown between any of these three groups and any African population.
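The comparison of genetic and geographic distance described in the abstract and plotted in Figure 6 can be illustrated with a small sketch. It assumes geographic distance is a great-circle (haversine) distance and that a barrier "jump" is the mean amount by which cross-barrier pairs lie above a line fitted to same-side pairs. The abstract specifies neither the paper's distance calculation nor its regression model, so the helper names and the toy numbers below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def barrier_jump(pairs):
    """Mean residual of cross-barrier pairs about the same-side trend line.

    `pairs` holds (geographic_km, genetic_distance, crosses_barrier) tuples.
    """
    same = [(d, g) for d, g, crosses in pairs if not crosses]
    cross = [(d, g) for d, g, crosses in pairs if crosses]
    slope, intercept = fit_line([d for d, _ in same], [g for _, g in same])
    residuals = [g - (intercept + slope * d) for d, g in cross]
    return sum(residuals) / len(residuals)


# Toy data, not values from the paper: same-side pairs lie on one line,
# and cross-barrier pairs sit 0.005 above it.
toy_pairs = [
    (1000.0, 0.010, False),
    (2000.0, 0.020, False),
    (3000.0, 0.030, False),
    (2000.0, 0.025, True),
    (4000.0, 0.045, True),
]
```

In this toy setup, a positive value from `barrier_jump(toy_pairs)` corresponds to the pattern the abstract describes: pairs on opposite sides of a barrier show more genetic distance than their geographic separation alone would predict.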
References
- Bowcock AM, Ruiz-Linares A, Tomfohrde J, Minch E, Kidd JR, et al. High resolution of human evolutionary trees with polymorphic microsatellites. Nature. 1994;368:455–457.
- Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, et al. Genetic structure of human populations. Science. 2002;298:2381–2385.
Grants and funding
- GM28016/GM/NIGMS NIH HHS/United States
- R01 GM028016/GM/NIGMS NIH HHS/United States
- T32 HG00044/HG/NHGRI NIH HHS/United States
- N01HV48141/HV/NHLBI NIH HHS/United States
- T32 HG000044/HG/NHGRI NIH HHS/United States
- HV48141/HV/NHLBI NIH HHS/United States