PCA-correlated SNPs for structure identification in worldwide human populations

Peristera Paschou et al. PLoS Genet. 2007 Sep.

Abstract

Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA) and recent results in theoretical computer science, we present a novel algorithm that, applied to genome-wide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using fewer than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied to large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
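
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the scoring formula, so the code below makes loud assumptions: each SNP is scored by the sum of its squared loadings on the top k principal components, the genotype matrix is coded -1/0/+1 and mean-centered per SNP, and k is supplied by the caller. The function names (`pca_correlated_snps`, `kmeans`), the tiny k-means routine, and the synthetic two-population data are hypothetical and exist only for illustration; they are not the authors' implementation or data.

```python
import numpy as np


def pca_correlated_snps(X, k, r):
    """Return the indices of the r highest-scoring SNPs and all SNP scores.

    X is an (individuals x SNPs) genotype matrix. The score of a SNP is
    ASSUMED to be the sum of its squared loadings on the top k principal
    components; this formula is an assumption of the sketch.
    """
    Xc = X - X.mean(axis=0)                       # center each SNP column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T                                 # (SNPs x k) loadings
    scores = (Vk ** 2).sum(axis=1)                # one score per SNP
    top = np.argsort(scores)[::-1][:r]            # r largest scores
    return top, scores


def kmeans(Y, n_clusters, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [Y[0]]
    for _ in range(1, n_clusters):
        d = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d)])           # farthest point so far
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                 # nearest center
        for c in range(n_clusters):
            if np.any(labels == c):               # skip empty clusters
                centers[c] = Y[labels == c].mean(axis=0)
    return labels


# Synthetic data (hypothetical): two populations, 300 SNPs, of which
# only the first 20 differ in allele frequency between the populations.
rng = np.random.default_rng(0)
n_per_pop, n_snps, n_inf = 50, 300, 20
freq = np.full((2, n_snps), 0.5)
freq[0, :n_inf], freq[1, :n_inf] = 0.9, 0.1
X = np.vstack([
    rng.binomial(2, freq[pop], size=(n_per_pop, n_snps))
    for pop in range(2)
]) - 1.0                                          # code genotypes as -1/0/+1
truth = np.repeat([0, 1], n_per_pop)

# Select 10 SNPs without using `truth`, then cluster on those SNPs only.
top, scores = pca_correlated_snps(X, k=1, r=10)
Xs = X[:, top] - X[:, top].mean(axis=0)
_, _, Vt_small = np.linalg.svd(Xs, full_matrices=False)
Y = Xs @ Vt_small[:1].T                           # project on the top PC
labels = kmeans(Y, n_clusters=2)
print("selected SNP indices:", sorted(top.tolist()))
```

Note that the population labels in `truth` are never passed to the selection or clustering steps, which mirrors the unsupervised setting the abstract describes; they are kept only so the result can be checked afterwards.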

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Average Correlation Coefficient between True and Predicted Membership of an Individual to a Particular Population or Continental Region, Using PCA and _k_-Means Clustering on all Available SNPs for a Given Geographic Region, and Sets of Ten to 200 PCA-Correlated, High-In or Random SNPs (Random Selection Was Repeated 30 Times)

The reported correlation coefficient is averaged over all populations in the respective geographic region or over the broad continental clusters.
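
One way such an averaged correlation could be computed is sketched below. The exact definition used for the figure is not given here, so this is an assumption: membership is encoded as a 0/1 indicator vector per population, each population is matched to the cluster whose indicator correlates best with it, and the Pearson coefficients are averaged over populations. The function names are hypothetical.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def mean_membership_correlation(true_labels, cluster_labels):
    """Average, over populations, of the best indicator correlation.

    ASSUMPTION: each population is paired with the cluster whose 0/1
    indicator vector has the highest correlation with the population's
    own indicator vector, because cluster numbering is arbitrary.
    """
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    clusters = np.unique(cluster_labels)
    corrs = []
    for pop in np.unique(true_labels):
        t = (true_labels == pop).astype(float)
        best = max(
            pearson(t, (cluster_labels == c).astype(float))
            for c in clusters
        )
        corrs.append(best)
    return float(np.mean(corrs))


# Cluster numbers are swapped relative to the true labels, but the
# partition is identical, so the score does not penalize the renaming.
perfect = mean_membership_correlation([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
# One individual from the first population lands in the wrong cluster.
one_error = mean_membership_correlation([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0])
print(perfect, one_error)
```

Matching by best correlation keeps the measure independent of how the clustering algorithm happens to number its clusters; a stricter one-to-one matching would be a reasonable alternative.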

Figure 2. Selecting PCA-Correlated SNPs for Intercontinental Clustering

(A) Raster plot of 255 subjects from four different continental regions with respect to 9,419 SNPs (red/green denotes homozygotic individuals and black denotes heterozygotic individuals). (B) The scores p_j for each SNP. A red star indicates SNPs corresponding to one of the top 30 scores. (C) Raster plot of the 255 subjects with respect to the top 30 PCA-correlated SNPs. Notice the patterns formed in the four continental blocks. (D) Plot of the 255 subjects in the “optimal” 2-D space using the top 30 PCA-correlated SNPs. (E) Raster plot of the 255 subjects with respect to the top 30 In SNPs. Notice that the blocks corresponding to Asia and Europe are slightly more entangled when compared to (C).

Figure 3. Cross-Validation of Structure Informative SNPs Selected for Intercontinental Clustering

(A, B) Split of our worldwide sample into a 50% training set and a 50% test set. Average correlation coefficient between true and predicted membership of an individual to a continental region using sets of (A) ten to 200 PCA-correlated or (B) ten to 200 high-In SNPs selected on the training set, and application of the same sets of selected SNPs to the test set (results are averaged over 50 training/test set splits). (C) Application of the SNP panels selected for intercontinental clustering in our worldwide sample to the HapMap populations (average correlation coefficient between true and predicted membership of an individual to one of three continents is shown).

Figure 4. Analysis of 1.7 Million SNPs Typed on the HapMap Han Chinese and Japanese populations (Available from the HapMap Database)

(A) Projection of all 90 Han Chinese and Japanese individuals on the top two principal components using PCA on all available SNPs. (B) _k_-Means clustering on panel (A). (C) Average correlation coefficient between true and predicted membership of an individual to the Japanese or Han Chinese populations, using PCA and _k_-means clustering on all available SNPs and sets of 50 to 1,000 PCA-correlated, high-In or random SNPs (random selection was repeated 30 times). The dotted line represents a decline in the performance of high-In SNPs due to the detection of a very large number of significant principal components; see Results for details.

Figure 5. Analysis of Nine Indigenous Populations Typed for 9,160 SNPs

(A) Projection of all individuals of nine indigenous populations on the top three principal components using PCA on all available SNPs. (Ten significant principal components were actually detected.) (B) Average correlation coefficient between true and predicted membership of the individuals to the nine populations, using PCA and _k_-means clustering on all available SNPs and sets of ten to 400 PCA-correlated, high-In or random SNPs (random selection was repeated 30 times).

Figure 6. Applying PCA-Correlated SNPs for Structure and Ancestry Prediction of the Admixed Puerto Rican Population

(A) PCA on 7,259 SNPs typed on Puerto Rican dataset A, as well as Europeans (Spanish and Caucasians), West Africans (Burunge), and Native Americans (Nahua and Quechua) (axes of variation are shown). (B) Projection of 192 individuals from Puerto Rican dataset A on two significant principal components and variation across the European-West African axis. (C) Comparison of the ancestry coefficient of 192 Puerto Ricans across the West African-European axis and the predicted ancestry coefficient using the top 200 PCA-correlated SNPs. (D) Prediction of the West African-European ancestry coefficient in Puerto Rican dataset A using PCA-correlated SNPs versus random SNPs. (E) Using PCA-correlated SNPs selected as structure informative in Puerto Rican dataset A for ancestry prediction in Puerto Rican dataset B.
