Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis

Barbara E Engelhardt et al. PLoS Genet. 2010.

Abstract

We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more "continuous," as in isolation-by-distance models.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Low-dimensional matrix factorization via factor analysis.

Each matrix in Equation 1 is illustrated by a blue rectangle and labeled. As in Equation 2, a single element of the genotype matrix is shown in red; it is computed from the product of the appropriate factor loading and factor vectors plus the corresponding random error term (all highlighted in red).
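The element-wise relation described in this caption can be sketched in a few lines. This is only an illustration under assumed conventions: loadings are stored with one row per individual, factors with one row per factor, and the names `genotype_entry`, `loadings`, `factors`, and `errors`, along with all numeric values, are hypothetical and not taken from the paper.

```python
def genotype_entry(loadings, factors, errors, i, j):
    """Entry (i, j): loading row i dotted with factor column j, plus the error term."""
    num_factors = len(factors)
    signal = sum(loadings[i][k] * factors[k][j] for k in range(num_factors))
    return signal + errors[i][j]


# Toy data (hypothetical): 2 individuals, 2 factors, 3 markers.
loadings = [[1.0, 0.0],
            [0.5, 0.5]]
factors = [[0.2, 0.8, 0.4],
           [0.6, 0.1, 0.9]]
errors = [[0.0, 0.0, 0.0],
          [0.1, -0.1, 0.0]]

# One element of the toy genotype matrix (individual 1, marker 0).
entry = genotype_entry(loadings, factors, errors, 1, 0)
```

Computing every entry this way amounts to the matrix product of loadings and factors plus the error matrix.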

Figure 2. Illustration of two different ways that African and European individuals could be represented.

In the first (sparse) representation, shown in the first row, the factors (shown in red) each represent the mean allele frequencies of either the African population or the European population; this leads to sparse loadings (shown in blue) for each individual, since the African individuals are loaded only on the factor representing the African population, and likewise for the European individuals. In the second (non-sparse) representation, shown in the second row, each factor is a combination of the African and European mean allele frequencies, and each individual is loaded onto both factors. Note that the two representations are equivalent, as shown by the equations under the table. Whereas SFA and admixture-based models tend to choose the first representation because of the sparse priors and implicit regularization, PCA tends towards the second representation (although the actual factors depend on other features of the data, such as the sample sizes of both groups).
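The equivalence of a sparse and a non-sparse representation can be checked numerically. The sketch below is not the construction used in the figure: the allele frequencies, the mixing matrix `mix`, and all variable names are assumptions chosen for illustration. It relies only on the fact that inserting an invertible matrix and its inverse between loadings and factors leaves their product unchanged.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [[sum(a[i][m] * b[m][j] for m in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical mean allele frequencies at three markers.
f_african = [0.1, 0.9, 0.5]
f_european = [0.7, 0.3, 0.5]

# Sparse representation: each individual loads on exactly one factor.
sparse_loadings = [[1.0, 0.0], [1.0, 0.0],   # two African individuals
                   [0.0, 1.0], [0.0, 1.0]]   # two European individuals
sparse_factors = [f_african, f_european]

# Non-sparse representation: mix the factors with an invertible matrix
# and compensate in the loadings with its inverse.
mix = [[1.0, 1.0], [1.0, -1.0]]
mix_inverse = [[0.5, 0.5], [0.5, -0.5]]
dense_loadings = matmul(sparse_loadings, mix_inverse)
dense_factors = matmul(mix, sparse_factors)

# Both products give the same expected genotype matrix.
sparse_product = matmul(sparse_loadings, sparse_factors)
dense_product = matmul(dense_loadings, dense_factors)
```

In this toy case every individual has non-zero loadings on both mixed factors, yet the two products agree, so the data alone cannot distinguish the representations; the choice comes from the constraints or priors.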

Figure 3. Results of applying SFA, PCA, and admixture to the HapMap genotype data.

Each plot shows the estimated loadings (y-axis) across individuals (x-axis). SFA loadings are in the first row, PCA loadings in the second, and admixture loadings in the third. European individuals are denoted with blue ‘x’s, African individuals with red triangles, and Asian individuals with green ‘+’s. A dashed horizontal line is at zero on the y-axis.

Figure 4. Estimated factor loadings from PCA, SFAm, SFA, and admixture for the 1-D isolation-by-distance simulation.

In each plot the individuals are colored and ordered along the x-axis by location in the 1-D habitat.

Figure 5. Estimated scaled factors from SFA and admixture on the 1-D isolation-by-distance simulation against the generating allele frequencies.

In each plot the factors (y-axis) are plotted against the population allele frequencies for the closest-matching population. The SFA factors were truncated to have a minimum of zero and scaled to have a maximum of one. The dashed diagonal line shows y = x.
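The truncation and scaling step mentioned in this caption could look like the following minimal sketch. It assumes the operation is applied to one factor vector at a time, which the caption does not state; the function name `truncate_and_scale` and the example values are hypothetical.

```python
def truncate_and_scale(values):
    """Clip negative values to zero, then divide by the maximum so the largest value is one."""
    clipped = [max(v, 0.0) for v in values]
    top = max(clipped)
    if top == 0.0:
        # Nothing positive to scale by; return the clipped values unchanged.
        return clipped
    return [v / top for v in clipped]


# Hypothetical factor values for four markers.
scaled = truncate_and_scale([-0.5, 0.0, 1.0, 2.0])
```

After this step the values lie in the unit interval, which puts them on the same scale as allele frequencies for plotting.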

Figure 6. Results of SFA, PCA, SFAm, and admixture applied to simulated genotype data from a single 2-D habitat.

In Panel A, each dot represents a population colored according to location. In Panel B, each plot is of the loadings across individuals against each other, where the colors correspond to their locations in Panel A. The first row shows the three SFA loadings against each other from a three factor model. The second row shows the second two PCA loadings, the SFAm loadings, and the mapped admixture loadings (see text for details). All of the methods recapitulate, to a greater or lesser extent, the geographical structure of the habitats (up to rotation).

Figure 7. Results on simulated genotype data from two independent 2-D habitats.

In Panel A, each dot represents a population colored according to habitat and location. Colors in Panels B and C indicate locations in Panel A. Panel B shows how SFA captures the structure with a six factor model. Loadings on the first three factors (first row of Panel B) correspond to location in the first habitat; individuals in the second habitat have essentially zero loading on these factors. Similarly, loadings on the other three factors (second row of Panel B) correspond to location in the second habitat. Panel C shows estimated loadings from PCA for the same data. Each plot shows one loading plotted against another. Although the PCA results clearly reflect the underlying structure, one might struggle to infer that structure from visual inspection of these plots if the colors were unknown.

Figure 8. Results from SFA, admixture, and PCA for the clustered 1-D simulation.

All plots show the individuals on the x-axis (colored and ordered by location with respect to the 1-D clustered isolation-by-distance model) plotted against the estimated loadings.

Figure 9. Results from PCA, SFAm, and admixture for the POPRES European data.

These results were rotated (but not rescaled) to make the correspondence to the map of Europe more immediately obvious. The results from SFAm are very similar to the results from PCA for these data, effectively recapitulating the geography of Europe.
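Rotating a pair of loadings without rescaling, as described in this caption, preserves all distances between individuals. The sketch below illustrates this with a plain 2-D rotation; the function name `rotate`, the sample points, and the 90 degree angle are assumptions for illustration, not the rotation applied to the POPRES results.

```python
import math


def rotate(points, angle_degrees):
    """Rotate 2-D points counter-clockwise about the origin without changing their scale."""
    theta = math.radians(angle_degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


# Hypothetical pairs of loadings for two individuals.
original = [(1.0, 0.0), (3.0, 4.0)]
rotated = rotate(original, 90.0)

# The distance of each point from the origin is unchanged by the rotation.
norms_before = [math.hypot(x, y) for x, y in original]
norms_after = [math.hypot(x, y) for x, y in rotated]
```

Because only the orientation changes, the relative positions of individuals in the plot are the same before and after rotation.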

Figure 10. Plot of estimated admixture proportions of each Indian group versus the relative admixture proportions from SFA on the Indian data set.

This plot shows good correlation between the relative admixture proportions from SFA and the estimated admixture proportions from previous work. The colors coding the groups are described in the India map.
