Summix2

Summix2 Functionalities

Summix2 has three primary methods to detect and leverage substructure in genetic summary data. The summix() function detects substructure proportions given allele frequencies (AFs) for the observed and reference groups, summix_local() detects local substructure proportions, and adjAF() adjusts AFs for an observed sample to match the substructure proportions of a target sample or individual.

Example of Summix2 Workflow

Before diving into running Summix2 functions and interpreting output, let’s start by considering a sample of African (AFR) individuals we are interested in studying. We see in the pie chart below that this sample has 100% African-like genetic substructure.

Let’s assume we need access to allele frequencies (AFs) from an appropriate control group (a group with 100% African genetic substructure) to complete the study. We can access publicly available AFs from the Genome Aggregation Database (gnomAD). However, gnomAD only offers AFs aggregated across 20,744 African/African American (AFR/AFRAM) individuals. For these AFs to be appropriate controls for a study of African individuals, the population substructure within gnomAD’s AFR/AFRAM AFs needs to be accounted for.

Here is where Summix2 comes in.

We begin by using summix() to capture the population substructure in the AFR/AFRAM AFs. We prepare a data frame containing the observed gnomAD AFR/AFRAM AFs and reference AFs from homogeneous groups. Here, we use AFs from homogeneous continental-level reference groups, sourced from the Human Genome Diversity Project and 1000 Genomes Project data released with gnomAD v3.1.2.

library(Summix)
print(head(ancestryData))
#>        POS REF   ALT CHROM reference_AF_afr reference_AF_eas reference_AF_eur
#> 1 31652001   T     A chr22      0.040925268        0.0000000      0.000000000
#> 2 34509945   C     G chr22      0.217971527        0.0000000      0.000000000
#> 3 34636589 CAA     C chr22      0.181117576        0.0000000      0.001149425
#> 4 38889885   A   AAG chr22      0.007117446        0.0000000      0.000000000
#> 5 49160931   G     T chr22      0.064056997        0.0000000      0.000000000
#> 6 17604199   C CAGGA chr22      0.219750879        0.1654624      0.070476185
#>   reference_AF_iam reference_AF_sas gnomad_AF_afr
#> 1       0.00000000       0.00000000    0.04171490
#> 2       0.00000000       0.00000000    0.18774500
#> 3       0.00000000       0.00000000    0.15198300
#> 4       0.00000000       0.00000000    0.00422064
#> 5       0.00000000       0.00000000    0.05445710
#> 6       0.09523803       0.07459678    0.19200400

Next, we apply summix() to the data frame, with the observed group set to gnomAD’s AFR/AFRAM AF vector (observed = "gnomad_AF_afr") and the reference groups set to all homogeneous continental-level reference groups (reference = c("reference_AF_afr", "reference_AF_eas", "reference_AF_eur", "reference_AF_iam", "reference_AF_sas")).

summix(data = ancestryData,
    reference=c("reference_AF_afr",
        "reference_AF_eas",
        "reference_AF_eur",
        "reference_AF_iam",
        "reference_AF_sas"),
    observed="gnomad_AF_afr")
#> Warning in nloptr::slsqp(starting, fn = fn.refmix, gr = gr.refmix, hin =
#> hin.refmix, : The old behavior for hin >= 0 has been deprecated. Please restate
#> the inequality to be <=0. The ability to use the old behavior will be removed
#> in a future release.
#>   goodness.of.fit iterations           time filtered reference_AF_afr
#> 1       0.4553673         20 0.5029516 secs        0         0.812142
#>   reference_AF_eur reference_AF_iam
#> 1         0.169953         0.017905

In this output, summix() estimates that the observed gnomAD AFR/AFRAM AFs contain approximately 81% AFR-like, 17% EUR-like, and 1.8% IAM-like mixing proportions. The EAS and SAS reference groups do not appear in the estimated proportions.
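To make these proportions concrete, the sketch below assumes the mixture model that Summix is described as fitting: each observed AF is approximated by a weighted sum of the reference AFs, with non-negative weights that sum to one. It recomputes the fitted AFs for the six SNPs printed above from the estimated proportions. The object names (ref, observed, pi_hat) are illustrative and are not part of the package.

```r
# Reference AFs for the six SNPs shown by head(ancestryData);
# only the groups with a reported mixing proportion are kept.
ref <- cbind(
  afr = c(0.040925268, 0.217971527, 0.181117576,
          0.007117446, 0.064056997, 0.219750879),
  eur = c(0, 0, 0.001149425, 0, 0, 0.070476185),
  iam = c(0, 0, 0, 0, 0, 0.09523803)
)

# Observed gnomAD AFR/AFRAM AFs for the same SNPs
observed <- c(0.04171490, 0.18774500, 0.15198300,
              0.00422064, 0.05445710, 0.19200400)

# Mixing proportions reported by summix() above
pi_hat <- c(afr = 0.812142, eur = 0.169953, iam = 0.017905)

# Fitted AF under the assumed mixture model: weighted sum of reference AFs
predicted <- as.vector(ref %*% pi_hat)
residual  <- observed - predicted

print(round(cbind(observed, predicted, residual), 5))
```

Six SNPs are far too few to judge the fit; the point is only to show what the proportions mean at the level of a single AF.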

A visual representation of the above process and output:

Now that we have captured the substructure within the genetic summary data, we use Summix2’s adjAF() to adjust the substructure to match that of the 100% African target sample.

When adjusting the genetic substructure, we use only the reference groups with non-zero mixing proportions in the observed or target samples, and we make sure the mixing proportions in pi.target and pi.observed are listed in the same order as the reference groups.
We also include the sample sizes for each of the reference groups (N_reference = c(704, 741, 47)) and for the observed group (N_observed = 20744) so that an effective sample size can be calculated for the adjusted AF vector.
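Because the ordering requirement is easy to get wrong, one option is to keep the proportions as named vectors and reorder them by the reference vector before passing them on. The sketch below uses only base R. The names pi_observed_arg and pi_target_arg are hypothetical, and the sum-to-one check is a sanity check added here, not something the package is known to require.

```r
reference <- c("reference_AF_afr", "reference_AF_eur", "reference_AF_iam")

# Named proportions, deliberately entered out of order
pi_observed <- c(reference_AF_eur = 0.169953,
                 reference_AF_afr = 0.812142,
                 reference_AF_iam = 0.017905)
pi_target   <- c(reference_AF_afr = 1,
                 reference_AF_eur = 0,
                 reference_AF_iam = 0)

# Reorder both vectors so they follow the order of `reference`
pi_observed <- pi_observed[reference]
pi_target   <- pi_target[reference]

# Sanity checks: every reference group has a proportion, and each set sums to 1
stopifnot(
  !anyNA(pi_observed), !anyNA(pi_target),
  abs(sum(pi_observed) - 1) < 1e-6,
  abs(sum(pi_target) - 1) < 1e-6
)

# Plain numeric vectors, ready to pass as pi.observed and pi.target
pi_observed_arg <- unname(pi_observed)
pi_target_arg   <- unname(pi_target)
print(pi_observed_arg)
```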


adjusted_data <- adjAF(data = ancestryData,
     reference = c("reference_AF_afr", "reference_AF_eur", "reference_AF_iam"),
     observed = "gnomad_AF_afr",
     pi.target = c(1, 0, 0),
     pi.observed = c(0.812142, 0.169953, 0.017905),
     adj_method = 'average',
     N_reference = c(704, 741, 47),
     N_observed = 20744,
     filter = TRUE)
#> 
#> 
#> [1] "Note: In this AF adjustment, 0 SNPs (with adjusted AF > -.005 & < 0) were rounded to 0. 0 SNPs (with adjusted AF > 1) were rounded to 1, and 0 SNPs (with adjusted AF <= -.005) were removed from the final results."
#> 
#> [1] $pi
#>          ref.group pi.observed pi.target
#> 1 reference_AF_afr    0.812142         1
#> 2 reference_AF_eur    0.169953         0
#> 3 reference_AF_iam    0.017905         0
#> 
#> [1] $observed.data
#> [1] "observed AF data to update: 'gnomad_AF_afr'"
#> 
#> [1] $Nsnps
#> [1] 1000
#> 
#> 
#> [1] $effective.sample.size
#> [1] 17551
#> 
#> 
#> [1] "use $adjusted.AF$adjustedAF to see adjusted AF data"
#> 
#> 
#> [1] "Note: The accuracy of the AF adjustment is likely lower for rare variants (< .5%)."

We see that no SNPs were removed from the AF-adjusted data frame (only SNPs with an adjusted AF of -0.005 or lower are removed). Importantly, the effective sample size of the adjusted AF is 17,551, which can be important for downstream analyses.
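As a rough check on where this number might come from, the reported effective sample size is numerically consistent with the AFR-like share of the observed sample plus the AFR reference sample size. This is an observation about the numbers in this example only. It is not a statement of how adjAF() computes the value, and n_eff_guess is a hypothetical name.

```r
N_observed      <- 20744     # gnomAD AFR/AFRAM sample size
pi_observed_afr <- 0.812142  # estimated AFR-like proportion
N_reference_afr <- 704       # AFR reference sample size

# Assumed reading: AFR-like part of the observed sample plus the AFR reference
n_eff_guess <- N_observed * pi_observed_afr + N_reference_afr
print(round(n_eff_guess))
```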

We can take a look at the adjusted AF data frame, and the appended adjusted AF vector (adjustedAF).

print(adjusted_data$adjusted.AF[1:5,])
#>        POS REF ALT CHROM reference_AF_afr reference_AF_eas reference_AF_eur
#> 1 31652001   T   A chr22      0.040925268                0      0.000000000
#> 2 34509945   C   G chr22      0.217971527                0      0.000000000
#> 3 34636589 CAA   C chr22      0.181117576                0      0.001149425
#> 4 38889885   A AAG chr22      0.007117446                0      0.000000000
#> 5 49160931   G   T chr22      0.064056997                0      0.000000000
#>   reference_AF_iam reference_AF_sas gnomad_AF_afr  adjustedAF
#> 1                0                0    0.04171490 0.044404861
#> 2                0                0    0.18774500 0.222371894
#> 3                0                0    0.15198300 0.183044358
#> 4                0                0    0.00422064 0.006477272
#> 5                0                0    0.05445710 0.065055887
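The five adjusted values printed above can be reproduced in base R under one assumed reading of adj_method = 'average': for each reference group k in turn, the contributions of the other groups are subtracted from the observed AF, the remainder is rescaled by pi.target[k] / pi.observed[k], the other groups are added back at their target proportions, and the resulting estimates are averaged across k. This formula was inferred for illustration and happens to agree with the printed output; it is not confirmed to be what adjAF() does internally, and it omits the filtering and rounding step. The names leave_one_out and adjusted_sketch are hypothetical.

```r
# First five SNPs from the output above, for the three reference groups used
ref <- cbind(
  reference_AF_afr = c(0.040925268, 0.217971527, 0.181117576,
                       0.007117446, 0.064056997),
  reference_AF_eur = c(0, 0, 0.001149425, 0, 0),
  reference_AF_iam = c(0, 0, 0, 0, 0)
)
observed    <- c(0.04171490, 0.18774500, 0.15198300, 0.00422064, 0.05445710)
pi_observed <- c(0.812142, 0.169953, 0.017905)
pi_target   <- c(1, 0, 0)

# Adjusted AF when reference group k is the one left out (assumed formula)
leave_one_out <- function(k) {
  others <- setdiff(seq_along(pi_observed), k)
  ref_others <- ref[, others, drop = FALSE]
  # Observed AF with the other groups' contributions removed
  remainder <- observed - as.vector(ref_others %*% pi_observed[others])
  # Rescale to the target proportion, then add the others back at target weights
  (pi_target[k] / pi_observed[k]) * remainder +
    as.vector(ref_others %*% pi_target[others])
}

# One column per left-out group, then average across the columns
loo <- sapply(seq_along(pi_observed), leave_one_out)
adjusted_sketch <- rowMeans(loo)

# adjustedAF values copied from the printed output
printed <- c(0.044404861, 0.222371894, 0.183044358, 0.006477272, 0.065055887)
print(cbind(adjusted_sketch, printed))
```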

Using Summix2, we can harmonize the genetic substructure across multiple data sets for secondary analyses.

An in-depth look at all Summix2 functionalities