Beta and Dirichlet regression for continuous proportion data
Hello,
I was hoping to get some advice on how to fit some continuous proportion data using pymc/bambi. Essentially, I have multiple measurements for a given mixture, and the mixture can be a binary or ternary composition. For each sample, the proportion is calculated from the measured mass. Therefore, for a binary mixture:
p_{A} = \frac{M_A}{M_A + M_B}
p_{B} = \frac{M_B}{M_A + M_B} = 1-p_A
And for a ternary mixture:
p_{A} = \frac{M_A}{M_A + M_B + M_C}
p_{B} = \frac{M_B}{M_A + M_B + M_C}
p_{C} = \frac{M_C}{M_A + M_B + M_C} = 1 - p_A - p_B
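In code, these proportions are just a row-wise normalization of the measured masses; a minimal sketch with hypothetical mass values:

```python
import numpy as np

# Hypothetical measured masses for N samples of a ternary mixture (columns A, B, C)
masses = np.array([
    [1.2, 0.8, 2.0],
    [0.5, 0.5, 1.0],
    [3.0, 1.0, 1.0],
])

# Normalize each row by its total mass to obtain the proportions p_A, p_B, p_C
proportions = masses / masses.sum(axis=1, keepdims=True)

# Each row now sums to 1, as required for a Dirichlet-distributed response
print(proportions.sum(axis=1))
```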
For the binary mixture case, I believe all I need is a beta regression, which is straightforward in Bambi. Here is an example of the dataframe:
Using the example in Beta Regression — Bambi 0.8.0 documentation, I’m able to fit this binary data quite well while also taking into account the study level:
model = bmb.Model("p_A ~ 1 + (1|study)", data, family="beta")
fitted = model.fit()
However, I run into issues when attempting to fit a ternary mixture where the data looks like this:
This isn’t quite a Dirichlet-multinomial problem because I’m dealing with continuous proportions rather than count data. Does Bambi have analogous capabilities for a Dirichlet regression where n > 2? If not, can I get some help setting up the model with pm.Dirichlet, where p is observed as opposed to some measured counts?
Thank you in advance for the help.
I will let one of the more bambi-savvy folks answer your actual question. However, I will note that there are several examples of this type of model built in pymc (rather than bambi). See this thread (and Alex’s chiming in with a pointer to this notebook), as well as the discussion on this old issue. If you’re looking to get your hands dirty a bit, those may help you get going (and of course you can ask here if they aren’t exactly what you need).
@zult unfortunately Bambi does not support anything related to Dirichlet distribution yet
zult April 6, 2023, 2:35pm 4
Hello,
I’m revisiting this problem for a hierarchical Dirichlet regression and I was hoping to get some insight into setting up the GLM for this problem.
As outlined above, there are cases where we need to predict the composition of a mixture for more than two compounds. In this more generic case, the reported fractions for a sample contain compound A, B, and C with measurements from multiple studies and we’d like to account for these study differences through a hierarchical model.
My attempt to solve this goes as follows. Assume we have X_norm and X_features. Here, X_norm is an Nx3 matrix (N measurements) containing the fractional composition (A, B, and C) of each measurement; the sum of X_norm across each row is 1. X_features contains the study designation and features for each measurement.
Using these two matrices, I create a design matrix using patsy and factorize the studies.
import pandas as pd
import patsy
study_idx, study_codes = pd.factorize(X_features['STUDY_ID'])
feature_design = patsy.dmatrix('1 + feature1 + feature2', X_features, return_type='dataframe')
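As a small illustration of what this step produces (using hypothetical study IDs and features, and assembling the intercept-plus-features matrix by hand with pandas instead of patsy, to keep the sketch dependency-free):

```python
import pandas as pd

# Hypothetical feature table: one row per measurement
X_features = pd.DataFrame({
    'STUDY_ID': ['s1', 's1', 's2', 's3', 's2'],
    'feature1': [0.1, 0.2, 0.3, 0.4, 0.5],
    'feature2': [1.0, 0.0, 1.0, 0.0, 1.0],
})

# Map each study label to an integer index, plus the array of unique labels
study_idx, study_codes = pd.factorize(X_features['STUDY_ID'])

# Hand-built equivalent of patsy.dmatrix('1 + feature1 + feature2', ...):
# an intercept column of ones alongside the raw feature columns
feature_design = X_features[['feature1', 'feature2']].copy()
feature_design.insert(0, 'Intercept', 1.0)

print(study_idx)        # -> [0 0 1 2 1]
print(len(study_codes)) # -> 3
```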
Using these variables, I now build the hierarchical Dirichlet GLM.
import pymc as pm
import pytensor.tensor as pt

coords = {'chem': X_norm.columns, 'sample': X_norm.index, 'indep': feature_design.columns, 'study': study_codes}

with pm.Model(coords=coords) as model:
    descriptors = pm.Data('descriptors', feature_design.values, mutable=True, dims=("sample", "indep"))
    y = pm.Data('y', X_norm.values, mutable=True, dims=("sample", "chem"))

    # Global concentration scale
    theta = pm.Normal('theta', 0, 1)
    exptheta = pm.Deterministic('exptheta', pm.math.exp(theta))

    # Descriptor coefficients, non-centered across studies
    mu_D = pm.Normal('mu_D', 0, 1, dims=("indep", "chem"))
    sigma_D = pm.HalfNormal('sigma_D', 0.1, dims=("indep", "chem"))
    D_offset = pm.Normal('D_offset', mu=0, sigma=1, dims=("study", "indep", "chem"))
    D = pm.Deterministic('D', mu_D + sigma_D * D_offset, dims=("study", "indep", "chem"))

    # Index the study-specific coefficients for each sample
    Ds = pm.Deterministic('Ds', D[study_idx], dims=("sample", "indep", "chem"))

    # descriptors has shape (sample, indep): one row of independent variables per sample
    # Ds has shape (sample, indep, chem): an (indep, chem) coefficient matrix for every sample
    # For the linear model, compute the per-sample dot product over the 'indep' dimension
    # pt.batched_dot treats the leading 'sample' dimension as the batch dimension
    Dk = pm.Deterministic('Dk', pt.batched_dot(descriptors, Ds), dims=("sample", "chem"))

    eta = pm.Deterministic("eta", pm.math.invlogit(Dk) * exptheta, dims=("sample", "chem"))
    pm.Dirichlet('measure', a=eta, observed=y, dims=("sample", "chem"))
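The shape logic of the pt.batched_dot step can be checked in plain numpy: for each sample, the (indep,) feature row is contracted with that sample's (indep, chem) coefficient slice, which is the einsum 'si,sic->sc'. A quick sketch with random arrays:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sample, n_indep, n_chem = 6, 3, 4

descriptors = rng.normal(size=(n_sample, n_indep))   # (sample, indep)
Ds = rng.normal(size=(n_sample, n_indep, n_chem))    # (sample, indep, chem)

# Per-sample dot product: contract the 'indep' axis for each sample
Dk = np.einsum('si,sic->sc', descriptors, Ds)        # (sample, chem)

# Same result computed with an explicit loop over samples
Dk_loop = np.stack([descriptors[s] @ Ds[s] for s in range(n_sample)])
print(np.allclose(Dk, Dk_loop))  # -> True
```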
This appears to do a nice job of handling all the mixture cases where n_chem >= 2. However, the model is very sensitive to the choice of prior on sigma_D; for example, sigma_D ~ HalfNormal(1) can lead to convergence issues.
My questions are:
- Are there any glaring issues with this approach for modeling a hierarchical Dirichlet regression GLM?
- Is there a more 'bambi-esque' way to include the studies as a random effect in the design matrix, as opposed to manually indexing each study-specific descriptor?
Thanks as always to the community for the help with these technical questions.
Kev September 4, 2024, 3:35am 5
Hi. I too am trying to make a regression where the Ys are float values of subcategories that sum to 1 (observed Dirichlet outputs). Your links are helpful, but they use pm.Categorical, which assumes discrete count data as the response variable, whereas the data I'm trying to model is continuous (i.e., a single trial can produce fractions of multiple categories). Do you have an example of this?
ckrapu September 8, 2024, 3:51am 6
@Kev, is the previous response from zult suitable to answer your question? You may also want to consult this link: Dirichlet Regression with PyMC | R-bloggers.
The softmax function is suitable for transforming a linear, unconstrained variable to the unit simplex as desired in Dirichlet regression, so you should be able to take any examples using multinomial/softmax regression and simply replace the Categorical likelihood with the Dirichlet one to come up with a valid model.
My mistake - the transformation just needs to be positive, as @ricardoV94 mentioned below.
You don’t need the softmax for a Dirichlet regression, right? Since alpha is just constrained to be positive.
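The distinction can be seen in a few lines of numpy: any map to strictly positive values (e.g. exp) gives valid Dirichlet concentrations, while softmax additionally forces each row onto the unit simplex, which pins the total concentration to 1 (the arrays here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
eta = rng.normal(size=(5, 3))  # hypothetical linear predictor (sample, chem)

# exp link: strictly positive concentrations, total concentration unconstrained
alpha_exp = np.exp(eta)

# softmax link: also positive, but each row is normalized to sum to 1,
# which is unnecessarily restrictive for Dirichlet concentration parameters
alpha_softmax = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)

print((alpha_exp > 0).all())                         # -> True
print(np.allclose(alpha_softmax.sum(axis=1), 1.0))   # -> True
```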
Kev September 15, 2024, 4:31am 8
Thanks!