Beta and Dirichlet regression for continuous proportion data
Hello,
I was hoping to get some advice on how to fit some continuous proportion data using pymc/bambi. Essentially, I have multiple measurements for a given mixture, and the mixture can be a binary or ternary composition. For each sample, the proportion is calculated from the measured mass. Therefore, for a binary mixture:
p_{A} = \frac{M_A}{M_A + M_B}
p_{B} = \frac{M_B}{M_A + M_B} = 1-p_A
And for a ternary mixture:
p_{A} = \frac{M_A}{M_A + M_B + M_C}
p_{B} = \frac{M_B}{M_A + M_B + M_C}
p_{C} = \frac{M_C}{M_A + M_B + M_C} = 1 - p_A - p_B
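In code, these proportions are just a row-wise normalization of the measured masses; a minimal sketch with hypothetical mass values:

```python
import numpy as np

# Hypothetical measured masses for N samples of a ternary mixture (columns A, B, C)
masses = np.array([
    [1.2, 0.8, 2.0],
    [0.5, 0.5, 1.0],
    [3.0, 1.0, 1.0],
])

# Normalize each row by its total mass to obtain the proportions p_A, p_B, p_C
proportions = masses / masses.sum(axis=1, keepdims=True)

# Each row now sums to 1, as required for a Dirichlet-distributed response
print(proportions.sum(axis=1))
```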
For the binary mixture case, I believe all I need is a beta regression, which is straightforward in Bambi. Here is an example of the dataframe:
Using the example in Beta Regression — Bambi 0.8.0 documentation, I’m able to fit this binary data quite well while also taking into account the study level:
model = bmb.Model("p_A ~ 1 + (1|study)", data, family="beta")
fitted = model.fit()
However, I run into issues when attempting to fit a ternary mixture where the data looks like this:
This isn’t quite a Dirichlet-multinomial problem because I’m dealing with continuous proportions rather than count data. Does Bambi have analogous capabilities for a Dirichlet regression where n > 2? If not, can I get some help setting up the model with pm.Dirichlet, where p is observed as opposed to some measured counts?
Thank you in advance for the help.
I will let one of the more bambi-savvy folks answer your actual question. However, I will note that there are several examples of this type of model built in pymc (rather than bambi). See this thread (and Alex’s chiming in with a pointer to this notebook), as well as the discussion on this old issue. If you’re looking to get your hands dirty a bit, those may help you get going (and of course you can ask here if they aren’t exactly what you need).
@zult unfortunately Bambi does not support anything related to Dirichlet distribution yet
zult April 6, 2023, 2:35pm 4
Hello,
I’m revisiting this problem for a hierarchical Dirichlet regression and I was hoping to get some insight into setting up the GLM for this problem.
As outlined above, there are cases where we need to predict the composition of a mixture for more than two compounds. In this more generic case, the reported fractions for a sample contain compound A, B, and C with measurements from multiple studies and we’d like to account for these study differences through a hierarchical model.
My attempt to solve this goes as follows. Assume we have X_norm and X_features. Here, X_norm is an Nx3 matrix (N measurements) containing the fractional composition (A, B, and C) of each measurement; the sum of X_norm across each row is 1. X_features contains the study designation and features for each measurement.
Using these two matrices, I create a design matrix using patsy and factorize the studies.
import pandas as pd
import patsy
study_idx, study_codes = pd.factorize(X_features['STUDY_ID'])
feature_design = patsy.dmatrix('1 + feature1 + feature2', X_features, return_type='dataframe')
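As a small illustration of what this step produces (using hypothetical study IDs and features, and assembling the intercept-plus-features matrix by hand with pandas instead of patsy, to keep the sketch dependency-free):

```python
import pandas as pd

# Hypothetical feature table: one row per measurement
X_features = pd.DataFrame({
    'STUDY_ID': ['s1', 's1', 's2', 's3', 's2'],
    'feature1': [0.1, 0.2, 0.3, 0.4, 0.5],
    'feature2': [1.0, 0.0, 1.0, 0.0, 1.0],
})

# Map each study label to an integer index, plus the array of unique labels
study_idx, study_codes = pd.factorize(X_features['STUDY_ID'])

# Hand-built equivalent of patsy.dmatrix('1 + feature1 + feature2', ...):
# an intercept column of ones alongside the raw feature columns
feature_design = X_features[['feature1', 'feature2']].copy()
feature_design.insert(0, 'Intercept', 1.0)

print(study_idx)        # -> [0 0 1 2 1]
print(len(study_codes)) # -> 3
```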
Using these variables, I now build the hierarchical Dirichlet GLM.
import pymc as pm
import pytensor.tensor as pt

coords = {'chem': X_norm.columns, 'sample': X_norm.index, 'indep': feature_design.columns, 'study': study_codes}

with pm.Model(coords=coords) as model:
    descriptors = pm.Data('descriptors', feature_design.values, mutable=True, dims=("sample", "indep"))
    y = pm.Data('y', X_norm.values, mutable=True, dims=("sample", "chem"))

    # Global concentration scale
    theta = pm.Normal('theta', 0, 1)
    exptheta = pm.Deterministic('exptheta', pm.math.exp(theta))

    # Descriptor coefficients, non-centered across studies
    mu_D = pm.Normal('mu_D', 0, 1, dims=("indep", "chem"))
    sigma_D = pm.HalfNormal('sigma_D', 0.1, dims=("indep", "chem"))
    D_offset = pm.Normal('D_offset', mu=0, sigma=1, dims=("study", "indep", "chem"))
    D = pm.Deterministic('D', mu_D + sigma_D * D_offset, dims=("study", "indep", "chem"))

    # Index the study-specific coefficients for each sample
    Ds = pm.Deterministic('Ds', D[study_idx], dims=("sample", "indep", "chem"))

    # descriptors has shape (sample, indep): one row of independent variables per sample
    # Ds has shape (sample, indep, chem): an (indep, chem) coefficient matrix for every sample
    # For the linear model, compute the per-sample dot product over the 'indep' dimension
    # pt.batched_dot treats the leading 'sample' dimension as the batch dimension
    Dk = pm.Deterministic('Dk', pt.batched_dot(descriptors, Ds), dims=("sample", "chem"))

    eta = pm.Deterministic("eta", pm.math.invlogit(Dk) * exptheta, dims=("sample", "chem"))
    pm.Dirichlet('measure', a=eta, observed=y, dims=("sample", "chem"))
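The shape logic of the pt.batched_dot step can be checked in plain numpy: for each sample, the (indep,) feature row is contracted with that sample's (indep, chem) coefficient slice, which is the einsum 'si,sic->sc'. A quick sketch with random arrays:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sample, n_indep, n_chem = 6, 3, 4

descriptors = rng.normal(size=(n_sample, n_indep))   # (sample, indep)
Ds = rng.normal(size=(n_sample, n_indep, n_chem))    # (sample, indep, chem)

# Per-sample dot product: contract the 'indep' axis for each sample
Dk = np.einsum('si,sic->sc', descriptors, Ds)        # (sample, chem)

# Same result computed with an explicit loop over samples
Dk_loop = np.stack([descriptors[s] @ Ds[s] for s in range(n_sample)])
print(np.allclose(Dk, Dk_loop))  # -> True
```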
This appears to do a nice job of handling all the mixture cases where n_chem >= 2. However, the model is very sensitive to the choice of prior on sigma_D; for example, sigma_D ~ HalfNormal(1) can lead to convergence issues.
My questions are:
- Are there any glaring issues with this approach for modeling a hierarchical Dirichlet regression GLM?
- Is there a more 'bambi-esque' way to include the studies as a random effect in the design matrix, as opposed to manually indexing each study-specific descriptor?
Thanks as always to the community for the help with these technical questions.
Kev September 4, 2024, 3:35am 5
Hi. I too am trying to make a regression where the Ys are float values of subcategories that sum to 1 (observed Dirichlet outputs). Your links are helpful, but they use pm.Categorical, which assumes discrete count data as the response variable, whereas the data I'm trying to model is continuous (i.e., a single trial can produce fractions of multiple categories). Do you have an example of this?
ckrapu September 8, 2024, 3:51am 6
@Kev, is the previous response from zult suitable to answer your question? You may also want to consult this link: Dirichlet Regression with PyMC | R-bloggers.
The softmax function is suitable for transforming a linear, unconstrained variable to the unit simplex as desired in Dirichlet regression, so you should be able to take any examples using multinomial/softmax regression and simply replace the Categorical likelihood with the Dirichlet one to come up with a valid model.
My mistake - the transformation just needs to be positive, as @ricardoV94 mentioned below.
You don’t need the softmax for a Dirichlet regression, right? Since alpha is just constrained to be positive.
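The distinction can be seen in a few lines of numpy: any map to strictly positive values (e.g. exp) gives valid Dirichlet concentrations, while softmax additionally forces each row onto the unit simplex, which pins the total concentration to 1 (the arrays here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
eta = rng.normal(size=(5, 3))  # hypothetical linear predictor (sample, chem)

# exp link: strictly positive concentrations, total concentration unconstrained
alpha_exp = np.exp(eta)

# softmax link: also positive, but each row is normalized to sum to 1,
# which is unnecessarily restrictive for Dirichlet concentration parameters
alpha_softmax = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)

print((alpha_exp > 0).all())                         # -> True
print(np.allclose(alpha_softmax.sum(axis=1), 1.0))   # -> True
```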
Kev September 15, 2024, 4:31am 8
Thanks!