Regression Models with Ordered Categorical Outcomes: clarification of Example
Something in the example at Regression Models with Ordered Categorical Outcomes — PyMC example gallery has thrown me off.
This is showing how to use a Dirichlet distribution for priors over cutpoints in an ordered response model.
What’s confusing to me is that the Dirichlet-version cutpoints are constrained to lie in the [0, 1] interval, yet (just as in the other, sorted-normals approach) no scale or offset (constant) is allowed for eta. Surely one normally needs either flexible cutpoints or a flexible scale/offset?
Maybe you didn’t scroll far enough to the right in the display? It takes a Dirichlet, applies a cumulative sum, then rescales by multiplying by max - min and adding min.
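For concreteness, here is a minimal sketch of that construction (illustrative names, not the gallery's code verbatim), meant to be called inside a pm.Model() context with K response categories:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

def constrained_cutpoints(K, cut_min, cut_max):
    # Simplex of K-2 gaps; its cumulative sum lies in (0, 1] and ends exactly at 1
    gaps = pm.Dirichlet("cutpoint_gaps", a=np.ones(K - 2))
    return pm.Deterministic(
        "cutpoints",
        pt.concatenate(
            [
                np.ones(1) * cut_min,  # first cutpoint pinned at cut_min
                cut_min + (cut_max - cut_min) * pt.cumsum(gaps),  # remaining cutpoints
            ]
        ),
    )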
The other way to do this directly is by taking a base value c[0] and then defining c[1] = c[0] + delta[1], where you constrain delta > 0, for example by drawing it from a lognormal distribution. The lognormal also helps keep the boundaries from collapsing into each other by ensuring delta > 0 (its density goes to zero as the variate goes to zero).
If you’re OK with making them all positive, you can just take c_\text{diffs} \sim \textrm{lognormal}(...) and then set c = \textrm{cumulativeSum}(c_\text{diffs}). Or you can model the diffs as lognormal and give c[0] some kind of unconstrained prior.
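As a rough sketch of that last variant (hypothetical names and priors, not code from the gallery): give c[0] an unconstrained prior and build the remaining cutpoints from strictly positive lognormal gaps, so the ordering holds by construction.

import pymc as pm
import pytensor.tensor as pt

K = 5  # number of response categories (assumed for illustration)
with pm.Model() as model:
    c0 = pm.Normal("c0", 0, 3)  # unconstrained first cutpoint
    diffs = pm.LogNormal("c_diffs", 0, 1, shape=K - 2)  # positive gaps between cutpoints
    cutpoints = pm.Deterministic(
        "cutpoints", c0 + pt.concatenate([pt.zeros(1), pt.cumsum(diffs)])
    )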
Yet another option is to use the ordered transform:
cp_scale = pm.HalfNormal("cutpoint_sd", sigma=2)
cp_len = len(obs.columns) - 1
cutpoints = pm.ZeroSumNormal(
    "cutpoints",
    sigma=cp_scale,
    transform=pm.distributions.transforms.ordered,
    initval=np.linspace(-1, 1, cp_len),
    shape=cp_len,
)
I remember the ordered transform not behaving nicely with the zero-sum constraint. In general it’s better to encode non-default constraints in the generative graph, so that the forward model matches the model logp.
Thanks, Bob (et al). I think I understand what you’re suggesting. My query was just because min and max seemed to be fixed, not parameters to be estimated. For now I have adapted the gallery example to get this:
def unconstrainedUniform(K, itsname='cutpoints'):
    # Offset and scale for cutpoints
    c0 = pm.Normal(itsname + "_c0", 0, 3)
    s = pm.LogNormal(itsname + "_scale")
    return pm.Deterministic(
        itsname,
        c0
        + s
        * pt.concatenate(
            [
                np.zeros(1),
                pt.cumsum(pm.Dirichlet(itsname + "_interior", a=np.ones(K - 2))),
            ]
        ),
    )
Cutpoint models have a multiplicative non-identifiability that needs to be resolved. You have covariates x \in \mathbb{R}^{N \times K} and a regression coefficient vector \beta \in \mathbb{R}^K, and you compare the linear predictor x \cdot \beta to the cutpoints, e.g., x \cdot \beta < c_n. You can multiply \beta by a constant and the cutpoint vector c by a constant and get the same result. So you need to identify the scale in some way. One way to do that is to fix max and min. You could also just leave it as a simplex without a max and min, and that would identify the scale.
Well, I know this is supposed to be simple, so sorry I’m obviously missing something.
To me it seems that rescaling \beta and c by the same constant is not a symmetry, i.e., it does not give the same probabilities, unless you also rescale epsilon in the link function. But in our case epsilon is fixed.
Similarly, what is the additive identifiability constraint/symmetry? At the moment there is neither a free offset parameter for the cutpoints nor a constant term in eta.
I do not see the translational symmetry that one introduces by including one of those.
If I compare the 0-to-1 cutpoints approach to the sorted-normals cutpoint approach, neither the scale nor the location of the set of normals is pinned.
Sorry—I got the wrong non-identifiability. There’s an additive non-identifiability, not a multiplicative one.
Here’s the definition of the ordered logistic distribution from the Stan Functions Reference (indexed from 1 as is traditional in stats, but presumably PyMC indexes from 0):
If K \in \mathbb{N} with K > 2, c \in \mathbb{R}^{K-1} such that
c_k < c_{k+1} for k \in \{1,\ldots,K-2\}, and \eta \in \mathbb{R}, then for k \in \{1,\ldots,K\},
\text{OrderedLogistic}(k~|~\eta,c) = \left\{ \begin{array}{ll} 1 - \text{logit}^{-1}(\eta - c_1) & \text{if } k = 1, \\[4pt] \text{logit}^{-1}(\eta - c_{k-1}) - \text{logit}^{-1}(\eta - c_{k}) & \text{if } 1 < k < K, \text{and} \\[4pt] \text{logit}^{-1}(\eta - c_{K-1}) - 0 & \text{if } k = K. \end{array} \right.
From this definition, it’s clear to see the additive non-identifiability between \eta and c. This comes up again if \eta = x \cdot \beta is formulated as a regression and the regression has an intercept—then it’s the intercept term and the cutpoints that have the non-identifiability.
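Concretely, for any constant a \in \mathbb{R}, \text{logit}^{-1}\big((\eta + a) - (c_k + a)\big) = \text{logit}^{-1}(\eta - c_k) for every k, so shifting \eta and all of the cutpoints by the same amount leaves each category probability in the definition above unchanged.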
Thanks Bob.
Alas I’m no less confused.
I think what you’re saying is precisely why I asked my original question. The example code has an offset for neither \eta nor the cutpoints c, and I don’t understand how it can accommodate arbitrary data that way. It seems to me the \eta - c_k differences need to be able to find the right part of the sigmoid link to correctly fit the data, requiring the option to translate (in addition to the freedom of scaling \beta). My revised code allows for a scaling factor for the cutpoints, but it also allows them to slide (instead of adding a constant to x \cdot \beta).
Where in this do we disagree?
I’m not sure what you’re doing, so I’m not trying to disagree with anything specific.
To get a cutpoint model to fit, something has to fix the location of either the cutpoints c or the linear predictors \eta. You can’t let the cutpoints c and linear predictor \eta shift freely or you don’t identify the differences. For example, adding an intercept as a column of x in \eta = x^\top \cdot \beta will be a problem if you have an offset parameter for the cutpoints.
A simple way to identify a cutpoint model is to set the smallest cutpoint to 0 and thus restrict the other ones to be positive.
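A minimal sketch of that identification strategy (hypothetical names; any strictly positive prior on the gaps would do):

import pymc as pm
import pytensor.tensor as pt

K = 5  # number of response categories (assumed for illustration)
with pm.Model() as model:
    gaps = pm.HalfNormal("cutpoint_gaps", sigma=2, shape=K - 2)  # strictly positive increments
    cutpoints = pm.Deterministic(
        "cutpoints",
        pt.concatenate([pt.zeros(1), pt.cumsum(gaps)]),  # smallest cutpoint fixed at 0
    )

With the cutpoint location pinned this way, a free intercept in \eta is identified.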
The context (what I’m doing) is actually just the PyMC example here: Regression Models with Ordered Categorical Outcomes — PyMC example gallery
It seems to have neither kind of offset (no intercept in eta and no cutpoint offset).
I went back and read the original post again.
The approach you link does identify the location and scale of the cutpoints with max and min. So it constrains them to lie in (min, max), not necessarily in (0, 1). The cumulative sum over the Dirichlet gives you cutpoints in (0, 1), and then they get scaled to range between min and max rather than 0 and 1. It actually forces the initial cutpoint to be min.
Scaling the regression coefficients beta gives you scale flexibility that’s only controlled by the prior. But it doesn’t give you a varying offset. I think you could add an intercept to those models if you fix min and max as this example does (min = 0 and max = K when there are K categories). What I would do is add an intercept and see how it gets fit with min and max fixed.
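A sketch of that experiment under stated assumptions (toy data and hypothetical names; the cutpoints are pinned to the range (0, K) as in the gallery, with a free intercept alpha added to eta):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

K = 5  # number of ordered categories (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # toy covariate matrix
y = rng.integers(0, K, size=200)  # toy ordinal responses coded 0..K-1

with pm.Model() as model:
    # Cutpoints fixed to (0, K): first pinned at 0, last forced to K
    gaps = pm.Dirichlet("cutpoint_gaps", a=np.ones(K - 2))
    cutpoints = pm.Deterministic(
        "cutpoints", pt.concatenate([pt.zeros(1), K * pt.cumsum(gaps)])
    )
    # Linear predictor with a free intercept; identified because the cutpoint location is fixed
    alpha = pm.Normal("alpha", 0, 3)
    beta = pm.Normal("beta", 0, 1, shape=X.shape[1])
    eta = alpha + pt.dot(X, beta)
    pm.OrderedLogistic("y_obs", eta=eta, cutpoints=cutpoints, observed=y)

Because the cutpoint range is fixed, the intercept alpha absorbs the overall location, so the two are not competing over the same degree of freedom.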