Regression Models with Ordered Categorical Outcomes: clarification of Example
Something in the example at Regression Models with Ordered Categorical Outcomes — PyMC example gallery has thrown me off.
This is showing how to use a Dirichlet distribution for priors over cutpoints in an ordered response model.
What’s confusing to me is that the Dirichlet-version cutpoints are constrained to lie in the [0, 1] interval, yet (just as in the other, sorted-normals approach) no scale or offset (constant) is allowed for eta. Surely one normally needs either flexible cutpoints or a flexible scale/offset?
Maybe you didn’t scroll far enough to the right in the display? It takes a Dirichlet, applies a cumulative sum, then rescales by multiplying by max - min and adding min.
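For concreteness, here is a minimal sketch of that construction (illustrative names, not the gallery's code verbatim), meant to be called inside a pm.Model() context with K response categories:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

def constrained_cutpoints(K, cut_min, cut_max):
    # Simplex of K-2 gaps; its cumulative sum lies in (0, 1] and ends exactly at 1
    gaps = pm.Dirichlet("cutpoint_gaps", a=np.ones(K - 2))
    return pm.Deterministic(
        "cutpoints",
        pt.concatenate(
            [
                np.ones(1) * cut_min,  # first cutpoint pinned at cut_min
                cut_min + (cut_max - cut_min) * pt.cumsum(gaps),  # remaining cutpoints
            ]
        ),
    )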
The other way to do this directly is by taking a base value c[0] and then defining c[1] = c[0] + delta[1], where you constrain delta > 0, for example by drawing it from a lognormal distribution. The lognormal also helps keep the boundaries from collapsing into each other by ensuring delta > 0 (its density goes to zero as the variate goes to zero).
If you’re OK with making them all positive, you can just take c_\text{diffs} \sim \textrm{lognormal}(...) and then set c = \textrm{cumulativeSum}(c_\text{diffs}). Or you can model the diffs as lognormal and give c[0] some kind of unconstrained prior.
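As a rough sketch of that last variant (hypothetical names and priors, not code from the gallery): give c[0] an unconstrained prior and build the remaining cutpoints from strictly positive lognormal gaps, so the ordering holds by construction.

import pymc as pm
import pytensor.tensor as pt

K = 5  # number of response categories (assumed for illustration)
with pm.Model() as model:
    c0 = pm.Normal("c0", 0, 3)  # unconstrained first cutpoint
    diffs = pm.LogNormal("c_diffs", 0, 1, shape=K - 2)  # positive gaps between cutpoints
    cutpoints = pm.Deterministic(
        "cutpoints", c0 + pt.concatenate([pt.zeros(1), pt.cumsum(diffs)])
    )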
Yet another option is to use the ordered transform:
cp_scale = pm.HalfNormal("cutpoint_sd", sigma=2)
cp_len = len(obs.columns) - 1
cutpoints = pm.ZeroSumNormal(
    "cutpoints",
    sigma=cp_scale,
    transform=pm.distributions.transforms.ordered,
    initval=np.linspace(-1, 1, cp_len),
    shape=cp_len,
)
I remember the ordered transform not behaving nicely with the zero-sum constraint. In general it’s better to encode non-default constraints in the generative graph, so that the forward model matches the model logp.
Thanks, Bob (et al). I think I understand what you’re suggesting. My query was just because min and max seemed to be fixed, not parameters to be estimated. For now I have adapted the gallery example to get this:
def unconstrainedUniform(K, itsname='cutpoints'):
    # Offset and scale for cutpoints
    c0 = pm.Normal(itsname + "_c0", 0, 3)
    s = pm.LogNormal(itsname + "_scale")
    return pm.Deterministic(
        itsname,
        c0
        + s
        * pt.concatenate(
            [
                np.zeros(1),
                pt.cumsum(pm.Dirichlet(itsname + "_interior", a=np.ones(K - 2))),
            ]
        ),
    )
Cutpoint models have a multiplicative non-identifiability that needs to be resolved. You have covariates x \in \mathbb{R}^{N \times K} and a regression coefficient vector \beta \in \mathbb{R}^K, and you compare the linear predictor x \cdot \beta to the cutpoints, e.g., x \cdot \beta < c_n. You can multiply \beta by a constant and the cutpoint vector c by a constant and get the same result. So you need to identify the scale in some way. One way to do that is to fix max and min. You could also just leave it as a simplex without a max and min, and that would identify the scale.
Well, I know this is supposed to be simple, so sorry I’m obviously missing something.
To me it seems that rescaling \beta and c by the same constant is not a symmetry, i.e., it does not give the same probabilities, unless you also rescale epsilon in the link function. But in our case epsilon is fixed.
Similarly, what is the additive identifiability constraint/symmetry? At the moment there is neither a free offset parameter for the cutpoints nor a constant term in eta.
I do not see the translational symmetry that one introduces by including one of those.
If I compare the 0-to-1 cutpoints approach to the sorted-normals cutpoint approach, neither the scale nor the location of the set of normals is pinned.
Sorry—I got the wrong non-identifiability. There’s an additive non-identifiability, not a multiplicative one.
Here’s the definition of the ordered logistic distribution from the Stan Functions Reference (indexed from 1 as is traditional in stats, but presumably PyMC indexes from 0):
If K \in \mathbb{N} with K > 2, c \in \mathbb{R}^{K-1} such that
c_k < c_{k+1} for k \in \{1,\ldots,K-2\}, and \eta \in \mathbb{R}, then for k \in \{1,\ldots,K\},
\text{OrderedLogistic}(k~|~\eta,c) = \left\{ \begin{array}{ll} 1 - \text{logit}^{-1}(\eta - c_1) & \text{if } k = 1, \\[4pt] \text{logit}^{-1}(\eta - c_{k-1}) - \text{logit}^{-1}(\eta - c_{k}) & \text{if } 1 < k < K, \text{and} \\[4pt] \text{logit}^{-1}(\eta - c_{K-1}) - 0 & \text{if } k = K. \end{array} \right.
From this definition, it’s clear to see the additive non-identifiability between \eta and c. This comes up again if \eta = x \cdot \beta is formulated as a regression and the regression has an intercept—then it’s the intercept term and the cutpoints that have the non-identifiability.
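Concretely, for any constant a \in \mathbb{R}, \text{logit}^{-1}\big((\eta + a) - (c_k + a)\big) = \text{logit}^{-1}(\eta - c_k) for every k, so shifting \eta and all of the cutpoints by the same amount leaves each category probability in the definition above unchanged.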
Thanks Bob.
Alas I’m no less confused.
I think what you’re saying is precisely why I asked my original question. The example code has an offset for neither \eta nor the cutpoints c, and I don’t understand how it can accommodate arbitrary data that way. It seems to me the \eta - c_k differences need to be able to find the right part of the sigmoid link to correctly fit the data, requiring the option to translate (in addition to the freedom of scaling \beta). My revised code allows for a scaling factor for the cutpoints, but it also allows them to slide (instead of adding a constant to x \cdot \beta).
Where in this do we disagree?
I’m not sure what you’re doing, so I’m not trying to disagree with anything specific.
To get a cutpoint model to fit, something has to fix the location of either the cutpoints c or the linear predictors \eta. You can’t let the cutpoints c and linear predictor \eta shift freely or you don’t identify the differences. For example, adding an intercept as a column of x in \eta = x^\top \cdot \beta will be a problem if you have an offset parameter for the cutpoints.
A simple way to identify a cutpoint model is to set the smallest cutpoint to 0 and thus restrict the other ones to be positive.
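A minimal sketch of that identification strategy (hypothetical names; any strictly positive prior on the gaps would do):

import pymc as pm
import pytensor.tensor as pt

K = 5  # number of response categories (assumed for illustration)
with pm.Model() as model:
    gaps = pm.HalfNormal("cutpoint_gaps", sigma=2, shape=K - 2)  # strictly positive increments
    cutpoints = pm.Deterministic(
        "cutpoints",
        pt.concatenate([pt.zeros(1), pt.cumsum(gaps)]),  # smallest cutpoint fixed at 0
    )

With the cutpoint location pinned this way, a free intercept in \eta is identified.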
The context (what I’m doing) is actually just the PyMC example here: Regression Models with Ordered Categorical Outcomes — PyMC example gallery
It seems to have neither kind of offset (no intercept in eta and no cutpoint offset).
I went back and read the original post again.
The approach you link does identify the location and scale of the cutpoints with max and min. So it constrains them to lie in (min, max), not necessarily in (0, 1). The cumulative sum over the Dirichlet gives you cutpoints in (0, 1), and then they get scaled to range between min and max rather than 0 and 1. It actually forces the initial cutpoint to be min.
Scaling the regression coefficients beta gives you scale flexibility that’s only controlled by the prior. But it doesn’t give you a varying offset. I think you could add an intercept to those models if you fix min and max as this example does (min = 0 and max = K when there are K categories). What I would do is add an intercept and see how it gets fit with min and max fixed.
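A sketch of that experiment under stated assumptions (toy data and hypothetical names; the cutpoints are pinned to the range (0, K) as in the gallery, with a free intercept alpha added to eta):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

K = 5  # number of ordered categories (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # toy covariate matrix
y = rng.integers(0, K, size=200)  # toy ordinal responses coded 0..K-1

with pm.Model() as model:
    # Cutpoints fixed to (0, K): first pinned at 0, last forced to K
    gaps = pm.Dirichlet("cutpoint_gaps", a=np.ones(K - 2))
    cutpoints = pm.Deterministic(
        "cutpoints", pt.concatenate([pt.zeros(1), K * pt.cumsum(gaps)])
    )
    # Linear predictor with a free intercept; identified because the cutpoint location is fixed
    alpha = pm.Normal("alpha", 0, 3)
    beta = pm.Normal("beta", 0, 1, shape=X.shape[1])
    eta = alpha + pt.dot(X, beta)
    pm.OrderedLogistic("y_obs", eta=eta, cutpoints=cutpoints, observed=y)

Because the cutpoint range is fixed, the intercept alpha absorbs the overall location, so the two are not competing over the same degree of freedom.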