Prior for Hierarchical Negative Binomial Regression
nrieger April 10, 2025, 10:06am 1
Hey everyone,
I could use a bit of help from the community on a model convergence issue.
I’m working with a simple Negative Binomial (NB) regression. Everything runs smoothly with the pooled model, but when I switch to a hierarchical NB setup, convergence becomes unstable and Rhat values are consistently above 1.00.
I’ve done prior predictive checks and things seem okay, but my gut* tells me something might be off with how I’ve specified the hierarchical priors.
Does anyone see any red flags in my prior setup that could be messing with convergence?
Here’s a simplified model diagram and the relevant code. Happy to provide a full minimal working example with data if needed!
import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model(coords=coords) as model:
    # Set the data
    subdivision_idx = pm.Data("subdivision_idx", data["subdivision_idx"].values, dims="survey")
    obs = pm.Data("obs", data["n_items"].values, dims="survey")

    # Priors
    nu = pm.Normal("nu", np.log(150), 1.5)  # <- I suppose the error is here...
    kappa = pm.HalfNormal("kappa", 0.2)  # <- ... or here
    log_mu = pm.Normal("log_mu", nu, kappa, dims="group")
    mu = pm.Deterministic("mu", pt.exp(log_mu))

    # Likelihood
    observed = pm.NegativeBinomial(
        "litter",
        mu=mu[subdivision_idx],
        alpha=1,
        observed=obs,
    )
Looking at the ESS, kappa seems to be the problem here…
Thanks in advance! Appreciate any ideas or insights.
(*) For context, I ran into a similar issue recently when using a hierarchical prior for the dispersion parameter alpha, and switching to modeling 1/alpha instead helped a lot.
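In code, that switch looked roughly like this (a minimal sketch; the HalfNormal prior and its scale here are illustrative assumptions, not my actual choice):

import pymc as pm

with pm.Model() as m:
    # Prior on the inverse dispersion 1/alpha instead of alpha directly
    inv_alpha = pm.HalfNormal("inv_alpha", 1.0)  # illustrative prior
    alpha = pm.Deterministic("alpha", 1 / inv_alpha)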
Hi, @nrieger!
Have you tried standardizing/scaling your data? It should improve convergence when sampling, and after sampling you can always recover your estimates on the original scale.
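For illustration, the usual pattern is something like this (data and numbers purely hypothetical):

import numpy as np

# Standardize, fit on the standardized scale, then back-transform
y = np.array([120.0, 95.0, 210.0, 160.0])    # hypothetical data
y_mean, y_sd = y.mean(), y.std()
y_std = (y - y_mean) / y_sd                  # fit the model to y_std

mu_std = 0.3                                 # e.g. a posterior mean on the standardized scale
mu_orig = mu_std * y_sd + y_mean             # recover the estimate on the original scale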
nrieger April 10, 2025, 11:38am 3
Thanks for the suggestion, @Dekermanjian! I did consider that, but I’m not sure it makes sense in my case. Since I’m working with count data, scaling it directly could mess with one of its key properties, namely that it’s discrete.
That said, because Negative Binomial regression uses a log link, the data is effectively scaled in the model. In my case, log(mu) falls roughly between 3 and 7, and I’ve tried to reflect that in how I set up my priors (np.log(150) ≈ 5.0, so nu with a standard deviation of 1.5 covers roughly 2 to 8 within two standard deviations).
Okay, I understand. I also noticed that you are setting alpha in the NegativeBinomial likelihood to a static value of 1. Is there a particular reason why you are doing this? I would try setting a prior on alpha, something like pm.Exponential("alpha", lam=1/10).
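In the model from your first post, that change would look roughly like this (just a sketch, untested on your data):

    # inside the model block, replacing the fixed alpha=1:
    alpha = pm.Exponential("alpha", lam=1 / 10)
    observed = pm.NegativeBinomial(
        "litter",
        mu=mu[subdivision_idx],
        alpha=alpha,
        observed=obs,
    )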
You can try a non-centered parameterization for your hierarchy, as sketched below. This is often nicer to sample. A zero-avoiding prior on kappa, like Gamma(2, 1), can also help, although you are technically making a strong assumption when you do that: that the different sub-divisions must be at least somewhat different. Estimating sigma = 0 on the hierarchical parameter is a valid result; it rejects the presence of hierarchy (e.g. all units are the same). In practice, I find that a very small but non-zero sigma is good enough.
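A minimal sketch of the non-centered version of the model from the first post (reusing coords and data from there; everything else unchanged):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model(coords=coords) as model:
    subdivision_idx = pm.Data("subdivision_idx", data["subdivision_idx"].values, dims="survey")
    obs = pm.Data("obs", data["n_items"].values, dims="survey")

    nu = pm.Normal("nu", np.log(150), 1.5)
    kappa = pm.HalfNormal("kappa", 0.2)
    # Non-centered: sample standardized group offsets, then shift and scale them
    z = pm.Normal("z", 0, 1, dims="group")
    log_mu = pm.Deterministic("log_mu", nu + kappa * z, dims="group")
    mu = pm.Deterministic("mu", pt.exp(log_mu), dims="group")

    observed = pm.NegativeBinomial(
        "litter", mu=mu[subdivision_idx], alpha=1, observed=obs
    )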
nrieger April 10, 2025, 2:52pm 6
Thanks @Dekermanjian and @jessegrabowski! The non-centered parameterization was exactly what I needed. I also tried using a Gamma prior, but it didn’t help, and since one of my goals is to test whether the hierarchy is even necessary, it wasn’t ideal anyway.
@Dekermanjian: I fixed alpha just to keep the example simple; in my actual model, it’s inferred too.
bwengals April 10, 2025, 9:30pm 7
The prior on the negative binomial overdispersion parameter can be a bit dicey. Take a look at this. The penalized complexity (PC) prior has worked well for me. There’s an implementation in this gist.
nrieger April 11, 2025, 8:00am 8
Thanks @bwengals! I actually came across that Stan page on prior recommendations earlier; it helped me fix my initial issue with alpha (I ended up using 1/sqrt(alpha), which samples much more robustly now). Your implementation of the PC prior looks really interesting too! I’ll definitely give it a try!
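For anyone finding this later, the 1/sqrt(alpha) trick looks roughly like this (a sketch; the HalfNormal prior and its scale are illustrative assumptions, not my actual choice):

import pymc as pm

with pm.Model() as m:
    # Prior on 1/sqrt(alpha) rather than on alpha itself (per the Stan recommendations)
    sqrt_inv_alpha = pm.HalfNormal("sqrt_inv_alpha", 1.0)  # illustrative prior
    alpha = pm.Deterministic("alpha", 1 / sqrt_inv_alpha**2)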
nrieger April 14, 2025, 11:18am 9
hey @bwengals I just went through the Simpson paper on the PC prior and your notebook, really fantastic stuff! I think I’ve got the main idea: you’re using KL divergence to measure how far an over-dispersed model (with dispersion parameter \alpha) deviates from a base Poisson model. That makes sense to me.
Where I’m getting a bit stuck is in how you set the two free parameters that define the PC prior. In your notebook, you write:
\alpha \sim 1 / \text{Weibull}(a=0.5,\; b=1/\lambda^2)
I follow that part, but I’m not sure how you determine \lambda. You define it as:
\lambda = -\log(\epsilon) / d_{KL}(L^{-1})
Could you help me interpret what \epsilon and L represent in this context? From your inset plot, it seems like L might set the mode of \alpha, and \epsilon controls how much prior mass falls below that mode, but that’s just my guess.
How do you typically choose \epsilon and L in practice, based on your data or modeling goals? Any tips or references would be super helpful!
nrieger April 15, 2025, 4:15pm 10
OK, I think I got it, and my initial guess was correct. Rereading the original paper, I see that the parameters L and \epsilon match up with U and \alpha in the Simpson paper. So then, according to Equation (3.2),
P(\alpha > L) = \epsilon \quad \implies \quad \exp(-\lambda L) = \epsilon \quad \implies \quad \lambda = -\frac{\ln(\epsilon)}{L}.
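For example (purely illustrative numbers), taking L = 30 and \epsilon = 0.05 gives \lambda = -\ln(0.05)/30 \approx 0.10.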
Dan Simpson et al.'s paper is great. Here’s the final version (we’re linking a draft in the Stan priors recommendations).
The basic idea is that you have a simple model and a complex model; you scale them both to constant variance, then interpolate between them and rescale for the actual variance. The problem for me is figuring out the math to get the scaling of the base distributions right.
I hadn’t realized someone had worked out the negative binomial vs. Poisson case. I always just penalize the overdispersion parameter, but the negative binomial is hard to fit because there are two ways to explain high values: a higher mean or more dispersion. That leads to a lot of correlation in the posteriors, even with a strong shrinkage prior.
bwengals April 23, 2025, 8:26pm 12
How do you typically choose ϵ and L in practice, based on your data or modeling goals? Any tips or references would be super helpful!
Def take a look at the paper Bob posted. The short answer is that how you set those parameters is entirely up to your discretion. This is the part where PC priors are not uninformative. You’ll have to understand your problem, know what \alpha is doing, do some prior predictive simulation and then set L and \epsilon accordingly. All the PC priors work this way.
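To make that concrete, a quick prior predictive simulation over the Weibull form quoted earlier might look like this (lam and the alpha < 1 threshold are illustrative assumptions, not recommendations):

import numpy as np

# Prior predictive check of alpha ~ 1 / Weibull(a=0.5, b=1/lam**2)
rng = np.random.default_rng(0)
lam = 3.0                                           # illustrative choice
w = (1 / lam**2) * rng.weibull(0.5, size=100_000)   # numpy's weibull draws are unit-scale
alpha_draws = 1.0 / w

print(np.quantile(alpha_draws, [0.05, 0.5, 0.95]))  # where the prior puts alpha
print((alpha_draws < 1.0).mean())                   # mass on strong overdispersion; exp(-lam) ≈ 0.05 here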
nrieger April 25, 2025, 7:38am 13
Thanks @bob-carpenter and @bwengals! I ended up figuring it out and was able to set meaningful values using P(\alpha < \alpha_0) = \epsilon. The model runs smoothly now, and it feels good to have chosen priors with solid theoretical grounding. Really appreciate the pointers!