Black box NUTS sampler with multidimensional parameters (original) (raw)
March 31, 2025, 12:48pm 1
Hi,
I wanted to develop a hierarchical model with a black box likelihood function. Initially I used the emcee package. However, I had difficulties with burning in so many parameters (about 100). So I wanted to try the NUTS implementation instead. However, I have already implemented the logliklihood function in such a way that it is usable for emcee and I do not want to rewrite everything for pymc, which seems to be possible using this tutorial Using a “black box” likelihood function — PyMC example gallery. However, my logliklihood function always requires a large vector containing all the parameters and this tutorial always assumes scalar values. I could get it to work with vector variables for the non-gradient version but not for the gradient version.
Do you have any suggestions on what I need to change to make it work with vectors instead of scalars?
My current not working Version looks as follows:
Using a “black box” likelihood function — PyMC example gallery
import pymc as pm
from pytensor.graph import Apply, Op
import pytensor.tensor as pt
from typing import List
import pytensor
from scipy.optimize import approx_fprime
def final_log_liklihood(parameter):
return combined_log_liklihood(parameter)[0] #My liklihood function
class LogLike(Op):
def make_node(self, m, data) -> Apply:
m = pt.as_tensor(m)
data = pt.as_tensor(data)
inputs = [m, data] # Include data
outputs = [pt.scalar(dtype='float64')]
return Apply(self, inputs, outputs)
def perform(self, node: Apply, inputs: List[np.ndarray], outputs: List[List[None]]) -> None:
# This is the method that compute numerical output
# given numerical inputs. Everything here is numpy arrays
m, data = inputs # this will contain my variables
# call our numpy log-likelihood function
loglike_eval = final_log_liklihood(m)
# Save the result in the outputs list provided by PyTensor
# There is one list per output, each containing another list
# pre-populated with a `None` where the result should be saved.
outputs[0][0] = np.asarray(loglike_eval)
def grad(
self, inputs: List[pt.TensorVariable], g: List[pt.TensorVariable]
) -> List[pt.TensorVariable]:
# NEW!
# the method that calculates the gradients - it actually returns the vector-Jacobian product
m, data = inputs
grad_wrt_m = loglikegrad_op(m, data)
# out_grad is a tensor of gradients of the Op outputs wrt to the function cost
[out_grad] = g
return [
pt.sum(out_grad * grad_wrt_m),
pytensor.gradient.grad_not_implemented(self, 1, data),
]
def finite_differences_loglike(m, data, eps=1e-7):
"""
Calculate the partial derivatives of a function at a set of values. The
derivatives are calculated using scipy approx_fprime function.
Parameters
----------
m, c: array_like
The parameters of the function for which we wish to define partial derivatives
Returns
-------
grad_wrt_m: array_like
Partial derivative wrt to the m parameter
"""
def inner_func(m, data):
return final_log_liklihood(m, data)
grad_wrt_m = approx_fprime(m, inner_func, eps, data)
return grad_wrt_m
class LogLikeGrad(Op):
def make_node(self, m, data) -> Apply:
m = pt.as_tensor(m)
data = pt.as_tensor(data)
inputs = [m, data]
# There are two outputs with the same type as data,
# for the partial derivatives wrt to m, c
outputs = [pt.scalar(dtype='float64'), pt.scalar(dtype='float64')]
return Apply(self, inputs, outputs)
def perform(self, node: Apply, inputs: List[np.ndarray], outputs: List[List[None]]) -> None:
m, data = inputs
# calculate gradients
grad_wrt_m = finite_differences_loglike(m, data)
outputs[0][0] = grad_wrt_m
loglike_op = LogLike()
loglikegrad_op = LogLikeGrad()
def custom_dist_loglike(data, m):
# data, or observed is always passed as the first input of CustomDist
return loglike_op(m, data)
test_out = loglike_op(np.random.uniform(0, 1, len(all_labels)), [])
pytensor.dprint(test_out, print_type=True)
with pm.Model():
# set priors on model gradient and y-intercept
params = pm.Uniform("p", lower=0, upper=1, shape=len(all_labels))
# create custom distribution
pm.CustomDist('likelihood', params,observed=[] ,logp=custom_dist_loglike)
# sample from the distribution
trace = pm.sample(1000)
Note: i’m working with Python 3.8.8
-i acually don’t need to pass any observed data. This is already handled inside the logliklihood function but i could not get the non gradient Version to work without any data variable. Therefore i made it an empty list.
The current error i get is:
ValueError: LogLike.grad returned a term with 0 dimensions, but 1 are required.
Thanks and best regards
Jan
Emcee doesn’t scale well in dimension because it’s underlyingly a form of random-walk Metropolis (with a fiendishly clever ensemble-based adaptive proposal covariance whose expectation is equal to the target covariance).
How complicated is your target density? Do you have analytic gradients or are you relying on finite differences? Finite differences are really slow (they require at least one log density eval per dimension) and really inaccurate (you lose about half the digits of precision in gradients).
Jan-L April 2, 2025, 8:26am 3
I sadly can’t provide a analytic gradient function as the function is to complicated. I will need to use some sort of numerical method.
Is the function complicated, or is the algorithm to computing the function complicated? I lost a lot of time on a project recently because I focused too much on the literal computations that were being done, instead of focusing on the problem itself. It turned out I was able to solve for analytical adjoints for the problem itself, and just ignore the specific algorithm.
Jan-L April 2, 2025, 1:21pm 5
The basic idea is that i have a measurement of that data points and a complicated algorithm that gives a predictions for these certain data points using a certain set of parameters, where the relation ship between the prediction and the parameters is to complicated to give an pure analytic expression. However, the function should be rather smooth.To evaluate how good the prediction is i use standard sqaure error and some sigma, i.e. I assume that it is normal distributed.
Jan-L April 3, 2025, 8:26am 6
However, due to the fact that this is a hirachical model, I think I can speed up the calculation of the gradient drastically. The idea would be that in order to calculate the gradient, the algorithm simply varies one parameter after another, and as this is a hirachical model, in most cases this will only affect one submodel. As a result, I can simply cache the results of the other models and don’t have to calculate them again. In this way I effectively only need to run my simulation for all the submodels as often as I have parameters for each submodel, which is much less (5). (I have about 20 submodels)
Are you thinking about Gibbs or random walk Metropolis? In a sampler like HMC, all the parameters change at once.
Jan-L April 7, 2025, 8:15am 8
No i was thinking about the numeric approximation with approx_fprime
of the gradients, which is needed for HMC. For the step itself the caching doesn’t save any time.