Jacobians, Hessians, hvp, vhp, and more: composing function transforms — PyTorch Tutorials 2.7.0+cu126 documentation (original) (raw)

intermediate/jacobians_hessians

Run in Google Colab

Colab

Download Notebook

Notebook

View on GitHub

GitHub

Note

Click hereto download the full example code

Created On: Mar 15, 2023 | Last Updated: Apr 18, 2023 | Last Verified: Nov 05, 2024

Computing jacobians or hessians are useful in a number of non-traditional deep learning models. It is difficult (or annoying) to compute these quantities efficiently using PyTorch’s regular autodiff APIs (Tensor.backward(), torch.autograd.grad). PyTorch’sJAX-inspired function transforms APIprovides ways of computing various higher-order autodiff quantities efficiently.

Note

This tutorial requires PyTorch 2.0.0 or later.

Computing the Jacobian¶

import torch import torch.nn.functional as F from functools import partial _ = torch.manual_seed(0)

Let’s start with a function that we’d like to compute the jacobian of. This is a simple linear function with non-linear activation.

Let’s add some dummy data: a weight, a bias, and a feature vector x.

Let’s think of predict as a function that maps the input x from \(R^D \to R^D\). PyTorch Autograd computes vector-Jacobian products. In order to compute the full Jacobian of this \(R^D \to R^D\) function, we would have to compute it row-by-row by using a different unit vector each time.

torch.Size([16, 16]) tensor([-0.5956, -0.6096, -0.1326, -0.2295, 0.4490, 0.3661, -0.1672, -1.1190, 0.1705, -0.6683, 0.1851, 0.1630, 0.0634, 0.6547, 0.5908, -0.1308])

Instead of computing the jacobian row-by-row, we can use PyTorch’storch.vmap function transform to get rid of the for-loop and vectorize the computation. We can’t directly apply vmap to torch.autograd.grad; instead, PyTorch provides a torch.func.vjp transform that composes withtorch.vmap:

In a later tutorial a composition of reverse-mode AD and vmap will give us per-sample-gradients. In this tutorial, composing reverse-mode AD and vmap gives us Jacobian computation! Various compositions of vmap and autodiff transforms can give us different interesting quantities.

PyTorch provides torch.func.jacrev as a convenience function that performs the vmap-vjp composition to compute jacobians. jacrev accepts an argnumsargument that says which argument we would like to compute Jacobians with respect to.

Let’s compare the performance of the two ways to compute the jacobian. The function transform version is much faster (and becomes even faster the more outputs there are).

In general, we expect that vectorization via vmap can help eliminate overhead and give better utilization of your hardware.

vmap does this magic by pushing the outer loop down into the function’s primitive operations in order to obtain better performance.

Let’s make a quick function to evaluate performance and deal with microseconds and milliseconds measurements:

def get_perf(first, first_descriptor, second, second_descriptor): """takes torch.benchmark objects and compares delta of second vs first.""" faster = second.times[0] slower = first.times[0] gain = (slower-faster)/slower if gain < 0: gain =-1 final_gain = gain100 print(f" Performance delta: {final_gain:.4f} percent improvement with {second_descriptor} ")

And then run the performance comparison:

<torch.utils.benchmark.utils.common.Measurement object at 0x7fb8130512d0> compute_jac(xp) 1.38 ms 1 measurement, 500 runs , 1 thread <torch.utils.benchmark.utils.common.Measurement object at 0x7fb853ef66e0> jacrev(predict, argnums=2)(weight, bias, x) 402.84 us 1 measurement, 500 runs , 1 thread

Let’s do a relative performance comparison of the above with our get_perf function:

Performance delta: 70.7505 percent improvement with vmap

Furthermore, it’s pretty easy to flip the problem around and say we want to compute Jacobians of the parameters to our model (weight, bias) instead of the input

Reverse-mode Jacobian (`jacrev`) vs forward-mode Jacobian (`jacfwd`)¶

We offer two APIs to compute jacobians: jacrev and jacfwd:

jacrev uses reverse-mode AD. As you saw above it is a composition of ourvjp and vmap transforms.
jacfwd uses forward-mode AD. It is implemented as a composition of ourjvp and vmap transforms.

jacfwd and jacrev can be substituted for each other but they have different performance characteristics.

As a general rule of thumb, if you’re computing the jacobian of an \(R^N \to R^M\)function, and there are many more outputs than inputs (for example, \(M > N\)) thenjacfwd is preferred, otherwise use jacrev. There are exceptions to this rule, but a non-rigorous argument for this follows:

In reverse-mode AD, we are computing the jacobian row-by-row, while in forward-mode AD (which computes Jacobian-vector products), we are computing it column-by-column. The Jacobian matrix has M rows and N columns, so if it is taller or wider one way we may prefer the method that deals with fewer rows or columns.

First, let’s benchmark with more inputs than outputs:

torch.Size([2048, 32]) jacfwd time: <torch.utils.benchmark.utils.common.Measurement object at 0x7fb92ea5ebf0> jacfwd(predict, argnums=2)(weight, bias, x) 748.49 us 1 measurement, 500 runs , 1 thread jacrev time: <torch.utils.benchmark.utils.common.Measurement object at 0x7fb84f2d6380> jacrev(predict, argnums=2)(weight, bias, x) 8.34 ms 1 measurement, 500 runs , 1 thread

and then do a relative benchmark:

Performance delta: 1013.8115 percent improvement with jacrev

and now the reverse - more outputs (M) than inputs (N):

jacfwd time: <torch.utils.benchmark.utils.common.Measurement object at 0x7fb813051960> jacfwd(predict, argnums=2)(weight, bias, x) 6.79 ms 1 measurement, 500 runs , 1 thread jacrev time: <torch.utils.benchmark.utils.common.Measurement object at 0x7fb84f217460> jacrev(predict, argnums=2)(weight, bias, x) 468.16 us 1 measurement, 500 runs , 1 thread

and a relative performance comparison:

Performance delta: 1350.1642 percent improvement with jacfwd

Hessian computation with functorch.hessian¶

We offer a convenience API to compute hessians: torch.func.hessiani. Hessians are the jacobian of the jacobian (or the partial derivative of the partial derivative, aka second order).

This suggests that one can just compose functorch jacobian transforms to compute the Hessian. Indeed, under the hood, hessian(f) is simply jacfwd(jacrev(f)).

Note: to boost performance: depending on your model, you may also want to use jacfwd(jacfwd(f)) or jacrev(jacrev(f)) instead to compute hessians leveraging the rule of thumb above regarding wider vs taller matrices.

from torch.func import hessian

lets reduce the size in order not to overwhelm Colab. Hessians require

significant memory:

Din = 512 Dout = 32 weight = torch.randn(Dout, Din) bias = torch.randn(Dout) x = torch.randn(Din)

hess_api = hessian(predict, argnums=2)(weight, bias, x) hess_fwdfwd = jacfwd(jacfwd(predict, argnums=2), argnums=2)(weight, bias, x) hess_revrev = jacrev(jacrev(predict, argnums=2), argnums=2)(weight, bias, x)

Let’s verify we have the same result regardless of using hessian API or using jacfwd(jacfwd()).

Batch Jacobian and Batch Hessian¶

In the above examples we’ve been operating with a single feature vector. In some cases you might want to take the Jacobian of a batch of outputs with respect to a batch of inputs. That is, given a batch of inputs of shape (B, N) and a function that goes from \(R^N \to R^M\), we would like a Jacobian of shape (B, M, N).

The easiest way to do this is to use vmap:

weight shape = torch.Size([33, 31])

If you have a function that goes from (B, N) -> (B, M) instead and are certain that each input produces an independent output, then it’s also sometimes possible to do this without using vmap by summing the outputs and then computing the Jacobian of that function:

If you instead have a function that goes from \(R^N \to R^M\) but inputs that are batched, you compose vmap with jacrev to compute batched jacobians:

Finally, batch hessians can be computed similarly. It’s easiest to think about them by using vmap to batch over hessian computation, but in some cases the sum trick also works.

torch.Size([64, 33, 31, 31])

Computing Hessian-vector products¶

The naive way to compute a Hessian-vector product (hvp) is to materialize the full Hessian and perform a dot-product with a vector. We can do better: it turns out we don’t need to materialize the full Hessian to do this. We’ll go through two (of many) different strategies to compute Hessian-vector products: - composing reverse-mode AD with reverse-mode AD - composing reverse-mode AD with forward-mode AD

Composing reverse-mode AD with forward-mode AD (as opposed to reverse-mode with reverse-mode) is generally the more memory efficient way to compute a hvp because forward-mode AD doesn’t need to construct an Autograd graph and save intermediates for backward:

from torch.func import jvp, grad, vjp

def hvp(f, primals, tangents): return jvp(grad(f), primals, tangents)[1]

Here’s some sample usage.

If PyTorch forward-AD does not have coverage for your operations, then we can instead compose reverse-mode AD with reverse-mode AD:

def hvp_revrev(f, primals, tangents): _, vjp_fn = vjp(grad(f), *primals) return vjp_fn(*tangents)

result_hvp_revrev = hvp_revrev(f, (x,), (tangent,)) assert torch.allclose(result, result_hvp_revrev[0])

Total running time of the script: ( 0 minutes 10.613 seconds)

Gallery generated by Sphinx-Gallery