Automatic differentiation package - torch.autograd — PyTorch 2.7 documentation

torch.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions.

It requires minimal changes to the existing code - you only need to declare Tensors for which gradients should be computed with the requires_grad=True keyword. As of now, we only support autograd for floating point Tensor types (half, float, double and bfloat16) and complex Tensor types (cfloat, cdouble).

backward Compute the sum of gradients of given tensors with respect to graph leaves.
grad Compute and return the sum of gradients of outputs with respect to the inputs.
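
For instance, a minimal sketch of torch.autograd.grad on a toy function:

import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
# grad returns a tuple with one entry per input
(grad_x,) = torch.autograd.grad(y, x)
print(torch.allclose(grad_x, 2 * x))  # True: d(sum(x^2))/dx = 2x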

Forward-mode Automatic Differentiation

Warning

This API is in beta. Even though the function signatures are very unlikely to change, improved operator coverage is planned before we consider this stable.

Please see the forward-mode AD tutorial for detailed steps on how to use this API.

forward_ad.dual_level Context-manager for forward AD, where all forward AD computation must occur within the dual_level context.
forward_ad.make_dual Associate a tensor value with its tangent to create a "dual tensor" for forward AD gradient computation.
forward_ad.unpack_dual Unpack a "dual tensor" to get both its Tensor value and its forward AD gradient.
forward_ad.enter_dual_level Enter a new forward grad level.
forward_ad.exit_dual_level Exit a forward grad level.
forward_ad.UnpackedDualTensor Namedtuple returned by unpack_dual() containing the primal and tangent components of the dual tensor.
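
Putting these pieces together, a minimal sketch of forward-mode AD computing a Jacobian-vector product through sin:

import torch
import torch.autograd.forward_ad as fwAD

primal = torch.randn(3)
tangent = torch.ones(3)
with fwAD.dual_level():
    dual = fwAD.make_dual(primal, tangent)
    out = dual.sin()
    # the tangent of `out` is the JVP: cos(primal) * tangent
    _, jvp = fwAD.unpack_dual(out)
    print(torch.allclose(jvp, primal.cos() * tangent))  # True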

Functional higher level API

Warning

This API is in beta. Even though the function signatures are very unlikely to change, major improvements to performance are planned before we consider this stable.

This section contains the higher-level API for autograd that builds on the basic API above and allows you to compute jacobians, hessians, etc.

This API works with user-provided functions that take only Tensors as input and return only Tensors. If your function takes other arguments that are not Tensors, or Tensors that don’t have requires_grad set, you can use a lambda to capture them. For example, for a function f that takes three inputs (a Tensor for which we want the jacobian, another tensor that should be considered constant, and a boolean flag) as f(input, constant, flag=flag), you can use it as functional.jacobian(lambda x: f(x, constant, flag=flag), input).
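
For instance, a minimal sketch of this lambda pattern (f, constant, and flag are the hypothetical names from the paragraph above):

import torch
from torch.autograd import functional

def f(x, constant, flag=False):
    # hypothetical function: only `x` participates in differentiation
    out = x * constant
    return out.exp() if flag else out

input = torch.randn(3)
constant = torch.randn(3)  # treated as a constant, not differentiated
jac = functional.jacobian(lambda x: f(x, constant, flag=True), input)
print(jac.shape)  # torch.Size([3, 3])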

functional.jacobian Compute the Jacobian of a given function.
functional.hessian Compute the Hessian of a given scalar function.
functional.vjp Compute the dot product between a vector v and the Jacobian of the given function at the point given by the inputs.
functional.jvp Compute the dot product between the Jacobian of the given function at the point given by the inputs and a vector v.
functional.vhp Compute the dot product between vector v and Hessian of a given scalar function at a specified point.
functional.hvp Compute the dot product between the scalar function's Hessian and a vector v at a specified point.
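
For example, a short sketch of functional.vjp on an elementwise function, where the Jacobian is diagonal:

import torch
from torch.autograd import functional

def f(x):
    return x ** 3

x = torch.randn(3)
v = torch.ones(3)
out, vjp = functional.vjp(f, x, v)  # computes v^T J_f(x)
print(torch.allclose(vjp, 3 * x ** 2))  # True: the Jacobian is diag(3x^2)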

Default gradient layouts

When a non-sparse param receives a non-sparse gradient during torch.autograd.backward() or torch.Tensor.backward(), param.grad is accumulated as follows.

If param.grad is initially None:

  1. If param’s memory is non-overlapping and dense, .grad is created with strides matching param (thus matching param’s layout).
  2. Otherwise, .grad is created with rowmajor-contiguous strides.

If param already has a non-sparse .grad attribute:

  3. If create_graph=False, backward() accumulates into .grad in-place, which preserves its strides.
  4. If create_graph=True, backward() replaces .grad with a new tensor .grad + new grad, which attempts (but does not guarantee) matching the preexisting .grad’s strides.

The default behavior (letting .grads be None before the first backward(), such that their layout is created according to 1 or 2, and retained over time according to 3 or 4) is recommended for best performance. Calls to model.zero_grad() or optimizer.zero_grad() will not affect .grad layouts.

In fact, resetting all .grads to None before each accumulation phase, e.g.:

for iterations...
    ...
    for param in model.parameters():
        param.grad = None
    loss.backward()

such that they’re recreated according to 1 or 2 every time, is a valid alternative to model.zero_grad() or optimizer.zero_grad() that may improve performance for some networks.

Manual gradient layouts

If you need manual control over .grad’s strides, assign param.grad = a zeroed tensor with desired strides before the first backward(), and never reset it to None. 3 guarantees your layout is preserved as long as create_graph=False. 4 indicates your layout is likely preserved even if create_graph=True.
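
As a sketch of this manual approach (the transposed layout below is just an arbitrary illustration):

import torch

param = torch.randn(4, 3, requires_grad=True)
# a zeroed gradient with a column-major layout: strides (1, 3) rather than (3, 1)
param.grad = torch.zeros(3, 4).t()
loss = (param * 2.0).sum()
loss.backward()  # rule 3: accumulates in-place, preserving the strides
print(param.grad.stride())  # (1, 3)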

In-place operations on Tensors

Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.

In-place correctness checks

All Tensors keep track of in-place operations applied to them, and if the implementation detects that a tensor was saved for backward in one of the functions, but it was modified in-place afterwards, an error will be raised once the backward pass is started. This ensures that if you’re using in-place functions and not seeing any errors, you can be sure that the computed gradients are correct.
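
For instance, a minimal sketch that trips this check:

import torch

a = torch.randn(3, requires_grad=True)
b = a.exp()   # exp() saves its output for use in the backward pass
b.add_(1)     # in-place modification bumps the tensor's version counter
b.sum().backward()  # raises RuntimeError: a variable needed for gradient
                    # computation has been modified by an inplace operation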

Variable (deprecated)

Warning

The Variable API has been deprecated: Variables are no longer necessary to use autograd with tensors. Autograd automatically supports Tensors with requires_grad set to True.

One can now create tensors with requires_grad=True using factory methods such as torch.randn(), torch.zeros(), torch.ones(), and others, like the following:

autograd_tensor = torch.randn((2, 3, 4), requires_grad=True)

Tensor autograd functions

torch.Tensor.grad This attribute is None by default and becomes a Tensor the first time a call to backward() computes gradients for self.
torch.Tensor.requires_grad Is True if gradients need to be computed for this Tensor, False otherwise.
torch.Tensor.is_leaf All Tensors that have requires_grad which is False will be leaf Tensors by convention.
torch.Tensor.backward([gradient, ...]) Computes the gradient of current tensor wrt graph leaves.
torch.Tensor.detach Returns a new Tensor, detached from the current graph.
torch.Tensor.detach_ Detaches the Tensor from the graph that created it, making it a leaf.
torch.Tensor.register_hook(hook) Registers a backward hook.
torch.Tensor.register_post_accumulate_grad_hook(hook) Registers a backward hook that runs after grad accumulation.
torch.Tensor.retain_grad() Enables this Tensor to have its grad populated during backward().
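
For instance, a short sketch of retain_grad() from the list above:

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2        # non-leaf: its .grad would normally stay None
y.retain_grad()
y.sum().backward()
print(y.grad)    # tensor([1., 1., 1.])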

Function

class torch.autograd.Function(*args, **kwargs)[source]

Base class to create custom autograd.Function.

To create a custom autograd.Function, subclass this class and implement the forward() and backward() static methods. Then, to use your custom op in the forward pass, call the class method apply. Do not call forward() directly.

To ensure correctness and best performance, make sure you are calling the correct methods on ctx and validating your backward function using torch.autograd.gradcheck().

See Extending torch.autograd for more details on how to use this class.

Examples:

class Exp(Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

Use it by calling the apply method:

output = Exp.apply(input)

Function.forward Define the forward of the custom autograd Function.
Function.backward Define a formula for differentiating the operation with backward mode automatic differentiation.
Function.jvp Define a formula for differentiating the operation with forward mode automatic differentiation.
Function.vmap Define the behavior for this autograd.Function underneath torch.vmap().

Context method mixins

When creating a new Function, the following methods are available to ctx.

function.FunctionCtx.mark_dirty Mark given tensors as modified in an in-place operation.
function.FunctionCtx.mark_non_differentiable Mark outputs as non-differentiable.
function.FunctionCtx.save_for_backward Save given tensors for a future call to backward().
function.FunctionCtx.set_materialize_grads Set whether to materialize grad tensors.
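
For example, a hedged sketch using several of these ctx methods (the ScaledAbs op and its names are hypothetical):

import torch
from torch.autograd import Function

class ScaledAbs(Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x)          # stash tensors needed by backward
        ctx.scale = scale                 # non-tensor state can live on ctx
        ctx.set_materialize_grads(False)  # allow None grad_outputs to skip work
        return x.abs() * scale

    @staticmethod
    def backward(ctx, grad_output):
        if grad_output is None:           # possible with materialize_grads off
            return None, None
        (x,) = ctx.saved_tensors
        return grad_output * ctx.scale * x.sign(), None  # no grad for scale

out = ScaledAbs.apply(torch.randn(4, requires_grad=True), 2.0)
out.sum().backward()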

Custom Function utilities

function.once_differentiable Decorator for backward method.

function.BackwardCFunction Base custom Function used to build PyTorch utilities.

Numerical gradient checking

gradcheck Check gradients computed via small finite differences against analytical gradients wrt tensors in inputs that are of floating point or complex type and with requires_grad=True.
gradgradcheck Check gradients of gradients computed via small finite differences against analytical gradients wrt tensors in inputs and grad_outputs that are of floating point or complex type and with requires_grad=True.
GradcheckError Error raised by gradcheck() and gradgradcheck().
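
A minimal sketch of a typical gradcheck call (double precision is recommended for the finite-difference comparison):

import torch
from torch.autograd import gradcheck

def f(x):
    return (x ** 2).sum()

x = torch.randn(4, dtype=torch.double, requires_grad=True)
print(gradcheck(f, (x,)))  # True if analytical and numerical gradients agree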

Profiler

Autograd includes a profiler that lets you inspect the cost of different operators inside your model, both on the CPU and GPU. There are three modes implemented at the moment: CPU-only using profile, nvprof-based (registers both CPU and GPU activity) using emit_nvtx, and VTune profiler-based using emit_itt.

class torch.autograd.profiler.profile(enabled=True, *, use_cuda=False, use_device=None, record_shapes=False, with_flops=False, profile_memory=False, with_stack=False, with_modules=False, use_kineto=False, use_cpu=True, experimental_config=None, acc_events=False, custom_trace_id_callback=None)[source]

Context manager that manages autograd profiler state and holds a summary of results.

Under the hood it just records events of functions being executed in C++ and exposes those events to Python. You can wrap any code into it and it will only report runtime of PyTorch functions. Note: the profiler is thread-local and is automatically propagated into async tasks.

Parameters

Example

x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    for _ in range(100):
        # any normal python code, really!
        y = x ** 2
        y.backward()

# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

-----------------------------------  ---------------  ---------------  ---------------
Name                                 Self CPU total   CPU time avg     Number of Calls
-----------------------------------  ---------------  ---------------  ---------------
mul                                  32.048ms         32.048ms         200
pow                                  27.041ms         27.041ms         200
PowBackward0                         9.727ms          55.483ms         100
torch::autograd::AccumulateGrad      9.148ms          9.148ms          100
torch::autograd::GraphRoot           691.816us        691.816us        100
-----------------------------------  ---------------  ---------------  ---------------


profiler.profile.export_chrome_trace Export an EventList as a Chrome tracing tools file.
profiler.profile.key_averages Averages all function events over their keys.
profiler.profile.self_cpu_time_total Returns total time spent on CPU.
profiler.profile.total_average Averages all events.
profiler.parse_nvprof_trace
profiler.EnforceUnique Raises an error if a key is seen more than once.
profiler.KinetoStepTracker Provides an abstraction for incrementing the step count globally.
profiler.record_function Context manager/function decorator that adds a label to a code block/function when running autograd profiler.
profiler_util.Interval
profiler_util.Kernel
profiler_util.MemRecordsAcc Acceleration structure for accessing mem_records in interval.
profiler_util.StringTable
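
For example, a short sketch of profiler.record_function from the list above, labeling a region inside a profile:

import torch

x = torch.randn(100, 100)
with torch.autograd.profiler.profile() as prof:
    with torch.autograd.profiler.record_function("my_block"):
        y = x @ x
print(prof.key_averages().table(sort_by="cpu_time_total"))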

class torch.autograd.profiler.emit_nvtx(enabled=True, record_shapes=False)[source]

Context manager that makes every autograd operation emit an NVTX range.

It is useful when running the program under nvprof:

nvprof --profile-from-start off -o trace_name.prof -- <regular command here>

Unfortunately, there’s no way to force nvprof to flush the data it collected to disk, so for CUDA profiling one has to use this context manager to annotate nvprof traces and wait for the process to exit before inspecting them. Then, either NVIDIA Visual Profiler (nvvp) can be used to visualize the timeline, or torch.autograd.profiler.load_nvprof() can load the results for inspection, e.g. in a Python REPL.

Parameters

Example

>>> with torch.cuda.profiler.profile():
...     model(x)  # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

Forward-backward correlation

When viewing a profile created using emit_nvtx in the Nvidia Visual Profiler, correlating each backward-pass op with the corresponding forward-pass op can be difficult. To ease this task, emit_nvtx appends sequence number information to the ranges it generates.

During the forward pass, each function range is decorated with seq=<N>. seq is a running counter, incremented each time a new backward Function object is created and stashed for backward. Thus, the seq=<N> annotation associated with each forward function range tells you that if a backward Function object is created by this forward function, the backward object will receive sequence number N. During the backward pass, the top-level range wrapping each C++ backward Function’s apply() call is decorated with stashed seq=<M>. M is the sequence number that the backward object was created with. By comparing stashed seq numbers in backward with seq numbers in forward, you can track down which forward op created each backward Function.

Any functions executed during the backward pass are also decorated with seq=<N>. During default backward (with create_graph=False) this information is irrelevant, and in fact, N may simply be 0 for all such functions. Only the top-level ranges associated with backward Function objects’ apply() methods are useful, as a way to correlate these Function objects with the earlier forward pass.

Double-backward

If, on the other hand, a backward pass with create_graph=True is underway (in other words, if you are setting up for a double-backward), each function’s execution during backward is given a nonzero, useful seq=<N>. Those functions may themselves create Function objects to be executed later during double-backward, just as the original functions in the forward pass did. The relationship between backward and double-backward is conceptually the same as the relationship between forward and backward: The functions still emit current-sequence-number-tagged ranges, the Function objects they create still stash those sequence numbers, and during the eventual double-backward, the Function objects’ apply() ranges are still tagged with stashed seq numbers, which can be compared to seq numbers from the backward pass.

class torch.autograd.profiler.emit_itt(enabled=True, record_shapes=False)[source]

Context manager that makes every autograd operation emit an ITT range.

It is useful when running the program under Intel(R) VTune Profiler:

vtune <--vtune-flags> <regular command here>

The Instrumentation and Tracing Technology (ITT) API enables your application to generate and control the collection of trace data during its execution across different Intel tools. This context manager is used to annotate Intel(R) VTune Profiler traces. With the help of this context manager, you will be able to see labeled ranges in the Intel(R) VTune Profiler GUI.

Parameters

Example

>>> with torch.autograd.profiler.emit_itt():
...     model(x)

Debugging and anomaly detection

class torch.autograd.detect_anomaly(check_nan=True)[source]

Context-manager that enables anomaly detection for the autograd engine.

This does two things:

  1. Running the forward pass with detection enabled will allow the backward pass to print the traceback of the forward operation that created the failing backward function.
  2. If check_nan is True, any backward computation that generates a "nan" value will raise an error. Default True.

Warning

This mode should be enabled only for debugging as the different tests will slow down your program execution.

Example

>>> import torch
>>> from torch import autograd
>>> class MyFunc(autograd.Function):
...     @staticmethod
...     def forward(ctx, inp):
...         return inp.clone()
...     @staticmethod
...     def backward(ctx, gO):
...         # Error during the backward pass
...         raise RuntimeError("Some error in backward")
...         return gO.clone()
>>> def run_fn(a):
...     out = MyFunc.apply(a)
...     return out.sum()
>>> inp = torch.rand(10, 10, requires_grad=True)
>>> out = run_fn(inp)
>>> out.backward()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/your/pytorch/install/torch/_tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/your/pytorch/install/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/your/pytorch/install/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "<stdin>", line 8, in backward
RuntimeError: Some error in backward
>>> with autograd.detect_anomaly():
...     inp = torch.rand(10, 10, requires_grad=True)
...     out = run_fn(inp)
...     out.backward()
Traceback of forward call that caused the error:
  File "tmp.py", line 53, in <module>
    out = run_fn(inp)
  File "tmp.py", line 44, in run_fn
    out = MyFunc.apply(a)
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/your/pytorch/install/torch/_tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/your/pytorch/install/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/your/pytorch/install/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "<stdin>", line 8, in backward
RuntimeError: Some error in backward

class torch.autograd.set_detect_anomaly(mode, check_nan=True)[source]

Context-manager that sets the anomaly detection for the autograd engine on or off.

set_detect_anomaly will enable or disable the autograd anomaly detection based on its argument mode. It can be used as a context-manager or as a function.

See detect_anomaly above for details of the anomaly detection behaviour.

Parameters

Autograd graph

Autograd exposes methods that allow one to inspect the graph and interpose behavior during the backward pass.

The grad_fn attribute of a torch.Tensor holds a torch.autograd.graph.Node if the tensor is the output of an operation that was recorded by autograd (i.e., grad_mode is enabled and at least one of the inputs required gradients), or None otherwise.

graph.Node.name Return the name.
graph.Node.metadata Return the metadata.
graph.Node.next_functions
graph.Node.register_hook Register a backward hook.
graph.Node.register_prehook Register a backward pre-hook.
graph.increment_version Update autograd metadata tracking whether the given Tensor was modified in place.
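
For instance, a minimal sketch of graph.Node.register_prehook from the list above (the hook receives the gradients flowing into the node):

import torch

a = torch.randn(3, requires_grad=True)
b = a * 2

def prehook(grad_outputs):
    # grad_outputs is a tuple of gradients flowing into this Node
    print("incoming grad:", grad_outputs[0])

b.grad_fn.register_prehook(prehook)
b.sum().backward()  # prints tensor([1., 1., 1.])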

Some operations need intermediary results to be saved during the forward pass in order to execute the backward pass. These intermediary results are saved as attributes on the grad_fn and can be accessed. For example:

>>> a = torch.tensor([0., 0., 0.], requires_grad=True)
>>> b = a.exp()
>>> print(isinstance(b.grad_fn, torch.autograd.graph.Node))
True
>>> print(dir(b.grad_fn))
['__call__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_raw_saved_result', '_register_hook_dict', '_saved_result', 'metadata', 'name', 'next_functions', 'register_hook', 'register_prehook', 'requires_grad']
>>> print(torch.allclose(b.grad_fn._saved_result, b))
True

You can also define how these saved tensors should be packed / unpacked using hooks. A common application is to trade compute for memory by saving those intermediary results to disk or to CPU instead of leaving them on the GPU. This is especially useful if you notice your model fits on GPU during evaluation, but not training. Also see Hooks for saved tensors.

class torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook)[source]

Context-manager that sets a pair of pack / unpack hooks for saved tensors.

Use this context-manager to define how intermediary results of an operation should be packed before saving, and unpacked on retrieval.

In that context, the pack_hook function will be called every time an operation saves a tensor for backward (this includes intermediary results saved using save_for_backward() but also those recorded by a PyTorch-defined operation). The output of pack_hook is then stored in the computation graph instead of the original tensor.

The unpack_hook is called when the saved tensor needs to be accessed, namely when executing torch.Tensor.backward() or torch.autograd.grad(). It takes as argument the packed object returned by pack_hook and should return a tensor which has the same content as the original tensor (passed as input to the corresponding pack_hook).

The hooks should have the following signatures:

pack_hook(tensor: Tensor) -> Any

unpack_hook(Any) -> Tensor

where the return value of pack_hook is a valid input to unpack_hook.

In general, you want unpack_hook(pack_hook(t)) to be equal to t in terms of value, size, dtype and device.

Example:

>>> def pack_hook(x):
...     print("Packing", x)
...     return x
>>>
>>> def unpack_hook(x):
...     print("Unpacking", x)
...     return x
>>>
>>> a = torch.ones(5, requires_grad=True)
>>> b = torch.ones(5, requires_grad=True) * 2
>>> with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
...     y = a * b
Packing tensor([1., 1., 1., 1., 1.], requires_grad=True)
Packing tensor([2., 2., 2., 2., 2.], grad_fn=<MulBackward0>)
>>> y.sum().backward()
Unpacking tensor([1., 1., 1., 1., 1.], requires_grad=True)
Unpacking tensor([2., 2., 2., 2., 2.], grad_fn=<MulBackward0>)

Warning

Performing an in-place operation on the input to either hook may lead to undefined behavior.

Warning

Only one pair of hooks is allowed at a time. When recursively nesting this context-manager, only the inner-most pair of hooks will be applied.

class torch.autograd.graph.save_on_cpu(pin_memory=False, device_type='cuda')[source]

Context manager under which tensors saved by the forward pass will be stored on CPU, then retrieved for backward.

When performing operations within this context manager, intermediary results saved in the graph during the forward pass will be moved to CPU, then copied back to the original device when needed for the backward pass. If the graph was already on CPU, no tensor copy is performed.

Use this context-manager to trade compute for GPU memory usage (e.g. when your model doesn’t fit in GPU memory during training).

Parameters

pin_memory (bool) – If True tensors will be saved to CPU pinned memory during packing and copied to GPU asynchronously during unpacking. Defaults to False. Also see Use pinned memory buffers.

Example:

>>> a = torch.randn(5, requires_grad=True, device="cuda")
>>> b = torch.randn(5, requires_grad=True, device="cuda")
>>> c = torch.randn(5, requires_grad=True, device="cuda")
>>>
>>> def f(a, b, c):
...     prod_1 = a * b           # a and b are saved on GPU
...     with torch.autograd.graph.save_on_cpu():
...         prod_2 = prod_1 * c  # prod_1 and c are saved on CPU
...     y = prod_2 * a           # prod_2 and a are saved on GPU
...     return y
>>>
>>> y = f(a, b, c)
>>> del a, b, c  # for illustration only
>>> # the content of a, b, and prod_2 are still alive on GPU
>>> # the content of prod_1 and c only live on CPU
>>> y.sum().backward()  # all CPU tensors are moved back to GPU, for backward
>>> # all intermediary tensors are released (deleted) after the call to backward

class torch.autograd.graph.disable_saved_tensors_hooks(error_message)[source]

Context-manager that disables the saved tensors default hooks feature.

Useful if you are creating a feature that does not work with saved tensors default hooks.

Parameters

error_message (str) – When saved tensors default hooks are used while they are disabled, a RuntimeError with this error message gets raised.

Return type

Generator[None, None, None]

Example:

>>> message = "saved tensors default hooks are disabled"
>>> with torch.autograd.graph.disable_saved_tensors_hooks(message):
...     # Raises RuntimeError: saved tensors default hooks are disabled
...     with torch.autograd.graph.save_on_cpu():
...         pass

class torch.autograd.graph.register_multi_grad_hook(tensors, fn, *, mode='all')[source]

Register a multi-grad backward hook.

There are two supported modes: "all" and "any".

Under the "all" mode, the hook will be called after gradients with respect to every tensor in tensors have been computed. If a tensor is in tensors but is not part of the graph, or if a tensor is not needed to compute the gradients for any inputs specified for the current .backward() or .grad() call, this tensor will be ignored and the hook will not wait for its gradient to be computed.

After every non-ignored tensor’s gradient has been computed, fn will be called with those gradients. None will be passed for tensors that did not have their gradients computed.

Under the "any" mode, the hook will be called after the first gradient with respect to a tensor in tensors has been computed. The hook will be called with that gradient as its argument.

The hook should not modify its arguments.

This function returns a handle with a method handle.remove() that removes the hook.

Note

See Backward Hooks execution for more information on how and when this hook is executed, and how its execution is ordered relative to other hooks.

Example:

>>> import torch
>>>
>>> a = torch.rand(2, 3, requires_grad=True)
>>> b = torch.rand(2, 3, requires_grad=True)
>>> c = a * b
>>> d = a * b
>>>
>>> def fn(grads):
...     print([g is not None for g in grads])
...
>>> torch.autograd.graph.register_multi_grad_hook((a, b, c, d), fn)
>>>
>>> c.sum().backward(retain_graph=True)
[True, True, True, False]
>>> c.sum().backward(inputs=(a,), retain_graph=True)
[True, False, True, False]

Return type

RemovableHandle

class torch.autograd.graph.allow_mutation_on_saved_tensors[source]

Context manager under which mutating tensors saved for backward is allowed.

Under this context manager, tensors saved for backward are cloned on mutation, so the original version can still be used during backward. Normally, mutating a tensor saved for backward will result in an error raised when it’s used during backward.

To ensure the correct behavior, both the forward and backward should be run under the same context manager.

Returns

An _AllowMutationOnSavedContext object storing the state managed by this context manager. This object can be useful for debugging purposes. The state managed by the context manager is automatically cleared upon exiting.

Return type

Generator[_AllowMutationOnSavedContext, None, None]

Example:

>>> import torch
>>> with torch.autograd.graph.allow_mutation_on_saved_tensors():
...     # forward
...     a = torch.ones(2, 3, requires_grad=True)
...     b = a.clone()
...     out = (b**2).sum()
...     b.sin_()
...     # backward
...     out.sum().backward()
...
tensor([[0.8415, 0.8415, 0.8415],
        [0.8415, 0.8415, 0.8415]], grad_fn=<SinBackward0>)

class torch.autograd.graph.GradientEdge(node, output_nr)[source]

Object representing a given gradient edge within the autograd graph.

To get the gradient edge where a given Tensor gradient will be computed, you can do edge = autograd.graph.get_gradient_edge(tensor).

torch.autograd.graph.get_gradient_edge(tensor)[source]

Get the gradient edge for computing the gradient of the given Tensor.

In particular, calling g = autograd.grad(loss, input) is equivalent to calling g = autograd.grad(loss, get_gradient_edge(input)).

Return type

GradientEdge
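
For instance, a minimal sketch of the equivalence described above:

import torch
from torch.autograd.graph import get_gradient_edge

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()
g1, = torch.autograd.grad(loss, x, retain_graph=True)
g2, = torch.autograd.grad(loss, get_gradient_edge(x))
print(torch.allclose(g1, g2))  # True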