torch.nn.functional.scaled_dot_product_attention — PyTorch 2.4 documentation

torch.nn.functional.scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None) → Tensor:

Computes scaled dot product attention on query, key and value tensors, using an optional attention mask if passed, and applying dropout if a probability greater than 0.0 is specified. The optional scale argument can only be specified as a keyword argument.
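For orientation, a minimal call might look like the following sketch (the tensor shapes are chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

# Batch of 2, 4 heads, sequence length 8, head dimension 16.
query = torch.randn(2, 4, 8, 16)
key = torch.randn(2, 4, 8, 16)
value = torch.randn(2, 4, 8, 16)

# Causal attention, no dropout; the output has the same shape as query.
out = F.scaled_dot_product_attention(query, key, value, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 8, 16])
```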

Efficient implementation equivalent to the following:

import math
import torch

def scaled_dot_product_attention(query, key, value, attn_mask=None,
        dropout_p=0.0, is_causal=False, scale=None) -> torch.Tensor:
    L, S = query.size(-2), key.size(-2)
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    attn_bias = torch.zeros(L, S, dtype=query.dtype)
    if is_causal:
        assert attn_mask is None
        temp_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=0)
        attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
        attn_bias.to(query.dtype)

    if attn_mask is not None:
        if attn_mask.dtype == torch.bool:
            attn_bias.masked_fill_(attn_mask.logical_not(), float("-inf"))
        else:
            attn_bias += attn_mask
    attn_weight = query @ key.transpose(-2, -1) * scale_factor
    attn_weight += attn_bias
    attn_weight = torch.softmax(attn_weight, dim=-1)
    attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
    return attn_weight @ value
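As a sanity check, the reference formulation can be compared against the built-in function. This sketch omits masks and dropout and uses CPU float32 tensors, so the two results should agree to within a small tolerance:

```python
import math
import torch
import torch.nn.functional as F

def sdpa_reference(query, key, value, is_causal=False, scale=None):
    # Follows the reference formulation above (masks and dropout omitted).
    L, S = query.size(-2), key.size(-2)
    scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
    attn_bias = torch.zeros(L, S, dtype=query.dtype)
    if is_causal:
        temp_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=0)
        attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
    attn_weight = query @ key.transpose(-2, -1) * scale_factor
    attn_weight += attn_bias
    attn_weight = torch.softmax(attn_weight, dim=-1)
    return attn_weight @ value

torch.manual_seed(0)
q = torch.randn(1, 2, 5, 8)
k = torch.randn(1, 2, 5, 8)
v = torch.randn(1, 2, 5, 8)

ref = sdpa_reference(q, k, v, is_causal=True)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(ref, out, atol=1e-5))
```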

Warning

This function is beta and subject to change.

Warning

This function always applies dropout according to the specified dropout_p argument. To disable dropout during evaluation, be sure to pass a value of 0.0 when the module that makes the function call is not in training mode.

For example:

class MyModel(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, ...):
        return F.scaled_dot_product_attention(...,
            dropout_p=(self.p if self.training else 0.0))

Note

There are currently three supported implementations of scaled dot product attention:

    FlashAttention-2: Faster and More Memory-Efficient Exact Attention with Better Work Partitioning
    Memory-Efficient Attention
    A PyTorch implementation defined in C++ matching the above formulation

The function may call optimized kernels for improved performance when using the CUDA backend. For all other backends, the PyTorch implementation will be used.

All implementations are enabled by default. Scaled dot product attention attempts to select the optimal implementation automatically based on the inputs. To provide finer-grained control over which implementation is used, functions are provided for enabling and disabling implementations; the torch.nn.attention.sdpa_kernel() context manager is the preferred mechanism:

Each of the fused kernels has specific input limitations. If the user requires the use of a specific fused implementation, disable the PyTorch C++ implementation using torch.nn.attention.sdpa_kernel(). In the event that a fused implementation is not available, a warning will be raised with the reasons why the fused implementation cannot run.

Due to the nature of fusing floating point operations, the output of this function may differ depending on which backend kernel is chosen. The C++ implementation supports torch.float64 and can be used when higher precision is required. For more information please see Numerical accuracy.

Note

In some circumstances when given tensors on a CUDA device and using CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting torch.backends.cudnn.deterministic = True. See Reproducibility for more information.
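A short sketch of the determinism switches mentioned above (the warn_only flag is optional and makes ops without deterministic implementations warn rather than error):

```python
import torch

# Ask cuDNN to pick deterministic algorithms (may cost performance).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# A stricter, library-wide switch is also available.
torch.use_deterministic_algorithms(True, warn_only=True)
print(torch.are_deterministic_algorithms_enabled())  # True
```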

Parameters

query (Tensor) – Query tensor; shape (N, ..., L, E).
key (Tensor) – Key tensor; shape (N, ..., S, E).
value (Tensor) – Value tensor; shape (N, ..., S, Ev).
attn_mask (optional Tensor) – Attention mask; shape must be broadcastable to the shape of the attention weights, (N, ..., L, S). Two types of masks are supported: a boolean mask where a value of True indicates that the element should take part in attention, or a float mask of the same dtype as query, key, value that is added to the attention score.
dropout_p (float) – Dropout probability; if greater than 0.0, dropout is applied.
is_causal (bool) – If set to True, the attention masking is a lower triangular matrix when the mask is a square matrix. An error is thrown if both attn_mask and is_causal are set.
scale (optional float, keyword-only) – Scaling factor applied prior to softmax. If None, the default value is set to 1 / sqrt(E).

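The two supported attn_mask flavors, boolean and additive float, produce the same result when they encode the same masking; a small sketch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 6, 8)
v = torch.randn(1, 2, 6, 8)

# Boolean mask: True means the position takes part in attention.
bool_mask = torch.rand(4, 6) > 0.3
bool_mask[:, 0] = True  # keep every softmax row finite (avoid all -inf)

# Equivalent additive float mask: 0 where attended, -inf where masked.
float_mask = torch.zeros(4, 6)
float_mask.masked_fill_(bool_mask.logical_not(), float("-inf"))

out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
out_float = F.scaled_dot_product_attention(q, k, v, attn_mask=float_mask)
print(torch.allclose(out_bool, out_float))
```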
Returns

Attention output; shape (N, ..., L, Ev).

Return type

output (Tensor)

Shape legend:

N: batch size
...: any number of other batch dimensions (optional)
S: source sequence length
L: target sequence length
E: embedding dimension of the query and key
Ev: embedding dimension of the value

Examples

Optionally use the context manager to ensure one of the fused kernels is run

query = torch.rand(32, 8, 128, 64, dtype=torch.float16, device="cuda")
key = torch.rand(32, 8, 128, 64, dtype=torch.float16, device="cuda")
value = torch.rand(32, 8, 128, 64, dtype=torch.float16, device="cuda")
with torch.backends.cuda.sdp_kernel(enable_math=False):
    F.scaled_dot_product_attention(query, key, value)