nvmath.fft.compile_epilog — NVIDIA nvmath-python

nvmath.fft.compile_epilog(epilog_fn, element_dtype, user_info_dtype, compute_capability=None)

Compile a Python function to LTO-IR to provide as an epilog function for fft() and plan().

Parameters:

epilog_fn – The epilog function to be compiled to LTO-IR. It must have the signature: epilog_fn(data_out, offset, data, user_info, reserved_for_future_use). Its job is to store the transformed data into data_out at offset.
element_dtype – The data type of the data argument, one of ['float32', 'float64', 'complex64', 'complex128']. It must have the same data type as that of the FFT operand for prolog functions or the FFT result for epilog functions.
user_info_dtype –
The data type of the user_info argument. It must be one of ['float32', 'float64', 'complex64', 'complex128'] or an object of type numba.types.Type. The offset is computed from the memory layout (shape and strides) of the operand (the input for prolog, the output for epilog). To pass an additional tensor as user_info and access it using the offset, you need to know the memory layout of the operand. Note that the actual layout of the input tensor may differ from the layout of the tensor passed to the fft call. To learn the memory layout of the input or output, use the stateful FFT API and nvmath.fft.FFT.get_input_layout() or nvmath.fft.FFT.get_output_layout(), respectively.
Note
Currently, in the callback, the position of an element in the input and output operands is described by a single flat offset, even if the original operand is a multi-dimensional tensor.
compute_capability – The target compute capability, specified as a string ('80', '89', …). The default is the compute capability of the current device.
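To make the flat-offset convention concrete, the following host-only sketch converts between a multi-dimensional index and a flat offset for a given shape and strides. It assumes strides are counted in elements and that the layout is dense with positive strides; `flat_offset` and `unravel_offset` are illustrative helpers, not part of nvmath-python.

```python
def flat_offset(index, strides):
    """Flat element offset of a multi-dimensional index (strides in elements)."""
    return sum(i * s for i, s in zip(index, strides))


def unravel_offset(offset, strides):
    """Recover the multi-dimensional index from a flat offset.

    Assumption: dense, non-overlapping layout with positive strides.
    """
    # Peel off the axes from the largest stride to the smallest.
    order = sorted(range(len(strides)), key=lambda ax: strides[ax], reverse=True)
    index = [0] * len(strides)
    for ax in order:
        index[ax], offset = divmod(offset, strides[ax])
    return tuple(index)


# A (4, 8) tensor stored row-major and column-major (hypothetical layouts).
row_major_strides = (8, 1)
col_major_strides = (1, 4)

# The same logical element (2, 3) lands at different flat offsets.
off_row = flat_offset((2, 3), row_major_strides)
off_col = flat_offset((2, 3), col_major_strides)
```

The same logical element maps to different offsets under different layouts, which is why a tensor passed as user_info can only be indexed correctly by offset if its layout matches the one the offset was computed for.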

Returns:

The function compiled to LTO-IR, as a bytes object.

Examples

The cuFFT library expects the end user to manage scaling of the outputs, so in order to replicate the norm option found in other Python FFT libraries we can define an epilog which performs the scaling.

>>> import cupy as cp
>>> import nvmath
>>> import math

Create the data for a batched 1-D FFT.

>>> B, N = 256, 1024
>>> a = cp.random.rand(B, N, dtype=cp.float64) + 1j * cp.random.rand(B, N, dtype=cp.float64)

Compute a normalization factor that will create unitary transforms.

>>> norm_factor = 1.0 / math.sqrt(N)

Define the epilog function for the FFT.

>>> def rescale(data_out, offset, data, user_info, unused):
...     data_out[offset] = data * norm_factor
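As a rough mental model of the callback contract, the sketch below emulates on the host how an epilog with this signature is applied: once per output element, with the flat offset and the transformed value. This is only an illustration in plain Python; the real epilog is compiled to LTO-IR and invoked on the device, and `apply_epilog_on_host` is a hypothetical helper, not an nvmath-python function.

```python
import math

N = 8
norm_factor = 1.0 / math.sqrt(N)


def rescale(data_out, offset, data, user_info, unused):
    # Same body as the epilog above: scale the value and store it at offset.
    data_out[offset] = data * norm_factor


def apply_epilog_on_host(epilog_fn, fft_values, user_info=None):
    """Hypothetical helper: call epilog_fn once per output element."""
    data_out = [0j] * len(fft_values)
    for offset, value in enumerate(fft_values):
        epilog_fn(data_out, offset, value, user_info, None)
    return data_out


# Stand-in for raw (unscaled) FFT output values.
raw = [complex(k, -k) for k in range(N)]
scaled = apply_epilog_on_host(rescale, raw)
```

The epilog is responsible for the store itself: an element that the callback does not write to data_out is left unset in this emulation.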

Compile the epilog to LTO-IR. In a system with GPUs that have different compute capabilities, the compute_capability option must be specified to the compile_prolog or compile_epilog helpers. Alternatively, the epilog can be compiled in the context of the device on which the FFT that uses the epilog is executed. In this case we use the current device context, where the operands have been created.

>>> with cp.cuda.Device():
...     epilog = nvmath.fft.compile_epilog(rescale, "complex128", "complex128")

Perform the forward FFT, applying the rescaling as an epilog.

>>> r = nvmath.fft.fft(a, axes=[-1], epilog=dict(ltoir=epilog))

Test that the result of the fused FFT matches that of other libraries.

>>> s = cp.fft.fftn(a, axes=[-1], norm="ortho")
>>> assert cp.allclose(r, s)
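The comparison against norm="ortho" works because scaling a forward transform by 1/sqrt(N) makes it unitary, so the signal energy is preserved. The sketch below checks this with a naive pure-Python DFT on a tiny input; it is a CPU-only illustration with made-up data, and the `dft` helper is written here for the example rather than taken from any library.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) forward DFT, unscaled (the same convention cuFFT uses)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


x = [complex(1, 0), complex(0, 2), complex(-1, 1), complex(3, -1)]
N = len(x)
norm_factor = 1.0 / math.sqrt(N)

unscaled = dft(x)
scaled = [v * norm_factor for v in unscaled]

energy_in = sum(abs(v) ** 2 for v in x)
energy_unscaled = sum(abs(v) ** 2 for v in unscaled)
energy_scaled = sum(abs(v) ** 2 for v in scaled)
```

Without the scaling, the output energy is N times the input energy; with the 1/sqrt(N) factor the two agree, which is the property the rescale epilog restores.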

Notes

The user must ensure that the specified argument types meet the requirements listed above.