nki.baremetal — AWS Neuron Documentation

This document is relevant for: Inf2, Trn1, Trn2

nki.baremetal

nki.baremetal(kernel=None, **kwargs)

Compile and run an NKI kernel on a NeuronDevice without involving ML frameworks such as PyTorch and JAX. If you decorate your NKI kernel function with @nki.baremetal(...), you can call it directly, just like any other Python function. You must run this API on a Trn/Inf instance with NeuronDevices (v2 or beyond) attached.

Note

A function decorated with nki.baremetal expects numpy.ndarray objects as input/output tensors instead of ML framework tensor objects.
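As a small illustration of the note above, the sketch below coerces array-like data into numpy.ndarray objects before they would be handed to a decorated kernel. The helper name `as_kernel_input` is hypothetical and not part of NKI, and the contiguity step is a defensive assumption rather than a documented requirement.

```python
import numpy as np


def as_kernel_input(x, dtype=np.float32):
    """Hypothetical helper (not part of NKI): return x as a C-contiguous np.ndarray.

    Contiguity is enforced here only as a defensive assumption.
    """
    return np.ascontiguousarray(np.asarray(x, dtype=dtype))


# Plain Python lists and non-contiguous views both become ordinary ndarrays.
from_list = as_kernel_input([[1, 2], [3, 4]])
from_view = as_kernel_input(np.arange(6).reshape(2, 3).T)

# Framework tensors need their own conversion first, for example
# tensor.detach().cpu().numpy() for a PyTorch tensor, or np.asarray(x)
# for a JAX array.
print(type(from_list).__name__, from_list.dtype, from_view.shape)
```

The resulting arrays could then be passed as positional arguments to a kernel decorated with nki.baremetal.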

This decorator compiles the NKI kernel into an executable for NeuronDevices (NEFF) and also collects an execution trace (NTFF) by running the NEFF on the local NeuronDevice. See Profiling NKI kernels with Neuron Profile for more information on how to visualize the execution trace for profiling purposes.

Since nki.baremetal runs the compiled NEFF without invoking any ML framework, it is the fastest way to compile and run an NKI kernel standalone on a NeuronDevice. This makes the decorator useful for quickly iterating on an early implementation of an NKI kernel until it is functionally correct, before porting it to the ML framework and injecting the kernel into the full ML model. To iterate quickly on NKI kernel performance, NKI also provides the nki.benchmark decorator, which uses the same underlying mechanism as nki.baremetal but additionally collects latency statistics at different percentiles.

Parameters:

Returns:

None

Listing 14 An Example

from neuronxcc.nki import baremetal
import neuronxcc.nki.language as nl
import numpy as np

@baremetal(save_neff_name='file.neff', save_trace_name='profile.ntff')
def nki_tensor_tensor_add(a_tensor, b_tensor):
    c_tensor = nl.ndarray(a_tensor.shape, dtype=a_tensor.dtype, buffer=nl.shared_hbm)

    a = nl.load(a_tensor)
    b = nl.load(b_tensor)

    c = a + b

    nl.store(c_tensor, c)

    return c_tensor

a = np.zeros([128, 1024], dtype=np.float32)
b = np.random.random_sample([128, 1024]).astype(np.float32)
c = nki_tensor_tensor_add(a, b)

assert np.allclose(c, a + b)
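When iterating toward functional correctness, it can help to wrap the final comparison in a reusable check. The following is a minimal sketch, not part of NKI: `check_against_reference` and `numpy_add` are hypothetical names, and a NumPy function stands in for the kernel so that the sketch runs on a machine without a NeuronDevice.

```python
import numpy as np


def check_against_reference(kernel_fn, reference_fn, *inputs, rtol=1e-5, atol=1e-8):
    """Hypothetical helper: compare a kernel's output with a NumPy reference.

    Returns (ok, max_abs_err) so a failing iteration shows how far off it is.
    """
    actual = np.asarray(kernel_fn(*inputs))
    expected = np.asarray(reference_fn(*inputs))
    if actual.shape != expected.shape:
        raise AssertionError(f"shape mismatch: {actual.shape} vs {expected.shape}")
    # Compute the error in float64 so the comparison itself adds no rounding.
    diff = np.abs(actual.astype(np.float64) - expected.astype(np.float64))
    max_abs_err = float(diff.max()) if diff.size else 0.0
    ok = bool(np.allclose(actual, expected, rtol=rtol, atol=atol))
    return ok, max_abs_err


def numpy_add(a, b):
    """NumPy reference for an element-wise add kernel."""
    return a + b


# Stand-in (assumption): on a Trn/Inf instance, pass the function decorated
# with nki.baremetal here instead of the NumPy reference.
kernel_under_test = numpy_add

rng = np.random.default_rng(0)
a = rng.random((128, 1024), dtype=np.float32)
b = rng.random((128, 1024), dtype=np.float32)

ok, max_abs_err = check_against_reference(kernel_under_test, numpy_add, a, b)
print(ok, max_abs_err)
```

Returning the maximum absolute error alongside the pass/fail flag is a design choice for debugging: a large error usually points to a logic bug, while a tiny one suggests that only the tolerances need adjusting.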
