nki.benchmark — AWS Neuron Documentation
This document is relevant for: Inf2, Trn1, Trn2
nki.benchmark#
nki.benchmark(kernel=None, **kwargs)[source]#
Benchmark a NKI kernel on a NeuronDevice by using nki.benchmark as a decorator. You must run this API on a Trn/Inf instance with NeuronDevices (v2 or beyond) attached, and with aws-neuronx-tools installed on the host using the following steps:
on Ubuntu
sudo apt-get install aws-neuronx-tools=2.* -y
on Amazon Linux
sudo yum install aws-neuronx-tools-2.* -y
You may specify a path to save your NEFF file through the input parameter save_neff_name, and a path to save your NTFF file through save_trace_name. See Profiling NKI kernels with Neuron Profile for more information on how to visualize the execution trace for profiling purposes.
Note
Similar to nki.baremetal, the decorated function using nki.benchmark expects numpy.ndarray as input/output tensors instead of ML framework tensor objects.
In addition to generating NEFF/NTFF files, this decorator also invokes neuron-bench to collect execution latency statistics of the NEFF file and prints the statistics to the console. neuron-bench is a tool that launches the NEFF file on a NeuronDevice in a loop to collect end-to-end latency statistics. You may specify the number of warm-up iterations to exclude from benchmarking through the input parameter warmup, and the number of benchmarking iterations through iters. Currently, nki.benchmark only supports benchmarking on a single NeuronCore, since NKI does not yet support collective compute. Note that neuron-bench measures not only the device latency but also the time taken to transfer data between host and device. However, the tool does not rely on any ML framework to launch the NEFF and therefore reports NEFF latency without any framework overhead.
Parameters:
- warmup – The number of iterations for warmup execution (10 by default).
- iters – The number of iterations for benchmarking (100 by default).
- save_neff_name – Save the compiled NEFF file if a name is specified (unspecified by default).
- save_trace_name – Save the trace (profile) file if a name is specified (unspecified by default); at the moment, this requires save_neff_name to be unspecified or set to ‘file.neff’.
- additional_compile_opt – Additional Neuron compiler flags to pass in when compiling the kernel.
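For illustration, extra flags can be forwarded to the Neuron compiler through additional_compile_opt. The sketch below is hypothetical: the kernel, its name, and the flag string are placeholders, not options documented here.

from neuronxcc.nki import benchmark
import neuronxcc.nki.language as nl

# Hypothetical example: "--example-flag" is a placeholder for real Neuron compiler options.
@benchmark(warmup=5, iters=50, additional_compile_opt="--example-flag")
def copy_kernel(in_tensor):
    out_tensor = nl.ndarray(in_tensor.shape, dtype=in_tensor.dtype, buffer=nl.shared_hbm)
    data = nl.load(in_tensor)
    nl.store(out_tensor, data)
    return out_tensor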
Returns:
A function object that wraps the decorated function. A property benchmark_result.nc_latency is available after invocation. Calling get_latency_percentile(int) on the property returns the specified percentile latency in microseconds (us). Available percentiles: [0, 1, 10, 25, 50, 90, 99, 100]
Listing 12 An Example#
from neuronxcc.nki import benchmark
import neuronxcc.nki.language as nl
import numpy as np

@benchmark(warmup=10, iters=100, save_neff_name='file.neff', save_trace_name='profile.ntff')
def nki_tensor_tensor_add(a_tensor, b_tensor):
    c_tensor = nl.ndarray(a_tensor.shape, dtype=a_tensor.dtype, buffer=nl.shared_hbm)

    a = nl.load(a_tensor)
    b = nl.load(b_tensor)

    c = a + b

    nl.store(c_tensor, c)

    return c_tensor

a = np.zeros([128, 1024], dtype=np.float32)
b = np.random.random_sample([128, 1024]).astype(np.float32)
c = nki_tensor_tensor_add(a, b)

metrics = nki_tensor_tensor_add.benchmark_result.nc_latency
print("latency.p50 = " + str(metrics.get_latency_percentile(50)))
print("latency.p99 = " + str(metrics.get_latency_percentile(99)))
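As a follow-up to the listing above, the same metrics object can be queried for any of the supported percentiles, for example:

# Print the full latency distribution reported by neuron-bench (values in microseconds).
metrics = nki_tensor_tensor_add.benchmark_result.nc_latency
for p in [0, 1, 10, 25, 50, 90, 99, 100]:
    print("latency.p" + str(p) + " = " + str(metrics.get_latency_percentile(p)))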
Note
nki.benchmark does not use the actual inputs passed into the benchmarked function when running the NEFF file. For instance, in the above example, the output c tensor is undefined and should not be used for numerical accuracy checks.
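If numerical checks are needed, one option is to run the same kernel body under nki.baremetal (mentioned above), which does execute with the provided inputs. A minimal sketch, assuming baremetal is importable from neuronxcc.nki alongside benchmark and can be used as a plain decorator:

from neuronxcc.nki import baremetal
import neuronxcc.nki.language as nl
import numpy as np

# Assumption: baremetal runs the kernel with the actual input data, so the returned
# output can be compared against a NumPy reference.
@baremetal
def nki_tensor_tensor_add_check(a_tensor, b_tensor):
    c_tensor = nl.ndarray(a_tensor.shape, dtype=a_tensor.dtype, buffer=nl.shared_hbm)
    c = nl.load(a_tensor) + nl.load(b_tensor)
    nl.store(c_tensor, c)
    return c_tensor

a = np.random.random_sample([128, 1024]).astype(np.float32)
b = np.random.random_sample([128, 1024]).astype(np.float32)
c = nki_tensor_tensor_add_check(a, b)
assert np.allclose(c, a + b)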
This document is relevant for: Inf2, Trn1, Trn2