This document is relevant for: Inf2, Trn1, Trn2

NKI API Common Fields#

Supported Data Types#

Supported Data Types by NKI below lists all data types supported by NKI. Almost all NKI APIs accept a data type field, dtype, which can be either a NumPy equivalent type or a nki.language data type.

Table 4 Supported Data Types by NKI#

| Category | Data Type | Accepted dtype Field by NKI APIs |
|---|---|---|
| Integer | 8-bit unsigned integer | nki.language.uint8 or numpy.uint8 |
| | 8-bit signed integer | nki.language.int8 or numpy.int8 |
| | 16-bit unsigned integer | nki.language.uint16 or numpy.uint16 |
| | 16-bit signed integer | nki.language.int16 or numpy.int16 |
| | 32-bit unsigned integer | nki.language.uint32 or numpy.uint32 |
| | 32-bit signed integer | nki.language.int32 or numpy.int32 |
| Float | float8_e4m3 (1S,4E,3M) | nki.language.float8_e4m3 |
| | float8_e5m2 (1S,5E,2M) | nki.language.float8_e5m2 |
| | float16 (1S,5E,10M) | nki.language.float16 or numpy.float16 |
| | bfloat16 (1S,8E,7M) | nki.language.bfloat16 |
| | tfloat32 (1S,8E,10M) | nki.language.tfloat32 |
| | float32 (1S,8E,23M) | nki.language.float32 or numpy.float32 |
| Boolean | boolean stored as uint8 | nki.language.bool_ or numpy.bool |
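Both flavors of the dtype field are interchangeable in practice. Below is a minimal sketch (assuming a kernel context where nki.language.zeros allocates an on-chip tile):

import numpy as np
import neuronxcc.nki.language as nl

# Allocate zero-filled tiles; the dtype field accepts either flavor.
a = nl.zeros((128, 512), dtype=nl.bfloat16)  # NKI-only data type
b = nl.zeros((128, 512), dtype=np.float32)   # NumPy equivalent data type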

Supported Math Operators for NKI ISA#

Supported Math Operators by NKI ISA below lists all the mathematical operator primitives supported by NKI. Many nki.isa APIs (instructions) allow programmable operators through the op field. The supported operators fall into two categories: bitvec and arithmetic. In general, instructions using bitvec operators expect integer data types and treat input elements as bit patterns, while instructions using arithmetic operators accept any valid NKI data type and convert input elements into float32 before performing the operation.

Table 5 Supported Math Operators by NKI ISA#

| Category | Operator | op | Legal Reduction op | Supported Engine |
|---|---|---|---|---|
| Bitvec | Bitwise Not | nki.language.invert | N | Vector |
| | Bitwise And | nki.language.bitwise_and | Y | Vector |
| | Bitwise Or | nki.language.bitwise_or | Y | Vector |
| | Bitwise Xor | nki.language.bitwise_xor | Y | Vector |
| | Arithmetic Shift Left | nki.language.left_shift | N | Vector |
| | Arithmetic Shift Right | Not supported | N | Vector |
| | Logical Shift Left | nki.language.left_shift | N | Vector |
| | Logical Shift Right | nki.language.right_shift | N | Vector |
| Arithmetic | Add | nki.language.add | Y | Vector/GpSIMD/Scalar |
| | Subtract | nki.language.subtract | Y | Vector |
| | Multiply | nki.language.multiply | Y | Vector/GpSIMD/Scalar |
| | Max | nki.language.maximum | Y | Vector |
| | Min | nki.language.minimum | Y | Vector |
| | Is Equal to | nki.language.equal | N | Vector |
| | Is Not Equal to | nki.language.not_equal | N | Vector |
| | Is Greater than or Equal to | nki.language.greater_equal | N | Vector |
| | Is Greater than | nki.language.greater | N | Vector |
| | Is Less than or Equal to | nki.language.less_equal | N | Vector |
| | Is Less than | nki.language.less | N | Vector |
| | Logical Not | nki.language.logical_not | N | Vector |
| | Logical And | nki.language.logical_and | Y | Vector |
| | Logical Or | nki.language.logical_or | Y | Vector |
| | Logical Xor | nki.language.logical_xor | Y | Vector |
| | Reverse Square Root | nki.language.rsqrt | N | GpSIMD/Scalar |
| | Reciprocal | nki.language.reciprocal | N | Vector/Scalar |
| | Absolute | nki.language.abs | N | Vector/Scalar |
| | Power | nki.language.power | N | GpSIMD |

Note: Add and Multiply are supported on the Scalar Engine only from NeuronCore-v3 onward. 32-bit integer Add and Multiply are supported only on the GpSIMD Engine.
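To illustrate the op field, the sketch below (assuming two on-chip input tiles x and y, with integer dtypes for the bitvec case) passes operators from the table into nki.isa.tensor_tensor:

import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa

# Arithmetic operator: inputs are converted to float32 before the add.
sum_tile = nisa.tensor_tensor(x, y, op=nl.add)

# Bitvec operator: inputs are treated as bit patterns.
xor_tile = nisa.tensor_tensor(x, y, op=nl.bitwise_xor)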

Supported Activation Functions for NKI ISA#

Supported Activation Functions by NKI ISA below lists all the activation functions supported by the nki.isa.activation API. These activation functions are approximated with piecewise polynomials on the Scalar Engine. Note: if input values fall outside the Valid Input Range listed below, the Scalar Engine will generate invalid output results.

Table 6 Supported Activation Functions by NKI ISA#

| Function Name | Accepted op by Scalar Engine | Valid Input Range |
|---|---|---|
| Identity | nki.language.copy or numpy.copy | [-inf, inf] |
| Square | nki.language.square or numpy.square | [-inf, inf] |
| Sigmoid | nki.language.sigmoid | [-inf, inf] |
| Relu | nki.language.relu | [-inf, inf] |
| Gelu | nki.language.gelu | [-inf, inf] |
| Gelu Derivative | nki.language.gelu_dx | [-inf, inf] |
| Gelu with Tanh Approximation | nki.language.gelu_apprx_tanh | [-inf, inf] |
| Silu | nki.language.silu | [-inf, inf] |
| Silu Derivative | nki.language.silu_dx | [-inf, inf] |
| Tanh | nki.language.tanh or numpy.tanh | [-inf, inf] |
| Softplus | nki.language.softplus | [-inf, inf] |
| Mish | nki.language.mish | [-inf, inf] |
| Erf | nki.language.erf | [-inf, inf] |
| Erf Derivative | nki.language.erf_dx | [-inf, inf] |
| Exponential | nki.language.exp or numpy.exp | [-inf, inf] |
| Natural Log | nki.language.log or numpy.log | [2^-64, 2^64] |
| Sine | nki.language.sin or numpy.sin | [-PI, PI] |
| Arctan | nki.language.arctan or numpy.arctan | [-PI/2, PI/2] |
| Square Root | nki.language.sqrt or numpy.sqrt | [2^-100, 2^100] |
| Reverse Square Root | nki.language.rsqrt | [2^-87, 2^97] |
| Reciprocal | nki.language.reciprocal or numpy.reciprocal | ±[2^-42, 2^42] |
| Sign | nki.language.sign or numpy.sign | [-inf, inf] |
| Absolute | nki.language.abs or numpy.abs | [-inf, inf] |
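For example, a function from the table can be passed as the op input of nki.isa.activation. A minimal sketch (assuming an on-chip input tile in_tile):

import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa

# Gelu, approximated with piecewise polynomials on the Scalar Engine
gelu_tile = nisa.activation(op=nl.gelu, data=in_tile)

# Natural log; per the table, inputs should stay within [2^-64, 2^64]
log_tile = nisa.activation(op=nl.log, data=in_tile)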

NKI API Masking#

All nki.language and nki.isa APIs accept an optional input field, mask. The mask field is an execution predicate known at compile time, which informs the compiler to either skip generating the instruction or generate the instruction with a smaller input tile shape. Masking is handled completely by the Neuron Compiler and hence does not incur any performance overhead in the generated instructions.

The mask can be created using comparison expressions (e.g., a < b) or multiple comparison expressions concatenated with & (e.g., (a < b) & (c > d)). The left- or right-hand side expression of each comparator must be an affine expression of nki.language.arange(), nki.language.affine_range() or nki.language.program_id(). Each comparison expression should indicate which range of indices along one of the input tile axes is valid for the computation. For example, assume we have an input tile in_tile of shape (128, 512) and would like to perform a square operation on this tile for elements in [0:64, 0:256]; we can invoke the nki.language.square() API as follows:

import neuronxcc.nki.language as nl

...

i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]

out_tile = nl.square(in_tile, mask=((i_p < 64) & (i_f < 256)))

The Neuron Compiler will lower the above example into a hardware ISA instruction that processes only 64x256 elements.

The above mask definition works for most APIs, where there is only one input tile or both input tiles share the same axes. One exception is the nki.language.matmul API (and similarly nki.isa.nc_matmul), where the two input tiles lhs and rhs contain three unique axes:

  1. The contraction axis: the partition axis of both lhs and rhs (lhs_rhs_p)
  2. The first axis of the matmul output: the lhs free axis (lhs_f)
  3. The second axis of the matmul output: the rhs free axis (rhs_f)

As an example, let's assume we have a lhs tile of shape (sz_p, sz_m) and a rhs tile of shape (sz_p, sz_n), and we call nki.language.matmul to calculate an output tile of shape (sz_m, sz_n):

import neuronxcc.nki.language as nl

i_p = nl.arange(sz_p)[:, None]

i_lhs_f = nl.arange(sz_m)[None, :]
i_rhs_f = nl.arange(sz_n)[None, :]  # same as i_rhs_f = i_lhs_f

result = nl.matmul(lhs[i_p, i_lhs_f], rhs[i_p, i_rhs_f], transpose_x=True)

Since i_lhs_f and i_rhs_f are identical expressions to the Neuron Compiler, the compiler cannot distinguish the two input axes if they were passed into the mask field directly.

Therefore, we introduce an "operand masking" syntax for matmul APIs to let users precisely define the masking on the inputs to those APIs (currently only matmul APIs support operand masking; this is subject to change in future releases). Let's assume we need to constrain sz_m <= 64 and sz_n <= 256:

import neuronxcc.nki.language as nl

i_p = nl.arange(sz_p)[:, None]

i_lhs_f = nl.arange(sz_m)[None, :]
i_rhs_f = nl.arange(sz_n)[None, :]  # same as i_rhs_f = i_lhs_f

result = nl.matmul(lhs_T[i_lhs_f <= 64], rhs[i_rhs_f <= 256], transpose_x=True)

There are two notable use cases for masking:

  1. Handling tiling factors that don't divide the tensor dimension sizes
  2. Skipping ineffectual instructions that compute known output values

We present an example of the first use case below. Let's assume we would like to evaluate the exponential function on an input tensor of shape [sz_p, sz_f] from HBM. Since the input to nki.language.load/nki.language.store/nki.language.exp expects a tile with a partition axis size not exceeding nki.language.tile_size.pmax == 128, we should loop over the input tensor using a tile size of [nki.language.tile_size.pmax, sz_f].

However, sz_p is not guaranteed to be an integer multiple of nki.language.tile_size.pmax. In this case, one option is to write a loop with a trip count of sz_p // nki.language.tile_size.pmax, followed by a single invocation of nki.language.exp with an input tile of shape [sz_p % nki.language.tile_size.pmax, sz_f]. This effectively "unrolls" the last instance of the tile computation, which could lead to messy code in a complex kernel. Using masking allows us to avoid such unrolling, as illustrated in the example below:

import math

import neuronxcc.nki.language as nl
from torch_neuronx import nki_jit

@nki_jit
def tensor_exp_kernel_(in_tensor, out_tensor):
    sz_p, sz_f = in_tensor.shape

    i_f = nl.arange(sz_f)[None, :]

    trip_count = math.ceil(sz_p / nl.tile_size.pmax)

    for p in nl.affine_range(trip_count):
        # Generate tensor indices for the input/output tensors;
        # pad the index to pmax, for simplicity
        i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]

        # Load input data from external memory to on-chip memory;
        # only read up to sz_p
        in_tile = nl.load(in_tensor[i_p, i_f], mask=(i_p < sz_p))

        # Perform the computation
        out_tile = nl.exp(in_tile, mask=(i_p < sz_p))

        # Store the results back to external memory;
        # only write up to sz_p
        nl.store(out_tensor[i_p, i_f], value=out_tile, mask=(i_p < sz_p))

NKI Type Promotion#

When the data types (dtypes) of inputs to an arithmetic operation (e.g., add, multiply, tensor_tensor) differ, we promote the dtypes according to the rules below:

(float, integer): Pick the float type. Example: (np.float16, np.int32) -> np.float16

(float, float): Pick the wider float type or a new widened type that fits the values range. Examples: (nl.float8_e4m3, np.float16) -> np.float16; (nl.bfloat16, np.float16) -> np.float32

(int, int): Pick the wider type or a new widened type that fits the values range. Examples: (np.int8, np.int32) -> np.int32; (np.uint8, np.int8) -> np.int16

The output of the arithmetic operation will get the promoted type by default.

Note: The Vector Engine internally performs most of the computation in FP32 (see Vector Engine) and casts the output back to the specified type.

x = np.ndarray((N, M), dtype=nl.float8_e4m3)
y = np.ndarray((N, M), dtype=np.float16)

z = nl.add(x, y)  # calculation done in FP32, output cast to np.float16
assert z.dtype == np.float16

To prevent the compiler from automatically widening output dtype based on mismatching input dtypes, you may explicitly set the output dtype in the arithmetic operation API. This would be useful if the output is passed into another operation that benefits from a smaller dtype.

x = np.ndarray((N, M), dtype=nl.bfloat16)
y = np.ndarray((N, M), dtype=np.float16)

# without explicit dtype, z.dtype would have been np.float32
z = nl.add(x, y, dtype=nl.bfloat16)
assert z.dtype == nl.bfloat16

Weakly Typed Scalar Type Inference#

Weakly typed scalars (scalar values where the type wasn't explicitly specified) will be inferred as the widest dtype supported by hardware: a Python int is inferred as a 32-bit integer, and a Python float as a 32-bit float.

Doing an arithmetic operation with a scalar may therefore result in a larger output type than expected.
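For instance (a minimal sketch; per the inference rule above, the Python literal 2 is assumed to be inferred as a 32-bit integer):

x = np.ndarray((N, M), dtype=np.int16)
y = 2  # weakly typed scalar, inferred as np.int32

z = nl.add(x, y)  # (np.int16, np.int32) promotes to the wider type
assert z.dtype == np.int32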

To prevent larger dtypes from being inferred from weak scalar types, do either of the following:

  1. Explicitly set the data type of the scalar, like np.float16(2), so that the output type is what you desire:

x = np.ndarray((N, M), dtype=np.float16)
y = np.float16(2)

z = nl.add(x, y)
assert z.dtype == np.float16

  2. Explicitly set the output dtype of the arithmetic operation:

x = np.ndarray((N, M), dtype=np.int16)
y = 2

z = nl.add(x, y, dtype=nl.bfloat16)
assert z.dtype == nl.bfloat16

Note: The Vector Engine internally performs most of the computation in FP32 (see Vector Engine) and casts the output back to the specific type.

NKI Engine Selection for Operators Supported on Multiple Engines#

There is a tradeoff between precision and speed across the different engines for operators with multiple engine options. Users can select which engine an operator maps to based on their needs. We take reciprocal and reverse square root as two examples and explain the tradeoffs below.

  1. Reciprocal can run on Scalar Engine or Vector Engine:

Reciprocal can run on the Vector Engine with nki.isa.reciprocal or on the Scalar Engine with nki.isa.activation(nl.reciprocal). The Vector Engine performs reciprocal at a higher precision than the Scalar Engine; however, the computation throughput of reciprocal on the Vector Engine is about 8x lower than on the Scalar Engine for large input tiles. For input tiles with a small number of elements per partition (fewer than 64, processed one per cycle), the instruction initiation interval (roughly 64 cycles) dominates performance, so the two engines perform comparably. In this case, we suggest using the Vector Engine to achieve better precision.

Estimated cycles on different engines:

| Cost (Engine Cycles) | Condition |
|---|---|
| max(MIN_II, N) | mapped to Scalar Engine (nki.isa.scalar_engine) |
| max(MIN_II, 8*N) | mapped to Vector Engine (nki.isa.vector_engine) |

where N is the number of elements per partition in the input tile, and MIN_II is the minimum instruction initiation interval for small input tiles (roughly 64 engine cycles).

Note: nki.isa.activation(op=nl.reciprocal) doesn't support setting bias on NeuronCore-v2.
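A minimal sketch of the two reciprocal mappings (assuming an on-chip input tile data):

import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa

# Vector Engine: higher precision, ~8x lower throughput for large tiles
recip_precise = nisa.reciprocal(data)

# Scalar Engine: lower precision, higher throughput for large tiles
recip_fast = nisa.activation(op=nl.reciprocal, data=data)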

  2. Reverse square root can run on GpSIMD Engine or Scalar Engine:

Reverse square root can run on the GpSIMD Engine with nki.isa.tensor_scalar(op0=nl.rsqrt, operand0=0.0) or on the Scalar Engine with nki.isa.activation(nl.rsqrt). The GpSIMD Engine performs reverse square root at a higher precision than the Scalar Engine; however, its computation throughput for reverse square root is 4x lower than the Scalar Engine's.
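Similarly, a minimal sketch of the two reverse square root mappings (assuming an on-chip input tile data):

import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa

# GpSIMD Engine: higher precision, 4x lower throughput
rsqrt_precise = nisa.tensor_scalar(data, op0=nl.rsqrt, operand0=0.0)

# Scalar Engine: lower precision, higher throughput
rsqrt_fast = nisa.activation(op=nl.rsqrt, data=data)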


This document is relevant for: Inf2, Trn1, Trn2