nki.isa.activation - AWS Neuron Documentation

This document is relevant for: Inf2, Trn1, Trn2

nki.isa.activation

nki.isa.activation(op, data, *, bias=None, scale=1.0, reduce_op=None, reduce_res=None, reduce_cmd=reduce_cmd.idle, mask=None, dtype=None, **kwargs)

Apply an activation function on every element of the input tile using Scalar Engine. The activation function is specified in the op input field (see Supported Activation Functions for NKI ISA for a list of supported activation functions and their valid input ranges).

The activation instruction can optionally multiply the input data by a scalar or vector scale and then add a vector bias before the activation function is applied, at no additional performance cost:

\[output = f_{act}(data * scale + bias)\]

When the scale is a scalar, it must be a compile-time constant. In this case, the scale is broadcast to all elements of the input data tile. When the scale or bias is a vector, it must have the same partition axis size as the input data tile and exactly one element per partition. In this case, the scale/bias element in each partition is broadcast to all elements of the input data tile in the same partition.
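The broadcast rules above can be modeled in plain Python, independent of NKI. This is a minimal reference sketch, not the Neuron implementation: activation_ref is a hypothetical helper, tiles are nested lists indexed [partition][free], and a vector scale or bias is a list with one value per partition.

```python
def activation_ref(op, data, bias=None, scale=1.0):
    """Hypothetical reference model of output = op(data * scale + bias).

    data  : list of rows, one row per partition
    scale : a float (applied to every element) or a list with one value per partition
    bias  : None or a list with one value per partition
    """
    num_partitions = len(data)
    if isinstance(scale, (int, float)):
        # Scalar scale: broadcast to every partition.
        scale = [float(scale)] * num_partitions
    if bias is None:
        bias = [0.0] * num_partitions
    if len(scale) != num_partitions or len(bias) != num_partitions:
        raise ValueError("vector scale/bias needs one element per partition")
    # Within partition p, scale[p] and bias[p] are broadcast along the free axis.
    return [[op(x * scale[p] + bias[p]) for x in row]
            for p, row in enumerate(data)]


tile = [[0.0, 1.0, 2.0],   # partition 0
        [3.0, 4.0, 5.0]]   # partition 1
squared = activation_ref(lambda v: v * v, tile, bias=[1.0, -1.0], scale=2.0)
```

Here the scalar scale 2.0 applies to the whole tile, while each partition adds its own bias value before squaring.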

There are 128 registers on the scalar engine for storing reduction results, corresponding to the 128 partitions of the input. The scalar engine can reduce along free dimensions without extra performance penalty, and store the result of reduction into these registers. The reduction is done after the activation function is applied.

\[output = f_{act}(data * scale + bias)\]

\[accu\_registers = reduce\_op(accu\_registers, reduce\_op(output, axis=free))\]

These registers are shared between activation and activation_accu calls, and their state can be controlled via the reduce_cmd parameter.

The current values stored in the registers can be read out by passing a tensor as the reduce_res argument. Reading out the accumulator incurs a small overhead.

Note that activation_accu can also change the state of the registers. It is the user's responsibility to ensure correct ordering. Mixing activation_accu and activation is not recommended when reduce_cmd is not set to idle.
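As an illustration of the accumulation formula, the following plain-Python sketch models the per-partition registers across two calls. It is not NKI code: activation_reduce_ref and the registers list are hypothetical, the reduction is assumed to be addition, and only 2 partitions are shown in place of 128.

```python
import operator
from functools import reduce


def activation_reduce_ref(op, data, registers, reduce_op=operator.add):
    """Hypothetical model: apply op, then fold each partition's row into its register."""
    output = [[op(x) for x in row] for row in data]
    for p, row in enumerate(output):
        # Reduce along the free axis first, then combine with the stored value.
        registers[p] = reduce_op(registers[p], reduce(reduce_op, row))
    return output


registers = [0.0, 0.0]   # one accumulator per partition, starting from zero
activation_reduce_ref(abs, [[-1.0, 2.0], [3.0, -4.0]], registers)
activation_reduce_ref(abs, [[5.0, -6.0], [-7.0, 8.0]], registers)
reduce_res = list(registers)   # models reading out the accumulated values
```

Because the registers persist between calls in this model, the second call adds to the sums left by the first, which is why call ordering matters when the state is shared.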

Note that the Scalar Engine always performs the math operations in float32 precision. The engine therefore automatically casts the input data tile to float32 before performing the multiply/add/activate specified in the activation instruction. The engine can also cast the float32 results into another output data type, specified by the dtype field, at no additional performance cost. If dtype is not specified, the Neuron Compiler sets the output data type of the instruction to the input data type of data. The scale parameter must have a float32 data type, while the bias parameter can be float32, float16, or bfloat16.
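The data-type flow described above can be approximated with the standard library's struct module. This is an illustrative sketch only: to_f32, to_f16, and activation_f32_ref are hypothetical helpers that round Python floats to the nearest float32 or float16 value, standing in for the engine's casts. bfloat16 is not covered because struct has no format code for it.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_f16(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def activation_f32_ref(op, x, scale=1.0, bias=0.0, out_cast=to_f16):
    """Hypothetical model: cast input to float32, do the math, cast the result."""
    x32 = to_f32(x)
    # Each intermediate is rounded to float32 to mimic float32 arithmetic.
    pre = to_f32(to_f32(x32 * to_f32(scale)) + to_f32(bias))
    return out_cast(to_f32(op(pre)))


# A float16 input is widened to float32, scaled, activated, and narrowed again.
result = activation_f32_ref(math.exp, to_f16(0.1), scale=2.0)
```

The output keeps only float16 precision, while the multiply, add, and activation in between are modeled at float32 precision.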

The input data tile can be an SBUF or PSUM tile. Similarly, the instruction can write the output tile into either SBUF or PSUM, which is specified using the buffer field. If not specified, nki.language.sbuf is selected by default.

Estimated instruction cost:

max(MIN_II, N) Scalar Engine cycles, where

Parameters:

Returns:

output tile of the activation instruction; layout: same as input data tile

Example:

import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as nisa
import numpy as np

# The *_tensor names below are assumed to be tensors passed to the enclosing NKI kernel.

##################################################################
# Example 1: perform exponential function on matrix a of shape (128, 1024)
##################################################################
a = nl.load(a_tensor)
activated_a = nisa.activation(op=nl.exp, data=a)
nl.store(a_act_tensor, activated_a)

##################################################################
# Example 2: perform the following operations on matrix b of shape (128, 512)
# using a single activation instruction:
# 1) compute np.square(b * 2.0 + c), where c is a bias vector
#    with one element per partition
# 2) cast the result of 1) into bfloat16
##################################################################
b = nl.load(b_tensor)
c = nl.load(c_tensor)
activated_b = nisa.activation(op=np.square, data=b, bias=c, scale=2.0, dtype=nl.bfloat16)
nl.store(b_act_tensor, activated_b)
