nki.isa.tensor_scalar — AWS Neuron Documentation

This document is relevant for: Inf2, Trn1, Trn2

nki.isa.tensor_scalar#

nki.isa.tensor_scalar(data, op0, operand0, reverse0=False, op1=None, operand1=None, reverse1=False, *, dtype=None, mask=None, engine=engine.unknown, **kwargs)[source]#

Apply up to two math operators to the input data tile, broadcasting scalar/vector operands along the free dimension, using the Vector, Scalar, or GpSimd Engine: (data <op0> operand0) <op1> operand1.

The input data tile can be an SBUF or PSUM tile. Both operand0 and operand1 can be SBUF or PSUM tiles of shape (data.shape[0], 1), i.e., vectors, or compile-time constant scalars.
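The broadcasting behavior can be sketched in plain NumPy (a reference model of the semantics only; `tensor_scalar_ref` is a hypothetical helper, not part of the NKI API):

```python
import numpy as np

# Hypothetical reference model for (data <op0> operand0) <op1> operand1.
# operand0/operand1 are either compile-time scalars or per-partition
# vectors of shape (data.shape[0], 1), broadcast along the free dimension.
def tensor_scalar_ref(data, op0, operand0, op1=None, operand1=None):
    out = op0(data, operand0)      # (P, 1) operand broadcasts over the free dim
    if op1 is not None:
        out = op1(out, operand1)
    return out

data = np.arange(6, dtype=np.float32).reshape(2, 3)  # small stand-in tile (2, 3)
vec = np.array([[10.0], [100.0]], dtype=np.float32)  # vector operand, shape (2, 1)

out = tensor_scalar_ref(data, np.multiply, vec, np.add, 2.5)
print(out.shape)  # (2, 3): output keeps the input tile's shape
```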

op1 and operand1 are optional but must be None (the default) when unused. Note that performing one operator has the same performance cost as performing two operators in a single instruction.

When the operators are non-commutative (e.g., subtract), the input ordering of each operator can be reversed through the reverse0 and reverse1 fields.
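A minimal NumPy sketch of what reversing means for a non-commutative operator (the `apply_op` helper is illustrative, not an NKI function):

```python
import numpy as np

# With reverse=False this models data <op> operand;
# with reverse=True it models operand <op> data.
def apply_op(op, data, operand, reverse=False):
    return op(operand, data) if reverse else op(data, operand)

x = np.array([3.0, 5.0])
print(apply_op(np.subtract, x, 1.0))                # x - 1.0 -> [2. 4.]
print(apply_op(np.subtract, x, 1.0, reverse=True))  # 1.0 - x -> [-2. -4.]
```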

The tensor_scalar instruction supports two types of operators: 1) bitvec operators (e.g., bitwise_and) and 2) arithmetic operators (e.g., add). See Supported Math Operators for NKI ISA for the full list of supported operators. The two operators, op0 and op1, in a tensor_scalar instruction must be of the same type (both bitvec or both arithmetic). If bitvec operators are used, the tensor_scalar instruction must run on Vector Engine. Also, the input/output data types must be integer types, and input elements are treated as bit patterns without any data type casting.

If arithmetic operators are used, the tensor_scalar instruction can run on the Vector, Scalar, or GpSimd Engine. However, each engine supports a limited set of arithmetic operators (see Supported Math Operators for NKI ISA), and the Scalar Engine on Trn2 supports only a subset of the operator combinations.

Also, arithmetic operators impose no restriction on the input/output data types, but the engine automatically casts input data types to float32 and performs the operators in float32 math. The float32 computation results are cast to the target data type specified in the dtype field before being written into the output tile, at no additional performance cost. If the dtype field is not specified, it defaults to the input tile's data type.
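A sketch of this dtype behavior in NumPy terms (the explicit `astype` calls model what the engine does implicitly; float16 is used here only as a portable stand-in for a narrow output dtype):

```python
import numpy as np

data = np.array([[1, 2], [3, 4]], dtype=np.int8)

f32 = data.astype(np.float32)   # inputs are upcast to float32 before compute
result = (f32 * 0.5) + 0.25     # both operators run in float32 math
out = result.astype(np.float16) # result is cast to the requested output dtype

print(out.dtype)  # float16
```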

Estimated instruction cost:

max(MIN_II, N) Vector or Scalar Engine cycles, where N is the number of elements per partition in data, and MIN_II is the minimum initiation interval for small input tiles.
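The cost formula can be worked through with a toy calculation (the MIN_II value of 64 cycles below is an assumed placeholder for illustration, not a documented constant):

```python
# Illustrative cost estimate: the instruction takes max(MIN_II, N) cycles,
# where N is the number of elements per partition in the input tile.
def estimated_cycles(n_elems_per_partition, min_ii):
    return max(min_ii, n_elems_per_partition)

# A (128, 512) tile has N = 512 elements per partition: cost is N-bound.
print(estimated_cycles(512, 64))  # -> 512
# A (128, 16) tile is small: cost is bounded below by MIN_II.
print(estimated_cycles(16, 64))   # -> 64
```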

Parameters:

data – the input data tile (SBUF or PSUM)
op0 – the first math operator (see Supported Math Operators for NKI ISA)
operand0 – a compile-time constant scalar, or a tile of shape (data.shape[0], 1)
reverse0 – reverse the input ordering of op0 (compute operand0 <op0> data)
op1 – the optional second math operator (default: None)
operand1 – a compile-time constant scalar, or a tile of shape (data.shape[0], 1); required when op1 is specified
reverse1 – reverse the input ordering of op1 (compute ... <op1> data)
dtype – the output data type (default: same as the input tile's data type)
mask – a compile-time constant predicate that controls whether the instruction is executed
engine – the engine to run the instruction on

Returns:

an output tile of (data <op0> operand0) <op1> operand1 computation

Example:

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
...

##################################################################
# Example 1: subtract 1.0 from all elements of tile a of
# shape (128, 512) and get the output tile in b
##################################################################
i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]

b = nisa.tensor_scalar(a[i_p, i_f], np.subtract, 1.0)

##################################################################
# Example 2: broadcast 1.0 into a shape of (128, 512) and subtract
# it with tile c to get output tile d
##################################################################
i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]

d = nisa.tensor_scalar(c[i_p, i_f], np.subtract, 1.0, reverse0=True)

##################################################################
# Example 3: broadcast multiply tile e with vector f and
# then broadcast add with scalar 2.5;
# tile e has a shape of (64, 1024) and vector f has a shape of (64, 1)
##################################################################
i_p_ef = nl.arange(64)[:, None]
i_f_e = nl.arange(1024)[None, :]
i_f_f = nl.arange(1)[None, :]

g = nisa.tensor_scalar(e[i_p_ef, i_f_e], op0=np.multiply, operand0=f[i_p_ef, i_f_f],
                       op1=np.add, operand1=2.5)
