nki.isa.tensor_scalar — AWS Neuron Documentation
This document is relevant for: Inf2, Trn1, Trn2
nki.isa.tensor_scalar
nki.isa.tensor_scalar(data, op0, operand0, reverse0=False, op1=None, operand1=None, reverse1=False, *, dtype=None, mask=None, engine=engine.unknown, **kwargs)
Apply up to two math operators to the input data tile, broadcasting scalar/vector operands along the free dimension, using the Vector, Scalar, or GpSimd Engine: (data <op0> operand0) <op1> operand1.

The input data tile can be an SBUF or PSUM tile. Both operand0 and operand1 can be SBUF or PSUM tiles of shape (data.shape[0], 1), i.e., vectors, or compile-time constant scalars. op1 and operand1 are optional, but must be None (the default values) when unused. Note that performing one operator has the same performance cost as performing two operators in the instruction.
When the operators are non-commutative (e.g., subtract), we can reverse the ordering of the inputs for each operator through:
- reverse0=True: tmp_res = operand0 <op0> data
- reverse1=True: operand1 <op1> tmp_res
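For instance, a minimal sketch (assuming a hypothetical SBUF tile x of shape (128, 512) and the imports shown in the Example section below) that reverses both operators to compute 5.0 - (1.0 - x):

i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]
# reverse0=True makes x the rhs of op0: tmp_res = 1.0 - x
# reverse1=True makes tmp_res the rhs of op1: y = 5.0 - tmp_res
y = nisa.tensor_scalar(x[i_p, i_f], op0=np.subtract, operand0=1.0, reverse0=True,
                       op1=np.subtract, operand1=5.0, reverse1=True)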
The tensor_scalar instruction supports two types of operators: 1) bitvec operators (e.g., bitwise_and) and 2) arithmetic operators (e.g., add). See Supported Math Operators for NKI ISA for the full list of supported operators. The two operators, op0 and op1, in a tensor_scalar instruction must be of the same type (both bitvec or both arithmetic). If bitvec operators are used, the tensor_scalar instruction must run on the Vector Engine. Also, the input/output data types must be integer types, and input elements are treated as bit patterns without any data type casting.
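For example, a minimal sketch (assuming a hypothetical int32 SBUF tile m of shape (128, 512)) that keeps only the low byte of each element using a single bitvec operator on the Vector Engine:

i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]
# bitvec operator: elements are treated as raw bit patterns, no casting is performed
masked = nisa.tensor_scalar(m[i_p, i_f], op0=np.bitwise_and, operand0=0xFF,
                            engine=nisa.vector_engine)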
If arithmetic operators are used, the tensor_scalar instruction can run on the Vector, Scalar, or GpSimd Engine. However, each engine supports only a limited set of arithmetic operators (see :ref:`tbl-aluop`). The Scalar Engine on Trn2 only supports the following operator combinations (a sketch of the first combination follows this list):
- op0=np.multiply and op1=np.add
- op0=np.multiply and op1=None
- op0=np.add and op1=None
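A minimal sketch of the multiply-add combination on the Scalar Engine (assuming hypothetical tiles t of shape (128, 512) and scale of shape (128, 1)):

i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]
i_s = nl.arange(1)[None, :]
# op0=np.multiply with op1=np.add is one of the combinations the Trn2 Scalar Engine supports
out = nisa.tensor_scalar(t[i_p, i_f], op0=np.multiply, operand0=scale[i_p, i_s],
                         op1=np.add, operand1=3.0, engine=nisa.scalar_engine)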
Also, arithmetic operators impose no restriction on the input/output data types, but the engine automatically casts input data types to float32 and performs the operators in float32 math. The float32 computation results are cast to the target data type specified in the dtype field before being written into the output tile, at no additional performance cost. If the dtype field is not specified, it defaults to the data type of the input tile.
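For instance, a minimal sketch (assuming a hypothetical float32 tile h of shape (128, 512), and assuming float16 is among the supported output data types) where the float32 result is cast down on the way out:

i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]
# the multiply runs in float32; the result is cast to float16 before being written out
half = nisa.tensor_scalar(h[i_p, i_f], np.multiply, 0.5, dtype=np.float16)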
Estimated instruction cost:

max(MIN_II, N) Vector or Scalar Engine cycles, where:
- N is the number of elements per partition in data.
- MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.

For example, a (128, 512) input tile has N = 512 elements per partition, so the estimated cost is max(64, 512) = 512 engine cycles.
Parameters:
- data – the input tile
- op0 – the first math operator used with operand0 (see Supported Math Operators for NKI ISA for supported operators)
- operand0 – a scalar constant or a tile of shape (data.shape[0], 1), where data.shape[0] is the partition axis size of the input data tile
- reverse0 – reverse ordering of inputs to op0; if false, operand0 is the rhs of op0; if true, operand0 is the lhs of op0
- op1 – the second math operator used with operand1 (see Supported Math Operators for NKI ISA for supported operators); this operator is optional
- operand1 – a scalar constant or a tile of shape (data.shape[0], 1), where data.shape[0] is the partition axis size of the input data tile
- reverse1 – reverse ordering of inputs to op1; if false, operand1 is the rhs of op1; if true, operand1 is the lhs of op1
- dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it defaults to the data type of the input tile
- mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
- engine – (optional) the engine to use for the operation: nki.isa.vector_engine, nki.isa.scalar_engine, nki.isa.gpsimd_engine (only allowed for rsqrt), or nki.isa.unknown_engine (default, which lets the compiler select the best engine based on the input tile shape)
Returns:
an output tile of the (data <op0> operand0) <op1> operand1 computation
Example:
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
...

##################################################################
# Example 1: subtract 1.0 from all elements of tile a of
# shape (128, 512) and get the output tile in b
##################################################################
i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]
b = nisa.tensor_scalar(a[i_p, i_f], np.subtract, 1.0)

##################################################################
# Example 2: broadcast 1.0 into a shape of (128, 512) and subtract
# tile c from it to get output tile d
##################################################################
i_p = nl.arange(128)[:, None]
i_f = nl.arange(512)[None, :]
d = nisa.tensor_scalar(c[i_p, i_f], np.subtract, 1.0, reverse0=True)

##################################################################
# Example 3: broadcast multiply tile e with vector f and
# then broadcast add with scalar 2.5;
# tile e has a shape of (64, 1024) and vector f has a shape of (64, 1)
##################################################################
i_p_ef = nl.arange(64)[:, None]
i_f_e = nl.arange(1024)[None, :]
i_f_f = nl.arange(1)[None, :]
g = nisa.tensor_scalar(e[i_p_ef, i_f_e], op0=np.multiply, operand0=f[i_p_ef, i_f_f], op1=np.add, operand1=2.5)
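As an additional hedged sketch (assuming hypothetical tile p of shape (128, 256) and per-partition vectors v0 and v1 of shape (128, 1)), both operands may be vectors and either operator may be reversed:

##################################################################
# Example 4 (sketch): compute v1 - (p * v0), broadcasting the
# vectors v0 and v1 along the free dimension of tile p
##################################################################
i_p = nl.arange(128)[:, None]
i_f = nl.arange(256)[None, :]
i_v = nl.arange(1)[None, :]
h = nisa.tensor_scalar(p[i_p, i_f], op0=np.multiply, operand0=v0[i_p, i_v],
                       op1=np.subtract, operand1=v1[i_p, i_v], reverse1=True)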