nki.isa.nc_stream_shuffle - AWS Neuron Documentation
This document is relevant for: Inf2, Trn1, Trn2
nki.isa.nc_stream_shuffle
nki.isa.nc_stream_shuffle(src, dst, shuffle_mask, *, dtype=None, mask=None, **kwargs)
Apply cross-partition data movement within a quadrant of 32 partitions from source tile src to destination tile dst using Vector Engine.
Both source and destination tiles can be in either SBUF or PSUM, and are passed in by reference as arguments. In-place shuffle is allowed, i.e., dst can be the same as src. shuffle_mask is a 32-element list. Each mask element must be an int or an affine expression. shuffle_mask[i] indicates which input partition the output partition [i] copies from within each 32-partition quadrant. The special value shuffle_mask[i]=255 means the output tensor in partition [i] will be unmodified. nc_stream_shuffle can be applied to multiple quadrants. When more than one quadrant is involved, the same shuffle_mask is applied to each quadrant. mask applies to dst, meaning that locations masked out by mask will be unmodified. For more information about the cross-partition data movement, see Cross-partition Data Movement.
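The shuffle semantics described above can be modeled in plain Python. The following is a minimal sketch, not the NKI implementation: reference_shuffle, QUADRANT, and KEEP are hypothetical names, tiles are modeled as lists of rows (one row per partition), and the sketch assumes that mask values other than 255 lie in the range 0 to 31 and that an in-place call reads the original source data.

```python
QUADRANT = 32   # partitions per quadrant, as stated in the description
KEEP = 255      # special mask value: leave the output partition unmodified


def reference_shuffle(src, dst, shuffle_mask):
    """Pure-Python model of the shuffle semantics (hypothetical helper).

    src and dst are lists of rows, one row per partition.
    """
    if len(shuffle_mask) != QUADRANT:
        raise ValueError("shuffle_mask must have exactly 32 elements")
    if len(src) != len(dst) or len(src) % QUADRANT != 0:
        raise ValueError("src and dst must cover the same whole number of quadrants")
    for i, m in enumerate(shuffle_mask):
        # Assumption: valid source indices are 0..31 within the quadrant.
        if m != KEEP and not 0 <= m < QUADRANT:
            raise ValueError(f"invalid mask value {m} at index {i}")

    # Snapshot the source so an in-place call (dst is src) reads original data.
    snapshot = [list(row) for row in src]

    for base in range(0, len(src), QUADRANT):    # same mask for every quadrant
        for i, m in enumerate(shuffle_mask):
            if m == KEEP:
                continue                         # output partition stays as it was
            dst[base + i] = list(snapshot[base + m])
    return dst


# Two quadrants (64 partitions), one element per partition, rotated in place.
tile = [[p] for p in range(64)]
reference_shuffle(tile, tile, [(i - 1) % 32 for i in range(32)])
```

In this model each quadrant rotates independently, so data never crosses a quadrant boundary.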
This API has 4 constraints on src and dst:
- dst must have the same data type as src.
- dst must occupy the same number of partitions num_partition and have the same number of elements per partition as src.
- The number of partitions num_partition accessed by src and dst must be 32, 64, 96, or 128. The start partition start_partition of src does not have to match start_partition of dst.
- num_partition and start_partition accessed by src and dst must follow the rules below.
  - If num_partition is 96 or 128, start_partition must be 0.
  - If num_partition is 64, start_partition must be 0 or 64.
  - If num_partition is 32, start_partition must be 0, 32, 64, or 96.
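The partition rules above can be checked before issuing the instruction. This is a small sketch in plain Python; ALLOWED_STARTS and is_valid_partition_range are hypothetical names and are not part of NKI. The check would apply separately to src and to dst, since their start partitions need not match.

```python
# Allowed start partitions for each supported partition count,
# transcribed from the rules listed above.
ALLOWED_STARTS = {
    32: (0, 32, 64, 96),
    64: (0, 64),
    96: (0,),
    128: (0,),
}


def is_valid_partition_range(num_partition, start_partition):
    """Return True if the (num_partition, start_partition) pair obeys the rules."""
    return start_partition in ALLOWED_STARTS.get(num_partition, ())


# A 64-partition access may start at partition 64 but not at partition 32.
ok = is_valid_partition_range(64, 64)
bad = is_valid_partition_range(64, 32)
```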
Estimated instruction cost:
max(MIN_II, N) Vector Engine cycles, where N is the number of elements per partition in src, and MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
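As a quick illustration of the cost formula, the sketch below evaluates max(MIN_II, N) in plain Python. The function name estimated_cycles is hypothetical, and the value 64 is only the approximate figure quoted above, so the results should be read as rough estimates.

```python
MIN_II = 64  # "roughly 64" engine cycles per the text; an approximation


def estimated_cycles(elements_per_partition, min_ii=MIN_II):
    """Estimated Vector Engine cycles: max(MIN_II, N)."""
    return max(min_ii, elements_per_partition)


# Small tiles are bounded below by MIN_II; larger tiles scale with N.
small = estimated_cycles(16)
large = estimated_cycles(512)
```

Under this estimate, any tile with 64 or fewer elements per partition costs about the same as one with 64.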
Parameters:
- src – the source tile
- dst – the destination tile
- shuffle_mask – a 32-element list that specifies the shuffle source and destination partition
- dtype – (optional) data type to cast the output type to (see Supported Data Types for more information); if not specified, it defaults to the data type of the input tile.
- mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
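Because shuffle_mask[i] names the input partition that output partition [i] copies from, a few common masks can be built with small helpers. The following is a sketch in plain Python; rotate_mask, reverse_mask, and partial_mask are hypothetical names, not NKI functions, and the 0 to 31 range check is an assumption based on the 32-partition quadrant.

```python
KEEP = 255  # special value: output partition is left unmodified


def rotate_mask(shift):
    """Output partition i copies from partition (i - shift) % 32."""
    return [(i - shift) % 32 for i in range(32)]


def reverse_mask():
    """Output partition i copies from partition 31 - i."""
    return [31 - i for i in range(32)]


def partial_mask(mapping):
    """Build a mask from {output_partition: input_partition}.

    Output partitions not named in the mapping stay unmodified (255).
    """
    mask = [KEEP] * 32
    for out_p, in_p in mapping.items():
        # Assumption: both indices address a partition within one quadrant.
        if not (0 <= out_p < 32 and 0 <= in_p < 32):
            raise ValueError("partition indices must be in the range 0..31")
        mask[out_p] = in_p
    return mask


# Move data from partition i to partition (i + 1) % 32.
rotate_by_one = rotate_mask(1)
# Copy partition 5 into partition 0 and leave every other partition alone.
copy_one = partial_mask({0: 5})
```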
Example:
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl

#####################################################################
# Example 1:
# Apply cross-partition data movement to a 32-partition tensor,
# in-place shuffling the data in partition[i] to partition[(i+1)%32].
#####################################################################
a = nl.load(in_tensor)
a_mgrid = nl.mgrid[0:32, 0:128]
# shuffle_mask[i] is the input partition that output partition i copies from
shuffle_mask = [(i - 1) % 32 for i in range(32)]
nisa.nc_stream_shuffle(src=a[a_mgrid.p, a_mgrid.x], dst=a[a_mgrid.p, a_mgrid.x], shuffle_mask=shuffle_mask)

nl.store(out_tensor, value=a)