nki.isa.nc_stream_shuffle — AWS Neuron Documentation

This document is relevant for: Inf2, Trn1, Trn2

nki.isa.nc_stream_shuffle

nki.isa.nc_stream_shuffle(src, dst, shuffle_mask, *, dtype=None, mask=None, **kwargs)

Apply cross-partition data movement within a quadrant of 32 partitions from source tile src to destination tile dst using the Vector Engine.

Both source and destination tiles can be in either SBUF or PSUM, and are passed in by reference as arguments. In-place shuffle is allowed, i.e., dst may be the same tile as src.

shuffle_mask is a 32-element list. Each mask element must be an int or an affine expression. shuffle_mask[i] indicates which input partition the output partition [i] copies from within each 32-partition quadrant. The special value shuffle_mask[i]=255 means the output tensor in partition [i] will be unmodified.

nc_stream_shuffle can be applied to multiple quadrants. When more than one quadrant is involved, the same shuffle_mask is applied to each quadrant.

mask applies to dst, meaning that locations masked out by mask will be unmodified. For more information about the cross-partition data movement, see Cross-partition Data Movement.
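The shuffle_mask semantics described above can be sketched as a plain-Python reference model. This is not NKI code and does not run on the Vector Engine; the names reference_shuffle, UNMODIFIED, and QUADRANT are hypothetical and exist only for this illustration. The model assumes that an in-place call reads the old source values before writing, which is consistent with the in-place rotation in the example on this page, and it does not model the mask argument or affine-expression mask elements.

```python
# Reference model in plain Python (not NKI code); all names here are hypothetical.
UNMODIFIED = 255  # special mask value from the text: leave the output partition as is
QUADRANT = 32     # partitions per quadrant


def reference_shuffle(src, dst, shuffle_mask):
    """Return a new list of rows modelling dst after the shuffle.

    src and dst are lists of partitions (rows); each row is a list of elements.
    """
    if len(shuffle_mask) != QUADRANT:
        raise ValueError("shuffle_mask must have 32 elements")
    if len(src) != len(dst) or len(src) % QUADRANT != 0:
        raise ValueError("src and dst must span the same whole number of quadrants")
    for m in shuffle_mask:
        if m != UNMODIFIED and not 0 <= m < QUADRANT:
            raise ValueError("mask entries must be in 0..31 or equal to 255")

    # Assumption: snapshot src first so an in-place call reads the old values.
    src_rows = [list(row) for row in src]
    out = [list(row) for row in dst]
    for base in range(0, len(src_rows), QUADRANT):  # same mask for every quadrant
        for i, m in enumerate(shuffle_mask):
            if m == UNMODIFIED:
                continue  # output partition base + i keeps its existing data
            out[base + i] = list(src_rows[base + m])
    return out


# Rotate by one: output partition i copies from input partition (i - 1) % 32.
data = [[p] * 4 for p in range(64)]  # 64 partitions = 2 quadrants
rotated = reference_shuffle(data, data, [(i - 1) % 32 for i in range(32)])
```

With two quadrants, the rotation wraps inside each quadrant separately: partition 0 receives the data of partition 31, and partition 32 receives the data of partition 63.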

This API has 4 constraints on src and dst:

  1. dst must have the same data type as src.
  2. dst must occupy the same number of partitions num_partition and have the same number of elements per partition as src.
  3. The number of partitions num_partition accessed by src and dst must be 32, 64, 96, or 128. The start partition start_partition of src does not have to match the start_partition of dst.
  4. num_partition and start_partition accessed by src and dst must follow the rules below.
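As a rough illustration, constraints 1 to 3 can be checked ahead of time with a small helper. The function check_shuffle_operands and the tuple-based shape representation are hypothetical and not part of NKI; constraint 4 is deliberately left out because its rules are not reproduced here.

```python
# Hypothetical helper, not part of NKI: checks constraints 1-3 from the list above.
ALLOWED_NUM_PARTITIONS = (32, 64, 96, 128)


def check_shuffle_operands(src_shape, src_dtype, dst_shape, dst_dtype):
    """Return a list of human-readable violations; an empty list means 1-3 hold.

    Shapes are (num_partition, elements_per_partition) tuples.
    """
    problems = []
    if src_dtype != dst_dtype:
        problems.append("dst must have the same data type as src")
    if src_shape[0] != dst_shape[0]:
        problems.append("dst must occupy the same number of partitions as src")
    if src_shape[1] != dst_shape[1]:
        problems.append("dst must have the same number of elements per partition as src")
    if src_shape[0] not in ALLOWED_NUM_PARTITIONS:
        problems.append("num_partition must be 32, 64, 96, or 128")
    return problems


ok = check_shuffle_operands((64, 128), "float32", (64, 128), "float32")
bad = check_shuffle_operands((48, 128), "float32", (48, 128), "bfloat16")
```

Here ok is an empty list, while bad reports both the data type mismatch and the unsupported partition count.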

Estimated instruction cost:

max(MIN_II, N) Vector Engine cycles, where N is the number of elements per partition in src, and MIN_II is the minimum instruction initiation interval for small input tiles. MIN_II is roughly 64 engine cycles.
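The cost formula above can be evaluated with a one-line helper. This is only a sketch: estimated_cycles is a hypothetical name, and the default of 64 reflects the "roughly 64" figure stated above rather than an exact hardware constant.

```python
# Hypothetical helper: evaluates max(MIN_II, N) from the cost estimate above.
MIN_II_CYCLES = 64  # approximate; the text says MIN_II is "roughly 64" engine cycles


def estimated_cycles(elements_per_partition, min_ii=MIN_II_CYCLES):
    """Estimated Vector Engine cycles for N elements per partition in src."""
    return max(min_ii, elements_per_partition)


small = estimated_cycles(16)    # small tile: bounded below by MIN_II
large = estimated_cycles(512)   # large tile: cost scales with N
```

Under this estimate, tiles with fewer than about 64 elements per partition all cost the same, so very small shuffles do not get cheaper by shrinking the tile.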

Parameters:

Example:

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl

#####################################################################
# Example 1:
# Apply cross-partition data movement to a 32-partition tensor,
# in-place shuffling the data in partition[i] to partition[(i+1)%32].
#####################################################################
a = nl.load(in_tensor)
a_mgrid = nl.mgrid[0:32, 0:128]
shuffle_mask = [(i - 1) % 32 for i in range(32)]
nisa.nc_stream_shuffle(src=a[a_mgrid.p, a_mgrid.x], dst=a[a_mgrid.p, a_mgrid.x], shuffle_mask=shuffle_mask)

nl.store(out_tensor, value=a)
