nki.isa.dma_copy — AWS Neuron Documentation
This document is relevant for: Inf2, Trn1, Trn2
nki.isa.dma_copy
nki.isa.dma_copy(*, dst, src, mask=None, dst_rmw_op=None, oob_mode=oob_mode.error, dge_mode=dge_mode.unknown)
Copy data from src to dst using the DMA engine. Both the src and dst tiles can be in device memory (HBM) or SBUF. However, if both the src and dst tiles are in SBUF, consider using nisa.tensor_copy instead for better performance.
Parameters:
- src – the source of the copy.
- dst – the destination of the copy.
- dst_rmw_op – the read-modify-write operation to be performed at the destination (see Example 5 below). Currently only np.add is supported, which adds the source data to the existing destination data. If None, the source data directly overwrites the destination. If dst_rmw_op is specified, only oob_mode=oob_mode.error is allowed. For best performance with the Descriptor Generation Engine (DGE), unique dynamic offsets must be used to access dst; multiple accesses to the same offset cause a data hazard. If duplicated offsets are present, the compiler automatically adds synchronization to avoid the hazard, which slows down computation.
- mask – (optional) a compile-time constant predicate that controls whether/how this instruction is executed (see NKI API Masking for details)
- oob_mode – (optional) specifies how to handle out-of-bounds (OOB) array indices during indirect access operations (see Examples 2–4 below). Valid modes are:
  - oob_mode.error: (default) raises an error when encountering out-of-bounds indices.
  - oob_mode.skip: silently skips any operations involving out-of-bounds indices.
  For example, when using indirect gather/scatter operations, out-of-bounds indices can occur if the index array contains values that exceed the dimensions of the target array.
- dge_mode – (optional) specifies which Descriptor Generation Engine (DGE) mode to use for the copy: nki.isa.dge_mode.none (turn off DGE), nki.isa.dge_mode.swdge (software DGE), nki.isa.dge_mode.hwdge (hardware DGE), or nki.isa.dge_mode.unknown (the default, which lets the compiler select the best DGE mode). HWDGE is only supported on NeuronCore-v3 and later. See the sketch after this list.
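As an illustration, here is a minimal sketch that pins the DGE mode explicitly instead of leaving the choice to the compiler (the tensors a and b are hypothetical placeholders, as in Example 1 below):

import neuronxcc.nki.isa as nisa

# Force software DGE for this copy rather than the compiler-selected default.
# On NeuronCore-v3 and later, hardware DGE could be requested with
# dge_mode.hwdge instead.
nisa.dma_copy(dst=b, src=a, dge_mode=nisa.dge_mode.swdge)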
A cast will occur if src and dst have different dtypes.
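For instance, the following minimal sketch (assuming a float32 in_tensor in HBM that fits in a single tile) casts to bfloat16 as part of the transfer:

import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl

# Allocate a bfloat16 destination tile; since its dtype differs from the
# float32 source, dma_copy casts the data during the transfer.
out_tile = nisa.memset(shape=in_tensor.shape, value=0, dtype=nl.bfloat16)
nisa.dma_copy(dst=out_tile, src=in_tensor)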
Example:
import neuronxcc.nki.isa as nisa

############################################################################
# Example 1: Copy over the tensor to another tensor
############################################################################
nisa.dma_copy(dst=b, src=a)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 2: Load elements from HBM with indirect addressing. If addressing
# results in out-of-bounds access, the operation will fail.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n//2, 0:m]

expr_arange = 2*nl.arange(n//2)[:, None]
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)

out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.error)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 3: Load elements from HBM with indirect addressing. If addressing
# results in out-of-bounds access, the operation will fail.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n//2, 0:m]

# indices are out of range on purpose to demonstrate the error
expr_arange = 3*nl.arange(n//2)[:, None]
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)

out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.error)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 4: Load elements from HBM with indirect addressing. If addressing
# results in out-of-bounds access, the operation will skip those indices.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n//2, 0:m]

# indices are out of range on purpose
expr_arange = 3*nl.arange(n//2)[:, None]
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)

out_tile: tensor[64, 512] = nisa.memset(shape=(n//2, m), value=-1, dtype=in_tensor.dtype)
nisa.dma_copy(src=in_tensor[idx_tile, iy], dst=out_tile[ix, iy], oob_mode=nisa.oob_mode.skip)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 5: Store elements to HBM with indirect addressing and with a
# read-modify-write operation.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n, 0:m]

expr_arange = 2*nl.arange(n)[:, None]
inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)

out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=1, dtype=in_tensor.dtype)
nl.store(out_tensor, value=out_tile)
nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, dst_rmw_op=np.add)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 6: Store elements to HBM with indirect addressing. If indirect
# addressing results in out-of-bounds access, the operation will fail.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n, 0:m]

expr_arange = 2*nl.arange(n)[:, None]
inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)

out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
nl.store(out_tensor, value=out_tile)
nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.error)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 7: Store elements to HBM with indirect addressing. If indirect
# addressing results in out-of-bounds access, the operation will fail.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n, 0:m]

# indices are out of range on purpose to demonstrate the error
expr_arange = 3*nl.arange(n)[:, None]
inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)

out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
nl.store(out_tensor, value=out_tile)
nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.error)
import neuronxcc.nki.isa as nisa
import neuronxcc.nki.language as nl
import numpy as np
from neuronxcc.nki.typing import tensor

############################################################################
# Example 8: Store elements to HBM with indirect addressing. If indirect
# addressing results in out-of-bounds access, the operation will skip those
# indices.
############################################################################
...
n, m = in_tensor.shape
ix, iy = nl.mgrid[0:n, 0:m]

# indices are out of range on purpose
expr_arange = 3*nl.arange(n)[:, None]
inp_tile: tensor[64, 512] = nl.load(in_tensor[ix, iy])
idx_tile: tensor[64, 1] = nisa.iota(expr_arange, dtype=np.int32)

out_tile: tensor[128, 512] = nisa.memset(shape=(2*n, m), value=-1, dtype=in_tensor.dtype)
nl.store(out_tensor, value=out_tile)
nisa.dma_copy(dst=out_tensor[idx_tile, iy], src=inp_tile, oob_mode=nisa.oob_mode.skip)