nki.compiler.sbuf.mod_alloc - AWS Neuron Documentation

This document is relevant for: Inf2, Trn1, Trn2

nki.compiler.sbuf.mod_alloc

nki.compiler.sbuf.mod_alloc(*, base_addr, base_partition=0, num_par_tiles=(), num_free_tiles=())

Allocate SBUF memory space for each logical tile in a tensor through modulo allocation.

This is one of the NKI direct allocation APIs. We recommend reading the NKI Direct Allocation Developer Guide before using these APIs.

This API is equivalent to calling nki.compiler.sbuf.alloc() with the callable sbuf_modulo_alloc_func defined below.

from typing import Optional, Tuple
from functools import reduce
from operator import mul
import unittest

def num_elms(shape):
    return reduce(mul, shape, 1)

def linearize(shape, indices):
    return sum(i * num_elms(shape[dim+1:]) for dim, i in enumerate(indices))

def modulo_allocate_func(base, allocate_shape, scale):
    def func(indices):
        if not allocate_shape:
            # default shape is always (1, 1, ...)
            allocate_shape_ = (1, ) * len(indices)
        else:
            allocate_shape_ = allocate_shape
        mod_idx = tuple(i % s for i, s in zip(indices, allocate_shape_))
        return linearize(shape=allocate_shape_, indices=mod_idx) * scale + base
    return func

def mod_alloc(base_addr: int, *,
              base_partition: Optional[int] = 0,
              num_par_tiles: Optional[Tuple[int, ...]] = (),
              num_free_tiles: Optional[Tuple[int, ...]] = ()):
    def sbuf_modulo_alloc_func(idx, pdim_size, fdim_size):
        return (modulo_allocate_func(base_partition, num_par_tiles, pdim_size)(idx),
                modulo_allocate_func(base_addr, num_free_tiles, fdim_size)(idx))
    return sbuf_modulo_alloc_func
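As an illustration of how the definition above handles a multi-dimensional logical tile index, the following sketch restates it in plain Python (no Neuron imports) and evaluates it for a hypothetical 2 x 2 grid of physical tiles. The tile index (3, 1), the partition size 128, and the 1024-byte tile size are made-up inputs for illustration only, and the sketch assumes fdim_size is the tile size in bytes along the free dimension.

```python
from functools import reduce
from operator import mul

def num_elms(shape):
    # Number of elements in a shape; an empty shape counts as 1.
    return reduce(mul, shape, 1)

def linearize(shape, indices):
    # Row-major flattening of a multi-dimensional index.
    return sum(i * num_elms(shape[dim + 1:]) for dim, i in enumerate(indices))

def modulo_allocate_func(base, allocate_shape, scale):
    def func(indices):
        # An empty allocate_shape means one physical tile per dimension.
        allocate_shape_ = allocate_shape or (1,) * len(indices)
        mod_idx = tuple(i % s for i, s in zip(indices, allocate_shape_))
        return linearize(shape=allocate_shape_, indices=mod_idx) * scale + base
    return func

def mod_alloc(base_addr, *, base_partition=0, num_par_tiles=(), num_free_tiles=()):
    def sbuf_modulo_alloc_func(idx, pdim_size, fdim_size):
        return (modulo_allocate_func(base_partition, num_par_tiles, pdim_size)(idx),
                modulo_allocate_func(base_addr, num_free_tiles, fdim_size)(idx))
    return sbuf_modulo_alloc_func

# Hypothetical inputs: logical tiles indexed by (i, j), backed by 2 x 2 physical tiles.
alloc = mod_alloc(base_addr=0, num_free_tiles=(2, 2))
PDIM_SIZE = 128      # assumed partition extent of one tile
FDIM_BYTES = 1024    # assumed free-dimension size of one tile, in bytes

# (3, 1) wraps to (1, 1), which linearizes to slot 3 of the 2 x 2 grid.
placement = alloc((3, 1), PDIM_SIZE, FDIM_BYTES)
print(placement)  # (0, 3072)
```

Because num_par_tiles is left empty, every logical tile starts at base_partition; only the free-dimension byte address rotates.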

Here’s an example usage of this API:

nki_tensor = nl.ndarray((4, par_dim(128), 512), dtype=nl.bfloat16,
                        buffer=nki.compiler.sbuf.mod_alloc(base_addr=0, num_free_tiles=(2, )))

for i_block in nl.affine_range(4):
    nki_tensor[i_block, :, :] = nl.load(...)
    ... = nl.exp(nki_tensor[i_block, :, :])

This produces the following allocation:

Table 2 Modulo Allocation Example

Logical Tile Index | Physical Tile start_partition | Physical Tile byte_addr
(0, )              | 0                             | 0 + (0 % 2) * 512 * sizeof(nl.bfloat16) = 0
(1, )              | 0                             | 0 + (1 % 2) * 512 * sizeof(nl.bfloat16) = 1024
(2, )              | 0                             | 0 + (2 % 2) * 512 * sizeof(nl.bfloat16) = 0
(3, )              | 0                             | 0 + (3 % 2) * 512 * sizeof(nl.bfloat16) = 1024
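The byte addresses in the table can be reproduced with a few lines of plain Python. This is a simplified one-dimensional sketch of the arithmetic only, not the library implementation; the constant and function names are chosen here for illustration.

```python
BASE_ADDR = 0
NUM_FREE_TILES = 2          # num_free_tiles=(2, ) in the example
BF16_BYTES = 2              # sizeof(nl.bfloat16)
TILE_BYTES = 512 * BF16_BYTES

def byte_addr(i_block):
    # Logical tile i_block reuses physical tile (i_block % NUM_FREE_TILES).
    return BASE_ADDR + (i_block % NUM_FREE_TILES) * TILE_BYTES

byte_addrs = [byte_addr(i) for i in range(4)]
for i, addr in enumerate(byte_addrs):
    print(f"({i}, ) -> start_partition=0, byte_addr={addr}")
```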

With the above scheme, we can implement double buffering in nki_tensor: nl.load in one iteration can write to one physical tile while nl.exp of the previous iteration simultaneously reads from the other physical tile.
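A toy model can make the overlap condition concrete. The sketch below is not how the compiler schedules instructions; it only checks, under the assumption that the load of iteration i overlaps with the exp of iteration i - 1, whether the two would touch the same physical tile. The helper names are hypothetical.

```python
def physical_slot(i_block, num_free_tiles):
    # Physical tile used by logical tile i_block under modulo allocation.
    return i_block % num_free_tiles

def conflicts(num_blocks, num_free_tiles):
    # Iterations whose load would land on the tile still being read
    # by the previous iteration's exp.
    return [i for i in range(1, num_blocks)
            if physical_slot(i, num_free_tiles) == physical_slot(i - 1, num_free_tiles)]

print(conflicts(4, 2))  # [] : two physical tiles, adjacent iterations never collide
print(conflicts(4, 1))  # [1, 2, 3] : one physical tile, every overlap collides
```

With two physical tiles, adjacent iterations always map to different tiles, which is what allows the load and the compute to overlap.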

Note

In the current release, programmers cannot mix NKI tensor declarations that use automatic allocation (ncc.sbuf.auto_alloc() or the PSUM variant) with those that use the direct allocation APIs (ncc.sbuf.alloc(), ncc.sbuf.mod_alloc() or the PSUM variants).

Parameters:
