[llvm-dev] [RFC] Vector Predication

Bruce Hoult via llvm-dev llvm-dev at lists.llvm.org
Tue Feb 5 03:06:17 PST 2019


On Tue, Feb 5, 2019 at 1:23 AM Simon Moll <moll at cs.uni-saarland.de> wrote:

I think this is the usual mixup of AVL and MVL.

AVL: is part of the predicate and can change between vector operations just like a mask can (light weight).

MVL: is the physical vector register length and can be re-configured per function (RVV only atm) - (heavy weight, stop-the-world instruction).

The vectorlen parameter in EVL intrinsics is for the AVL.

Unless I misunderstand, this doesn't describe RVV correctly, although this is understandable as the spec has moved around a bit in the last six or twelve months as it's gotten closer to being set in stone.

The way it has ended up (very unlikely to change now) is:

Here's an example:

void foo(size_t n, int64_t *dst, int32_t *a, int32_t *b){
  for (size_t i=0; i<n; ++i)
    dst[i] += a[i] * b[i];
}

If 32x32->64 multiplies are cheaper than 64x64->64 multiplies then you might want to compile this to:

# args n in a0, dst in a1, a in a2, b in a3, AVL in a4

foo:
    vsetvli a4, a0, vsew32,vlmul4    # vtype = 32-bit integer vectors, AVL in a4
    vlw.v v0, (a2)                   # Get 32b vector a into v0-v3
    vlw.v v4, (a3)                   # Get 32b vector b into v4-v7
    slli a5, a4, 2                   # multiply AVL by element size 4 bytes
    add a2, a2, a5                   # Bump pointer a
    add a3, a3, a5                   # Bump pointer b
    vwmul.vv v8, v0, v4              # 64b result in v8-v15

    vsetvli zero, a0, vsew64,vlmul8  # Operate on 64b values, discard new AVL as it's the same
    vld.v v16, (a1)                  # Get 64b vector dst into v16-v23
    vadd.vv v16, v16, v8             # add 64b elements in v8-v15 to v16-v23
    vsd.v v16, (a1)                  # Store vector of 64b
    slli a5, a4, 3                   # multiply AVL by element size 8 bytes
    add a1, a1, a5                   # Bump pointer dst
    sub a0, a0, a4                   # subtract AVL from n to get remaining count
    bnez a0, foo                     # Any more?
    ret
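For readers less used to the assembly, the control flow of that loop can be modelled in scalar C. This is only a sketch: model_vsetvl, foo_stripmined and the mvl parameter are made-up names for illustration, and model_vsetvl simply grants min(requested, mvl), which is one behaviour the hardware might choose, not the only one. The casts to int64_t mirror the widening vwmul.vv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vsetvli: grant min(requested, mvl) elements.
   Real hardware is free to grant fewer on the final iterations. */
static size_t model_vsetvl(size_t requested, size_t mvl) {
    assert(mvl > 0);
    return requested < mvl ? requested : mvl;
}

/* Scalar model of the strip-mined loop. mvl is an assumed element capacity. */
static void foo_stripmined(size_t n, int64_t *dst, const int32_t *a,
                           const int32_t *b, size_t mvl) {
    while (n != 0) {
        size_t vl = model_vsetvl(n, mvl);      /* vsetvli a4, a0, ...   */
        for (size_t i = 0; i < vl; ++i) {
            /* vwmul.vv (32x32->64) followed by vadd.vv on 64b elements */
            dst[i] += (int64_t)a[i] * (int64_t)b[i];
        }
        a += vl;                               /* bump pointer a        */
        b += vl;                               /* bump pointer b        */
        dst += vl;                             /* bump pointer dst      */
        n -= vl;                               /* sub a0, a0, a4        */
    }
}
```

Calling foo_stripmined(n, dst, a, b, 64) walks the arrays in chunks of at most 64 elements, the same way the assembly loop consumes a4 elements per trip.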

The alternative of course is to set up for 64 bit elements at the outset, let the two vlw.v's for a and b widen the 32 bit loads into 64 bit elements, then do 64x64->64 multiplies. The code would be two instructions shorter, saving one of the vsetvli (4 bytes) and one of the shifts (2 bytes).

Assuming for the moment a 512 bit (64 byte) vector register size (total vector register file 2 KB), this function initially sets the MVL to 64 (2048 bits divided into 32-bit elements). The widening multiply produces 64 64-bit elements. The second half of the loop then sets the element size to 64 bits and doubles the vlmul, so the MVL is still 64 (4096 bits divided into 64-bit elements). The load, add, and store of dst then take place using 64 bit calculations.
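The arithmetic in that paragraph is just register bits times vlmul divided by element width. A small C sketch of it, where max_vector_length is a hypothetical helper name and the 512-bit register size is the assumption stated above:

```c
#include <assert.h>
#include <stddef.h>

/* Elements in one register group: (bits per register * vlmul) / element bits.
   Hypothetical helper for illustration only. */
static size_t max_vector_length(size_t vlen_bits, size_t vlmul, size_t sew_bits) {
    assert(sew_bits > 0);
    return vlen_bits * vlmul / sew_bits;
}
```

With 512-bit registers, max_vector_length(512, 4, 32) and max_vector_length(512, 8, 64) both come out to 64 elements, which is why the second vsetvli can discard its result.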

Except on the last iteration [1] the AVL will be the same as the MVL. Both will change (in bits, not in number of elements in this case) twice in each loop.

[1] if on the 2nd to last iteration there are, say, 72 elements left, the vsetvli instruction might choose to return an AVL of 36 elements, leaving 36 for the last iteration, rather than doing 64 and then leaving only 8 for the last iteration. Or maybe 48 and 24, or 40 and 32 depending on what suits that particular hardware. Or maybe it will equalise the last three or four or more iterations. The main rule is the AVL must decrease monotonically.
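One possible policy of the kind described in [1], sketched in C. The function name balanced_vl and the exact rule (split the tail evenly once fewer than two full vectors remain) are assumptions chosen to reproduce the 36 and 36 case; actual hardware may pick something else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vsetvli policy: full vectors while at least 2*mvl elements
   remain, then split what is left into two roughly equal pieces. */
static size_t balanced_vl(size_t remaining, size_t mvl) {
    assert(mvl > 0);
    if (remaining <= mvl)
        return remaining;              /* everything fits in one vector  */
    if (remaining < 2 * mvl)
        return (remaining + 1) / 2;    /* even out the last two trips    */
    return mvl;                        /* plenty left: use a full vector */
}
```

With 72 elements left and an MVL of 64 this returns 36, and the following call with 36 left returns 36 again, so the granted length never increases from one iteration to the next.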


