[llvm-dev] [RFC] Vector Predication
Bruce Hoult via llvm-dev llvm-dev at lists.llvm.org
Tue Feb 5 03:06:17 PST 2019
On Tue, Feb 5, 2019 at 1:23 AM Simon Moll <moll at cs.uni-saarland.de> wrote:
> I think this is the usual mixup of AVL and MVL.
>
> AVL: is part of the predicate and can change between vector operations just like a mask can (light weight).
> MVL: Is the physical vector register length and can be re-configured per function (RVV only atm) - (heavy weight, stop-the-world instruction).
> The vectorlen parameter in EVL intrinsics is for the AVL.
Unless I misunderstand, this doesn't describe RVV correctly, although this is understandable as the spec has moved around a bit in the last six or twelve months as it's gotten closer to being set in stone.
The way it has ended up (very unlikely to change now) is:
- any given RVV vector unit has 32 registers each with the same and fixed length in bits.
- the vector unit is configured by the VSETVL[I] instruction which has two arguments: 1) the requested AVL, and 2) the vtype (vector type).
  The vtype is an integer with several small fields, of which two are currently defined (the other bits must be zero). The fields are the Standard Element Width and VLMul. SEW can be any power of 2 from 8 bits up to some implementation-defined maximum (1024 bits absolute maximum). VLMul says that you don't actually need 32 distinct vector variables in your current loop/function and you're willing to trade number of registers for a larger MVL. So, you can gang together each even/odd register pair into 16 longer registers (named 0,2,4...30), or you can gang together groups of four or at most eight registers.
- the current MVL -- the maximum number of elements in a vector register -- is the hardware register length, multiplied by the VLMul field in vtype, divided by the SEW field in vtype.
- the AVL is the smaller of MVL and the requested AVL.
- only two things can change AVL: the VSETVL[I] instruction, and a special kind of memory load: "Unit-stride First-Fault Loads" if the load crosses a protection boundary and the tail of the vector is inaccessible. This kind of load is relatively uncommon and exists so you can vectorise things where the end of the application vector is data-dependent rather than counted. The canonical example is strlen()/strcpy(). For most code you can ignore it and say the AVL changes only when you execute VSETVL[I].
- any time the program uses VSETVL[I] both the MVL and the AVL can change.
- the common case is a loop with the vtype in an immediate VSETVLI at the head of the loop. In this case, the AVL potentially changes in every iteration of the loop (but usually only in the last one or two iterations). As the vtype is in an immediate it can't change from iteration to iteration. But it's common for two loops in the same function to use different vtype, and so different MVL, because the loops might either operate on different data types, or need a different number of vector variables in the loop, or both.
- VSETVL[I] is not heavyweight, even if it changes the MVL. It's quite ok to execute it as much as you want -- even before every vector instruction if you want. That would be pretty unusual, and I think falls more into the "clever hand-written code" area than into anything a compiler is likely to want to generate from C loops, although it's certainly possible.
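To make the MVL and AVL rules above concrete, here is a minimal C sketch of the arithmetic. The function names (mvl_elements, granted_avl) are hypothetical helpers invented for illustration, not part of any RVV toolchain, and granted_avl models only the simple "smaller of MVL and requested AVL" rule.

```c
#include <stddef.h>

/* MVL in elements: hardware register length in bits, multiplied by VLMul,
 * divided by SEW. All three inputs are assumed to be powers of two. */
size_t mvl_elements(size_t vlen_bits, size_t vlmul, size_t sew_bits) {
    return vlen_bits * vlmul / sew_bits;
}

/* AVL granted by VSETVL[I] under the simple rule: min(MVL, requested AVL).
 * Hypothetical model only; real hardware is the authority. */
size_t granted_avl(size_t requested, size_t mvl) {
    return requested < mvl ? requested : mvl;
}
```

For instance, a hypothetical 512-bit register with VLMul 4 and SEW 32 gives mvl_elements(512, 4, 32) == 64, and a request for 100 elements would then be granted 64.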
Here's an example:
void foo(size_t n, int64_t *dst, int32_t *a, int32_t *b){
  for (size_t i=0; i<n; ++i)
    dst[i] += a[i] * b[i];
}
If 32x32->64 multiplies are cheaper than 64x64->64 multiplies then you might want to compile this to:
# args n in a0, dst in a1, a in a2, b in a3, AVL in a4
foo:
    vsetvli a4, a0, vsew32,vlmul4    # vtype = 32-bit integer vectors, AVL in a4
    vlw.v v0, (a2)                   # Get 32b vector a into v0-v3
    vlw.v v4, (a3)                   # Get 32b vector b into v4-v7
    slli a5, a4, 2                   # multiply AVL by element size 4 bytes
    add a2, a2, a5                   # Bump pointer a
    add a3, a3, a5                   # Bump pointer b
    vwmul.vv v8, v0, v4              # 64b result in v8-v15

    vsetvli zero, a0, vsew64,vlmul8  # Operate on 64b values, discard new AVL as it's the same
    vld.v v16, (a1)                  # Get 64b vector dst into v16-v23
    vadd.vv v16, v16, v8             # add 64b elements in v8-v15 to v16-v23
    vsd.v v16, (a1)                  # Store vector of 64b
    slli a5, a4, 3                   # multiply AVL by element size 8 bytes
    add a1, a1, a5                   # Bump pointer dst
    sub a0, a0, a4                   # subtract AVL from n to get remaining count
    bnez a0, foo                     # Any more?
    ret
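For readers who don't read RVV assembly, the control flow of that loop can be modelled in plain scalar C. This is only a sketch: model_vsetvl is a hypothetical stand-in for vsetvli that grants min(requested, MVL), the mvl parameter is an assumption standing in for whatever the hardware provides, and the explicit int64_t cast is my addition to mirror the widening vwmul.vv.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vsetvli: grants min(requested, mvl) elements. */
static size_t model_vsetvl(size_t requested, size_t mvl) {
    return requested < mvl ? requested : mvl;
}

/* Scalar model of the strip-mined loop. mvl is assumed to be nonzero,
 * otherwise the loop would never make progress. */
void foo_model(size_t n, int64_t *dst, const int32_t *a, const int32_t *b,
               size_t mvl) {
    while (n != 0) {
        size_t avl = model_vsetvl(n, mvl);       /* vsetvli a4, a0, ...     */
        for (size_t i = 0; i < avl; ++i)         /* one "vector" operation  */
            dst[i] += (int64_t)a[i] * b[i];      /* vwmul.vv then vadd.vv   */
        a += avl;                                /* bump pointer a          */
        b += avl;                                /* bump pointer b          */
        dst += avl;                              /* bump pointer dst        */
        n -= avl;                                /* remaining count         */
    }
}
```

Each pass of the while loop corresponds to one trip through the assembly from foo: to bnez, with the inner for loop standing in for the vector instructions operating on avl elements at once.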
The alternative of course is to set up for 64 bit elements at the outset, let the two vlw.v's for a and b widen the 32 bit loads into 64 bit elements, then do 64x64->64 multiplies. The code would be two instructions shorter, saving one of the vsetvli (4 bytes) and one of the shifts (2 bytes).
Assuming for the moment a 512 bit (64 byte) vector register size (total vector register file 2 KB), this function initially sets the MVL to 64 (2048 bits divided into 32-bit elements). The widening multiply produces 64 64-bit elements. The second half of the loop then sets the element size to 64 bits and doubles the vlmul, so the MVL is still 64 (4096 bits divided into 64-bit elements). The load, add, and store of dst then takes place using 64 bit calculations.
Except on the last iteration [1] the AVL will be the same as the MVL. Both will change (in bits, not in number of elements in this case) twice in each loop iteration.
[1] if on the 2nd to last iteration there are, say, 72 elements left, the vsetvli instruction might choose to return an AVL of 36 elements, leaving 36 for the last iteration, rather than doing 64 and then leaving only 8 for the last iteration. Or maybe 48 and 24, or 40 and 32 depending on what suits that particular hardware. Or maybe it will equalise the last three or four or more iterations. The main rule is the AVL must decrease monotonically.
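As an illustration of the footnote, here is one possible balancing policy sketched in C. It is purely hypothetical: balanced_avl is an invented name, and real hardware is free to choose differently. The idea is to split the remaining elements evenly over the fewest iterations that fit within the MVL.

```c
#include <stddef.h>

/* Hypothetical balancing policy: spread `remaining` elements evenly over
 * the fewest iterations that fit in `mvl`. Returns 0 if either input is 0. */
size_t balanced_avl(size_t remaining, size_t mvl) {
    if (remaining == 0 || mvl == 0)
        return 0;
    size_t iters = (remaining + mvl - 1) / mvl;   /* ceil(remaining / mvl)   */
    return (remaining + iters - 1) / iters;       /* ceil(remaining / iters) */
}
```

With 72 elements remaining and an MVL of 64 this returns 36, leaving 36 for the final iteration, which matches the first case described in the footnote. Applied repeatedly to the shrinking remainder, the value it returns never increases.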