[llvm-dev] [RFC] Vector Predication

Bruce Hoult via llvm-dev llvm-dev at lists.llvm.org
Fri Feb 1 01:18:49 PST 2019


On Thu, Jan 31, 2019 at 11:53 PM Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> --- crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
>
> On Thu, Jan 31, 2019 at 10:22 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> >
> > We're in-progress designing a RISC-V extension (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html) that would have variable-length vectors of short vectors (1 to 4):
> > <VL x <4 x float>>
> > where each predicate bit masks out a whole short vector. We're using this extension to vectorize graphics code where variables in the pre-vectorization code are short vectors.
> > So, vectorizing code like:
> > for(int i = 0; i < 1000; i++)
> > {
> >     vec4 color = colors[i];
> >     vec3 normal = normals[i];
> >     color.rgb *= fmax(0.0, dot(normal, lightdir));
> >     colors[i] = color;
> > }
> >
> > I'm planning on passing already vectorized code into LLVM and using LLVM as a backend for optimization and JIT code generation.
> >
> > Do you think the EVL proposal would support an ISA like this as it's currently
> > written (by pattern matching on predicate expansion and vector-length
> > multiplication)?
>
> whilst it may be tempting to suggest that a solution is to multiply up the bits in the predicate (into groups of 3 or 4), the problem with that is that if there are operations that require vec3 or vec4 as operands interspersed with predicated operations that do not, that realistically implies a need for two separate predicate registers, otherwise cycles are wasted swapping predicates OR it implies that the architecture allows two separate predicate registers to be selected.
>
> consequently, it would be much, much better to be able to have a single bit of a predicate apply to the entire vec3 or vec4 type, on each outer loop.

This situation can be handled easily in the standard RISC-V vector extension. You'd do something like...

vsetvli t0, a0, vsew128,vnreg8,vdiv4

... to configure the vector unit to provide eight vector register variables divided into a standard element width of 128 bits (some instructions will widen or narrow one step to/from 64 bits or 256 bits), and then dividing each 128 bit element into 4 parts.

Arithmetic/logical/shift will happen on 32 bit elements, but predication and loads and stores (including strided or scatter/gather) will operate on 128 bit elements.
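To make the split between predication granularity and arithmetic granularity concrete, here is a minimal scalar sketch in C of the loop from the quoted message. It is a reference model of the semantics only, not vector code. The `vec4`/`vec3` structs, `dot3`, and `shade_masked` are hypothetical names chosen for illustration, and packing the predicate into a 64-bit mask is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not part of any RISC-V or LLVM API. */
typedef struct { float x, y, z, w; } vec4;   /* one 128-bit element */
typedef struct { float x, y, z; } vec3;

static float dot3(vec3 a, vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Scalar model of one predicated operation over vl elements.
 * Bit i of mask enables or disables the whole vec4 at index i;
 * the multiplies inside act on its 32-bit sub-elements. */
static void shade_masked(vec4 *colors, const vec3 *normals, vec3 lightdir,
                         size_t vl, uint64_t mask) {
    assert(vl <= 64);  /* one mask bit per element in this sketch */
    for (size_t i = 0; i < vl; i++) {
        if (!((mask >> i) & 1u))
            continue;  /* predicate off: all four lanes left untouched */
        float d = dot3(normals[i], lightdir);
        float k = d > 0.0f ? d : 0.0f;  /* fmax(0.0, d) for non-NaN d */
        colors[i].x *= k;  /* color.rgb *= k; alpha (w) is not scaled */
        colors[i].y *= k;
        colors[i].z *= k;
    }
}
```

The point of the model is that a single mask bit gates all four 32-bit lanes of an element, so no predicate expansion is needed.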

[I just made up "vnreg8" as an alias for the standard "vlmul4" because "vlmul4,vdiv4" might look confusing. Either way it means to put 0b10 into bits [1:0] of the vtype CSR specifying that the 32 vector registers should be ganged into 8 groups each 4x longer than standard because (I'm assuming) we need more than four vector registers in this loop, but no more than eight]
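The register-grouping arithmetic in that note can be sketched in C as follows. The helper names are hypothetical, and the encoding rule (field value equals log2 of the multiplier) is an assumption; it is consistent with the 0b10 value given above for a multiplier of 4, but the draft spec remains the authority.

```c
#include <assert.h>

enum { NUM_VREGS = 32 };  /* vector registers, per the note above */

/* Assumed encoding: field value = log2(lmul), so 1,2,4,8 -> 0,1,2,3. */
static unsigned vlmul_field(unsigned lmul) {
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    unsigned field = 0;
    while ((1u << field) < lmul)
        field++;
    return field;
}

/* Ganging lmul registers together leaves this many register groups. */
static unsigned register_groups(unsigned lmul) {
    return NUM_VREGS / lmul;
}
```

With a multiplier of 4 this gives a field value of 2 (0b10) and 8 register groups, matching the "vnreg8" alias.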


