[llvm-dev] how experimental are the llvm.experimental.vector.reduce.* functions?

Andrew Kelley via llvm-dev llvm-dev at lists.llvm.org
Sat Feb 9 12:56:25 PST 2019


On 2/9/19 2:05 PM, Craig Topper wrote:

Something like this should work, I think.

; ModuleID = 'test.ll'
source_filename = "test.ll"

define void @entry(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %x) {
Entry:
  %tmp = load <4 x i32>, <4 x i32>* %a, align 16
  %tmp1 = load <4 x i32>, <4 x i32>* %b, align 16
  %tmp2 = add <4 x i32> %tmp, %tmp1
  %tmpsign = icmp slt <4 x i32> %tmp, zeroinitializer
  %tmp1sign = icmp slt <4 x i32> %tmp1, zeroinitializer
  %sumsign = icmp slt <4 x i32> %tmp2, zeroinitializer
  %signsequal = icmp eq <4 x i1> %tmpsign, %tmp1sign
  %summismatch = icmp ne <4 x i1> %sumsign, %tmpsign
  %overflow = and <4 x i1> %signsequal, %summismatch
  %tmp5 = bitcast <4 x i1> %overflow to i4
  %tmp6 = icmp ne i4 %tmp5, 0
  br i1 %tmp6, label %OverflowFail, label %OverflowOk

OverflowFail:                                     ; preds = %Entry
  tail call fastcc void @panic()
  unreachable

OverflowOk:                                       ; preds = %Entry
  store <4 x i32> %tmp2, <4 x i32>* %x, align 16
  ret void
}

declare fastcc void @panic()

Thanks! I was able to get it working with your hint:

%tmp5 = bitcast <4 x i1> %overflow to i4

(Thanks also to LebedevRI who pointed this out on IRC)
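Incidentally, since that's what this thread started with: the <4 x i1> could presumably also be collapsed with llvm.experimental.vector.reduce.or instead of the bitcast trick. A rough sketch, untested; the exact mangled name varies between LLVM releases (older ones also include the scalar result type in the name), and @any_lane_set is just an illustrative wrapper:

declare i1 @llvm.experimental.vector.reduce.or.v4i1(<4 x i1>)

define i1 @any_lane_set(<4 x i1> %overflow) {
  ; OR every lane together: true if any lane's overflow bit is set.
  %any = call i1 @llvm.experimental.vector.reduce.or.v4i1(<4 x i1> %overflow)
  ret i1 %any
}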

Until LLVM 9, when the llvm.*.with.overflow.* intrinsics gain vector support, here's what I ended up with:

  %a = alloca <4 x i32>, align 16
  %b = alloca <4 x i32>, align 16
  %x = alloca <4 x i32>, align 16
  store <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i32>* %a, align 16, !dbg !55
  store <4 x i32> <i32 5, i32 6, i32 7, i32 8>, <4 x i32>* %b, align 16, !dbg !56
  %0 = load <4 x i32>, <4 x i32>* %a, align 16, !dbg !57
  %1 = load <4 x i32>, <4 x i32>* %b, align 16, !dbg !58
  ; Widen each lane by one bit so the add itself cannot wrap.
  %2 = sext <4 x i32> %0 to <4 x i33>, !dbg !59
  %3 = sext <4 x i32> %1 to <4 x i33>, !dbg !59
  %4 = add <4 x i33> %2, %3, !dbg !59
  ; Truncate back, re-extend, and compare: a mismatch means that lane overflowed.
  %5 = trunc <4 x i33> %4 to <4 x i32>, !dbg !59
  %6 = sext <4 x i32> %5 to <4 x i33>, !dbg !59
  %7 = icmp ne <4 x i33> %4, %6, !dbg !59
  ; Collapse the per-lane overflow bits into a single branch condition.
  %8 = bitcast <4 x i1> %7 to i4, !dbg !59
  %9 = icmp ne i4 %8, 0, !dbg !59
  br i1 %9, label %OverflowFail, label %OverflowOk, !dbg !59

Idea being: sign-extend and do the operation with more bits, truncate to get the result, then re-extend the truncated result and check whether it matches the pre-truncation value; any lane where it doesn't match overflowed.
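For comparison, once the with.overflow intrinsics do accept vectors (LLVM 9, as mentioned above), I'd expect the whole check to collapse to roughly the following. This is only a sketch of the expected shape, not something I've run; @checked_add and @panic are illustrative names:

declare { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32>, <4 x i32>)
declare fastcc void @panic()

define <4 x i32> @checked_add(<4 x i32> %a, <4 x i32> %b) {
Entry:
  ; The intrinsic returns the wrapped sum plus a per-lane overflow mask.
  %res = call { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32> %a, <4 x i32> %b)
  %sum = extractvalue { <4 x i32>, <4 x i1> } %res, 0
  %ov = extractvalue { <4 x i32>, <4 x i1> } %res, 1
  ; Collapse the per-lane overflow bits exactly as above.
  %ovbits = bitcast <4 x i1> %ov to i4
  %any = icmp ne i4 %ovbits, 0
  br i1 %any, label %OverflowFail, label %OverflowOk

OverflowFail:                                     ; preds = %Entry
  tail call fastcc void @panic()
  unreachable

OverflowOk:                                       ; preds = %Entry
  ret <4 x i32> %sum
}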

The widening approach works pretty well unless the vector's integer size is as big as or larger than the native vector register. Here's a quick performance test:

https://gist.github.com/andrewrk/b9734f9c310d8b79ec7271e7c0df4023

Summary: safety-checked integer addition with no optimizations

<4 x i32>: scalar = 893 MiB/s vector = 3.58 GiB/s

<16 x i128>: scalar = 3.6 GiB/s vector = 2.5 GiB/s



