AVX2 popcount regression
user16251 February 11, 2024, 4:37pm 1
With -march=x86-64-v3, clang compiles the code
#include <inttypes.h>

int popcount8(uint64_t data[8]) {
  int count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}
to some fairly complicated AVX2 code (compiler explorer), while gcc just issues eight popcnt instructions (with an extra instruction to clear each output register, working around an old false-dependency bug). At least on my machine (AMD Ryzen 7 3700U) the popcnt version is 3x faster in a simple benchmark:
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

extern int popcount8_gcc(uint64_t *);   /* same C source, compiled by gcc */
extern int popcount8_clang(uint64_t *); /* same C source, compiled by clang */

int main() {
  uint64_t xs[8];
  FILE *fd = fopen("/dev/urandom", "rb");
  if (fd == NULL || fread(xs, 1, 64, fd) != 64) exit(1);
  for (long i = 0; i < 1000000000; ++i) {
    popcount8_clang(xs); /* swap in popcount8_gcc to time the other build */
  }
  return 0;
}
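The timings quoted later in the thread were presumably collected with an external timer; a self-timing variant of this harness (a sketch assuming POSIX clock_gettime, with the results accumulated into a sink so the calls cannot be optimized away) might look like:

#include <inttypes.h>
#include <stdio.h>
#include <time.h>

extern int popcount8_clang(uint64_t *);

int main(void) {
  uint64_t xs[8] = {0x0123456789abcdefULL}; /* contents don't affect throughput */
  struct timespec t0, t1;
  long long sink = 0;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < 1000000000L; ++i)
    sink += popcount8_clang(xs); /* accumulate so the calls can't be elided */
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("%.2fs (sink=%lld)\n", secs, sink);
  return 0;
}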
llvm-mca in some cases estimates that the gcc alternative is faster. Is there any way to encourage LLVM to just use popcnt with -march=x86-64-v3?
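One experiment worth noting (a sketch, not a confirmed fix) is to ask clang to keep the loop scalar. The pragma below is standard clang syntax, but its hint may be lost once the loop is fully unrolled, in which case compiling with -fno-vectorize -fno-slp-vectorize is the blunter alternative:

#include <inttypes.h>

/* Sketch: request scalar codegen for this loop. Whether clang still
   SLP-vectorizes the unrolled body varies by target and version. */
int popcount8_scalar(uint64_t data[8]) {
  int count = 0;
#pragma clang loop vectorize(disable) interleave(disable)
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}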
RKSimon February 12, 2024, 11:09am 2
It looks like the issue isn’t the vectorized popcount expansion but the truncation feeding the reduction:
define i32 @popcount8(ptr nocapture noundef readonly %data) {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = trunc <8 x i64> %1 to <8 x i32>
  %3 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %2)
  ret i32 %3
}

declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>) #1
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>) #1
If we reduce in i64 and only truncate the final scalar instead, avoiding the vector truncation entirely, we get a notable boost:
define i32 @popcount8(ptr nocapture noundef readonly %data) {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = tail call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %1)
  %3 = trunc i64 %2 to i32
  ret i32 %3
}

declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>) #1
declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>) #1
RKSimon February 12, 2024, 11:12am 3
Please can you try this instead:
#include <inttypes.h>

uint64_t popcount8(uint64_t data[8]) {
  uint64_t count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}
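The wider accumulator keeps the whole reduction in i64, so it should produce the second IR form above. A quick agreement check between the two variants, sketched with hypothetical symbol names for the two builds:

#include <assert.h>
#include <inttypes.h>

extern int popcount8_narrow(uint64_t data[8]);    /* int accumulator (original) */
extern uint64_t popcount8_wide(uint64_t data[8]); /* uint64_t accumulator */

int main(void) {
  uint64_t xs[8] = {0, ~0ULL, 1, 0x8000000000000000ULL,
                    42, 7, 0xffULL << 32, 3};
  assert((uint64_t)popcount8_narrow(xs) == popcount8_wide(xs));
  return 0;
}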
RKSimon February 12, 2024, 1:17pm 4
user16251 February 13, 2024, 2:26am 5
Thanks for looking into this. The non-truncating version doesn't appear to make a difference in runtime on my machine (3.09s for gcc, 9.41s for clang). llvm-mca reports that the clang version will take 51.1 cycles per iteration and that the gcc alternative (8 popcnt and 7 add instructions) will take 17.9 cycles.
RKSimon February 13, 2024, 8:03am 6
Please can you post a godbolt link to your mca codegen?
This is what I’m seeing: Compiler Explorer (x86-64-v3 defaults to Haswell model stats)
Sure, sorry about the late response. Here are some examples with the regression:
You can also see a discrepancy for Haswell with -iterations=1.