8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) by vamsi-parasa · Pull Request #14227 · openjdk/jdk

My question: this feature should improve performance several times over, but there doesn't seem to be much difference between OpenJDK 22.19 and JDK 8. Is there a problem with my configuration?

Hello @himichael,

Using your code snippet, please see the output below using the latest JDK and JDK 20 (which does not have AVX512 sort):

JDK 20 (without AVX512 sort):

java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort

elapse time -> 7501 ms

JDK 22 (with AVX512 sort):

java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort

elapse time -> 1607 ms

This shows a 4.66x speedup.

Hello, @vamsi-parasa
I used the commands you provided, but nothing seems to have changed.
The test procedure was as follows.

Using JDK 8 (without AVX512 sort):

/data/soft/jdk1.8.0_371/bin/javac JDKSort.java
/data/soft/jdk1.8.0_371/bin/java JDKSort

elapse time -> 15309 ms

Using OpenJDK 22.19 (with AVX512 sort):

/data/soft/jdk-22/bin/javac JDKSort.java
/data/soft/jdk-22/bin/java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
CompileCommand: CompileThresholdScaling java/util/DualPivotQuicksort.sort double CompileThresholdScaling = 0.000100

elapse time -> 11687 ms

Not much seems to have changed.

My JDK info:
OpenJDK 22.19:

/data/soft/jdk-22/bin/java -version
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)

JDK 8:

/data/soft/jdk1.8.0_371/bin/java -version
java version "1.8.0_371"
Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)

I also tested Intel's x86-simd-sort. My code is as follows:

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
#include "src/avx512-32bit-qsort.hpp"

int main() {
    // 100 million records
    const int size = 100000000;
    std::vector<int> random_array(size);

    for (int i = 0; i < size; ++i) {
        random_array[i] = rand();
    }

    auto start_time = std::chrono::steady_clock::now();

    avx512_qsort(random_array.data(), size);

    auto end_time = std::chrono::steady_clock::now();
    auto elapse_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    std::cout << "elapse time -> " << elapse_time << " ms" << std::endl;
    return 0;
}

Compile command:

g++ -o sort -O3 -mavx512f -mavx512dq sort.cpp

elapse time -> 1151 ms
That is an order-of-magnitude improvement over the Java results above.
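The C++ timing above has no scalar baseline from the same machine, so it is hard to tell how much of the gap to the Java runs comes from AVX-512 itself. A minimal sketch of such a baseline follows, assuming plain std::sort is an acceptable stand-in for a non-vectorized sort. The helper names make_random_array and time_std_sort_ms are hypothetical and are not part of x86-simd-sort.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// Hypothetical helper: fill a vector with pseudo-random non-negative ints.
// A fixed seed makes the input repeatable between runs.
std::vector<int> make_random_array(std::size_t size, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, std::numeric_limits<int>::max());
    std::vector<int> data(size);
    for (auto& value : data) {
        value = dist(gen);
    }
    return data;
}

// Hypothetical helper: time one std::sort call and return elapsed milliseconds.
// The vector is sorted in place, mirroring how avx512_qsort is called above.
long long time_std_sort_ms(std::vector<int>& data) {
    auto start_time = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());
    auto end_time = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();
}
```

Calling make_random_array(100000000, 42) and passing the result to time_std_sort_ms from a main function, built with the same -O3 flag, would give a number that can be set next to the 1151 ms measured for avx512_qsort.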

Here is my cpu information:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel Xeon Processor (Skylake, IBRS)
Stepping: 4
CPU MHz: 2394.374
BogoMIPS: 4788.74
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 md_clear spec_ctrl

lscpu | grep avx

The following instructions are supported:

avx
avx2
avx512f
avx512dq
avx512cd
avx512bw
avx512vl
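The same list can be cross-checked from inside a program rather than from lscpu. The sketch below assumes GCC or Clang on x86, since it relies on the compiler builtins __builtin_cpu_init and __builtin_cpu_supports. The function names avx512_features and supports_build_flags are hypothetical. It reports what the CPU exposes to the process and says nothing about which code path a JVM actually selects.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: report the AVX-512 subsets listed above for the running CPU.
// __builtin_cpu_supports needs a string literal, so each feature is queried explicitly.
std::vector<std::pair<std::string, bool>> avx512_features() {
    __builtin_cpu_init();  // initialize the compiler's CPU feature data before querying
    return {
        {"avx512f",  __builtin_cpu_supports("avx512f")  != 0},
        {"avx512dq", __builtin_cpu_supports("avx512dq") != 0},
        {"avx512cd", __builtin_cpu_supports("avx512cd") != 0},
        {"avx512bw", __builtin_cpu_supports("avx512bw") != 0},
        {"avx512vl", __builtin_cpu_supports("avx512vl") != 0},
    };
}

// Hypothetical helper: true when both subsets enabled by the build flags
// used in this thread (-mavx512f -mavx512dq) are available at run time.
bool supports_build_flags() {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512dq");
}
```

Printing each pair from avx512_features() inside the KVM guest would confirm whether the hypervisor passes these flags through to user processes.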