optimizing acmp in L-World (original) (raw)
Sergey Kuksenko sergey.kuksenko at oracle.com
Thu Jul 26 01:14:51 UTC 2018
- Previous message (by thread): hg: valhalla/valhalla: 8207788: [lworld] multinewarray bytecode should check for ValueType inconsistency
- Next message (by thread): optimizing acmp in L-World
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I've measured new Valhalla acmp code on legacy (non-value types code). Performance regression of acmp operation is 10%-20% comparing to before valhalla world. It was measured on microbenchmarks and looks like it shouldn't be significantly visible on real apps except some corner cases. I to suggest to a small improvement which make sense to try.
I really that trick:
bool newacmp(ptr x, ptr y) { if (x == null) return y == null; int bit = (x->classinheader & QTYPEBIT); ptr x1 = x | swarbroadcastbit(bit != 0); return oldacmp(x1, y); } int swarbroadcastbit(bool z) { return z ? -1 : 0; // use ALU ops } And right now the code checking and mixing if class is value looks like:
mov 0x8(%rcx),%r10d // load class ptr shr $0x3,%r10 and $0x1,%r10 // here r10 is 1 if value, 0 if not value add %r10,%rcx // mix with value bit
Suppose there is a branch free technique for implementing isvalueptr With the current is_value_ptr scheme I want to suggest branch free technique and return to initial idea of codegeneration So the logic looks like this, and the new semantics are in the second "if" statement:
bool newacmp(ptr x, ptr y) { if (!oldacmp(x, y)) return false; // x != y => false if (isvalueptr(x)) return false; // x.isvalue => false return true; } bool oldacmp(ptr x, ptr y) { return x == y; // cmpq etc. } I'd rather rewrite to:
bool new_acmp(ptr x, ptr y) { if (!old_acmp(x, y)) return false; // x != y => false return x!=null && !is_value_ptr(x); }
and the following will get us required 0 and 1 from this
mov 0x8(%rcx),%r10d // load class ptr shr $0x3,%r10 and $0x1,%r10 // here r10 is 1 if value, 0 if not value xor $0x1,%r10 // here r10 false if value, true if not value.
Checking currently generated code I'd split it into traces and count asm instructions, and count asm instructions which could new in suggested codegeneration: (doesn't count unconditional branches)
case x==null & y==null: current code: 3 ALU ops + 2 branches suggested code: 3 ALU ops + 2 branches
case x==null & y!=null: current code: 3 ALU ops + 2 branches suggested code: 2 ALU ops + 1 branches
case x==y: current code: 6 ALU ops + 2 branches + 1 load from memory suggested code: 5 ALU ops + 2 branches + 1 load from memory
case x!=y: current code: 6 ALU ops + 2 branches + 1 load from memory suggested code: 2 ALU ops + 1 branch
In some cases acmp cost is the same as in before valhalla world (2 ALU ops + 1 branch)
The most important fact here is not a quite small decrease of ALU instructions, which honestly won't be visible even on microbenchmarks, but the fact in case of x!=y we eliminate load from memory which even in hit to L1 cache will cost as all other instructions and much more in case of cache miss.
The final suggested code (simplified):
cmp rx,ry je equal xor eax,eax jmp end
equal: test rx,rx je is_null mov klass ptr from memory to eax shr 3,eax and 1,eax xor 1,eax jmp end
is_null: mov 1,eax jmp end
end: // eax contains result
- Previous message (by thread): hg: valhalla/valhalla: 8207788: [lworld] multinewarray bytecode should check for ValueType inconsistency
- Next message (by thread): optimizing acmp in L-World
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]