Strange branching performance (original) (raw)
Martin Grajcar maaartinus at gmail.com
Sat Feb 15 03:06:15 PST 2014
- Previous message: Strange branching performance
- Next message: Strange branching performance
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Vladimir,
On Fri, Feb 14, 2014 at 7:46 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com
wrote:
Second optimization which converts if (P == Q) { X+Y } to data flow
only:
cmp RDX, R9 # caddcmpEQMask seteq RDX movzb RDX, RDX add RAX, RDX The code above is for increment: if (P == Q) { X+1 } and the direction is from right to left operand.
I hope I can get used to RTL soon. It's perfectly clear now.
For general case it has additional instructions before add: neg RDX and RDX, RCX
I see. The choice of +1 rather then -1 for true is rather unfortunate. And so is the operand size for SETcc.
I'm not sure about the above snippet. If it's counting only, then I'd
imagine doing just
cmp RDX, R9 adc $0, RAX Equality test does not set carry flag. You code is for if (P < Q) { X+1 }
Yes, this was actually intentional to show the optimization I had in mind, but you're doing it already. So forget it.
My manually written assembly runs in 430 (it looks like we're using the
same units and my computer is slightly slower) and it looks like this:
"movl %edi, %r15d\n" // i+0 "andl %esi, %r15d\n" // (i+0) & mask "addl $-1, %r15d\n" // carry = ((i+0) & mask) ? 1 : 0 "adcl $0, %eax\n" // result += carry "leal 1(%edi), %r15d\n" // (i+1) "andl %esi, %r15d\n" // (i+1) & mask "addl $-1, %r15d\n" // carry = ((i+1) & mask) ? 1 : 0 "adcl $0, %eax\n" // result += carry
Lea instruction could be bottleneck because it use address unit.
Good to know. Whatever I tried I can't beat the BlockLayoutByFrequency. It's logical as correctly predicted not-taken branch is sort of free.
But I'm pretty close: 0.46 vs 0.40 seconds for limit=1e9 and mask=15. Any nontrivial unpredictability would make the non-branching solution win.
An idea: What about considering all branches dependent on array loads as rather unpredictable and lower the BlockLayoutByFrequency for them? It's just a guess but it would allow for both benchmarks to be fast and it will be right more often than not.
Regards, Martin. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20140215/2c360a35/attachment.html
- Previous message: Strange branching performance
- Next message: Strange branching performance
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the hotspot-compiler-dev mailing list