Strange branching performance

Martin Grajcar maaartinus at gmail.com
Sat Feb 15 03:06:15 PST 2014


Hi Vladimir,

On Fri, Feb 14, 2014 at 7:46 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:

> Second optimization which converts if (P == Q) { X+Y } to data flow
> only:
>
>     cmp     RDX, R9    # caddcmpEQMask
>     seteq   RDX
>     movzb   RDX, RDX
>     add     RAX, RDX
>
> The code above is for increment: if (P == Q) { X+1 } and the direction
> is from right to left operand.

I hope I can get used to RTL soon. It's perfectly clear now.

> For general case it has additional instructions before add:
>
>     neg     RDX
>     and     RDX, RCX

I see. The choice of +1 rather than -1 for true is rather unfortunate. And so is the operand size for SETcc.
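To make the data-flow form concrete, here is a minimal Java sketch of what the quoted sequences compute. The class and method names are hypothetical and chosen only for illustration; this is not HotSpot's code, and the JIT is free to compile it differently. It shows why a 0/1 result needs the extra `neg` before it can be used as a mask, which is the point of the +1 versus -1 remark.

```java
// Hypothetical names; a sketch of the data-flow form, not HotSpot's implementation.
final class CaddSketch {
    // Returns 1 if p == q, else 0, without a branch
    // (plays the role of cmp + seteq + movzb).
    static long eqBit(long p, long q) {
        long d = p ^ q;                 // zero exactly when p == q
        return ((d | -d) >>> 63) ^ 1L;  // sign bit of (d | -d) is 1 iff d != 0
    }

    // if (p == q) x += 1: the 0/1 bit can be added directly.
    static long incIfEqual(long x, long p, long q) {
        return x + eqBit(p, q);
    }

    // if (p == q) x += y: negate the 0/1 bit into a 0/-1 mask (neg),
    // keep y or 0 (and), then add.
    static long addIfEqual(long x, long y, long p, long q) {
        long mask = -eqBit(p, q);
        return x + (y & mask);
    }

    public static void main(String[] args) {
        System.out.println(addIfEqual(10, 5, 7, 7)); // operands equal, y is added
        System.out.println(addIfEqual(10, 5, 7, 8)); // operands differ, x unchanged
    }
}
```

Had the comparison produced -1 for true, the mask would be available without the negation, at the cost of a subtraction in the increment case.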

>> I'm not sure about the above snippet. If it's counting only, then I'd
>> imagine doing just
>>
>>     cmp     RDX, R9
>>     adc     $0, RAX
>
> Equality test does not set carry flag. Your code is for if (P < Q) { X+1 }

Yes, this was actually intentional to show the optimization I had in mind, but you're doing it already. So forget it.
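For reference, a small Java sketch of what `cmp` followed by `adc $0` computes. After `cmp p, q` the carry flag is the borrow out of p - q, which is set exactly when p is below q as an unsigned value and stays clear when they are equal. The names below are hypothetical, and the borrow expression is the standard bit-twiddling formula rather than anything taken from the thread.

```java
// Hypothetical names; models the carry flag, not actual generated code.
final class CarrySketch {
    // Borrow out of the 64-bit subtraction p - q, i.e. the carry flag after
    // "cmp p, q": 1 when p < q as unsigned values, else 0 (also 0 when p == q).
    static long borrow(long p, long q) {
        return ((~p & q) | ((~p | q) & (p - q))) >>> 63;
    }

    // Models "cmp p, q; adc $0, x": x plus the carry flag.
    static long addCarry(long x, long p, long q) {
        return x + borrow(p, q);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 2L, -1L, Long.MIN_VALUE, Long.MAX_VALUE};
        for (long p : samples) {
            for (long q : samples) {
                long expected = Long.compareUnsigned(p, q) < 0 ? 1 : 0;
                if (borrow(p, q) != expected) {
                    throw new AssertionError(p + " vs " + q);
                }
            }
        }
        System.out.println("borrow matches unsigned less-than on all samples");
    }
}
```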

>> My manually written assembly runs in 430 (it looks like we're using the
>> same units and my computer is slightly slower) and it looks like this:
>>
>>     "movl %edi, %r15d\n"     // i+0
>>     "andl %esi, %r15d\n"     // (i+0) & mask
>>     "addl $-1, %r15d\n"      // carry = ((i+0) & mask) ? 1 : 0
>>     "adcl $0, %eax\n"        // result += carry
>>     "leal 1(%edi), %r15d\n"  // (i+1)
>>     "andl %esi, %r15d\n"     // (i+1) & mask
>>     "addl $-1, %r15d\n"      // carry = ((i+1) & mask) ? 1 : 0
>>     "adcl $0, %eax\n"        // result += carry
>
> Lea instruction could be bottleneck because it uses address unit.

Good to know. Whatever I try, I can't beat BlockLayoutByFrequency. That's logical, as a correctly predicted not-taken branch is more or less free.

But I'm pretty close: 0.46 vs 0.40 seconds for limit=1e9 and mask=15. Any nontrivial unpredictability would make the non-branching solution win.
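For readers following along, the two variants being compared can be sketched in Java roughly as below. This is only an illustration under assumptions: the loop shape is inferred from the comments in the assembly (count the values of i for which `(i & mask)` is non-zero), the class and method names are hypothetical, and nothing here guarantees that the JIT emits the `add`/`adc` sequence shown earlier.

```java
// Hypothetical names; the loop shape is an assumption inferred from the asm comments.
final class CountSketch {
    // Branching form: one conditional increment per iteration.
    static int countBranching(int limit, int mask) {
        int result = 0;
        for (int i = 0; i < limit; i++) {
            if ((i & mask) != 0) {
                result++;
            }
        }
        return result;
    }

    // Data-flow form: (v | -v) >>> 31 is 1 iff v != 0, the same value the
    // carry flag holds after "addl $-1, v" in the hand-written assembly.
    static int countBranchless(int limit, int mask) {
        int result = 0;
        for (int i = 0; i < limit; i++) {
            int v = i & mask;
            result += (v | -v) >>> 31;
        }
        return result;
    }

    public static void main(String[] args) {
        int limit = 1000;
        int mask = 15;
        System.out.println(countBranching(limit, mask));
        System.out.println(countBranchless(limit, mask));
    }
}
```

With mask=15 the branch is not taken only once in every 16 iterations, in a perfectly regular pattern, so it is well predicted and the branching form has an easy job; a less regular condition is where the data-flow form would be expected to pay off.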

An idea: what about treating all branches that depend on array loads as rather unpredictable and lowering the BlockLayoutByFrequency for them? It's just a guess, but it would allow both benchmarks to be fast, and it would be right more often than not.

Regards, Martin.


More information about the hotspot-compiler-dev mailing list