Hi Vladimir,

On Fri, Feb 14, 2014 at 7:46 PM, Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
    Second optimization which converts if (P == Q) { X+Y } to data flow
    only:

            cmp     RDX, R9    # cadd_cmpEQMask
            seteq   RDX
            movzb   RDX, RDX
            add     RAX, RDX

The code above is for increment: if (P == Q) { X+1 } and the direction is from right to left operand.
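
For reference, a minimal C sketch of what that pattern computes (the
helper name is made up, and P, Q, X are just the names from the
pseudo-code above; this is my illustration, not HotSpot's source):

    #include <stdint.h>

    /* Branchless increment: X += (P == Q), which a compiler can lower
     * to the cmp / seteq / movzb / add sequence shown above. */
    static uint64_t cadd_eq(uint64_t X, uint64_t P, uint64_t Q) {
        return X + (uint64_t)(P == Q);
    }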

I hope I can get used to RTL soon. It's perfectly clear now.

For the general case it has additional instructions before the add:

               neg     RDX
               and     RDX, RCX
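
A minimal C sketch of that general case, under the same made-up naming:
the 0/1 result of the compare is negated into an all-zeros/all-ones mask
and ANDed with Y before the add.

    #include <stdint.h>

    /* Branchless conditional add: X += (P == Q) ? Y : 0.
     * (P == Q) yields 0 or 1; negating gives 0 or ~0, which masks Y,
     * matching the seteq / movzb / neg / and / add sequence. */
    static uint64_t cadd_eq_general(uint64_t X, uint64_t P, uint64_t Q,
                                    uint64_t Y) {
        uint64_t m = (uint64_t)0 - (uint64_t)(P == Q);
        return X + (m & Y);
    }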


I see. The choice of +1 rather than -1 for true is rather unfortunate. And so is the operand size for SETcc.

I'm not sure about the above snippet. If it's counting only, then I'd
imagine doing just

cmp     RDX, R9
adc     $0, RAX

An equality test does not set the carry flag. Your code is for if (P < Q) { X+1 }
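
In C terms (a sketch only, with P and Q assumed unsigned): the adc adds
the carry produced by the compare, so the pair accumulates an unsigned
"below" outcome rather than an equality.

    #include <stdint.h>

    /* cmp + adc counts X += (P < Q), not X += (P == Q). */
    static uint64_t count_below(uint64_t X, uint64_t P, uint64_t Q) {
        return X + (P < Q ? 1u : 0u);
    }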

Yes, this was actually intentional to show the optimization I had in mind, but you're doing it already. So forget it.
My manually written assembly runs in 430 (it looks like we're using the
same units and my computer is slightly slower) and it looks like this:

"movl %edi, %r15d\\n" // i+0
"andl %esi, %r15d\\n" // (i+0) & mask
"addl $-1, %r15d\\n" �// carry = ((i+0) & mask) ? 1 : 0
"adcl $0, %eax\\n" // result += carry

"leal 1(%edi), %r15d\\n" // (i+1)
"andl %esi, %r15d\\n" �// (i+1) & mask
"addl $-1, %r15d\\n" // carry = ((i+1) & mask) ? 1 : 0
"adcl $0, %eax\\n" �// result += carry


The LEA instruction could be a bottleneck because it uses the address unit.

Good to know. Whatever I tried, I can't beat the BlockLayoutByFrequency. It's logical, as a correctly predicted not-taken branch is sort of free.

But I'm pretty close: 0.46 vs 0.40 seconds for limit=1e9 and mask=15. Any nontrivial unpredictability would make the non-branching solution win.

An idea: what about considering all branches that depend on array loads as rather unpredictable and lowering the BlockLayoutByFrequency for them? It's just a guess, but it would allow both benchmarks to be fast, and it would be right more often than not.

Regards,
Martin.