Hi Vladimir,

On Fri, Feb 14, 2014 at 7:46 PM, Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
    Second optimization which converts if (P == Q) { X+Y } to data flow
    only:

            cmp     RDX, R9    # cadd_cmpEQMask
            seteq   RDX
            movzb   RDX, RDX
            add     RAX, RDX

The code above is for increment: if (P == Q) { X+1 } and the direction is from right to left operand.
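
For reference, a minimal C sketch of what that pattern computes (the
helper name is made up, and P, Q, X are just the names from the
pseudo-code above; this is my illustration, not HotSpot's source):

    #include <stdint.h>

    /* Branchless increment: X += (P == Q), which a compiler can lower
     * to the cmp / seteq / movzb / add sequence shown above. */
    static uint64_t cadd_eq(uint64_t X, uint64_t P, uint64_t Q) {
        return X + (uint64_t)(P == Q);
    }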

I hope I can get used to RTL soon. It's perfectly clear now.

For the general case it has additional instructions before the add:

               neg     RDX
               and     RDX, RCX
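
A minimal C sketch of that general case, under the same made-up naming:
the 0/1 result of the compare is negated into an all-zeros/all-ones mask
and ANDed with Y before the add.

    #include <stdint.h>

    /* Branchless conditional add: X += (P == Q) ? Y : 0.
     * (P == Q) yields 0 or 1; negating gives 0 or ~0, which masks Y,
     * matching the seteq / movzb / neg / and / add sequence. */
    static uint64_t cadd_eq_general(uint64_t X, uint64_t P, uint64_t Q,
                                    uint64_t Y) {
        uint64_t m = (uint64_t)0 - (uint64_t)(P == Q);
        return X + (m & Y);
    }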


I see. The choice of +1 rather than -1 for true is rather unfortunate. And so is the operand size for SETcc.

I'm not sure about the above snippet. If it's counting only, then I'd
imagine doing just

cmp     RDX, R9
adc     $0, RAX

An equality test does not set the carry flag. Your code is for if (P < Q) { X+1 }
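
In C terms (a sketch only, with P and Q assumed unsigned): the adc adds
the carry produced by the compare, so the pair accumulates an unsigned
"below" outcome rather than an equality.

    #include <stdint.h>

    /* cmp + adc counts X += (P < Q), not X += (P == Q). */
    static uint64_t count_below(uint64_t X, uint64_t P, uint64_t Q) {
        return X + (P < Q ? 1u : 0u);
    }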

Yes, this was actually intentional to show the optimization I had in mind, but you're doing it already. So forget it.
My manually written assembly runs in 430 (it looks like we're using the
same units and my computer is slightly slower) and it looks like this:

"movl %edi, %r15d\\n" // i+0
"andl %esi, %r15d\\n" // (i+0) & mask
"addl $-1, %r15d\\n" �// carry = ((i+0) & mask) ? 1 : 0
"adcl $0, %eax\\n" // result += carry

"leal 1(%edi), %r15d\\n" // (i+1)
"andl %esi, %r15d\\n" �// (i+1) & mask
"addl $-1, %r15d\\n" // carry = ((i+1) & mask) ? 1 : 0
"adcl $0, %eax\\n" �// result += carry


The LEA instruction could be a bottleneck because it uses the address unit.

Good to know. Whatever I tried, I can't beat the BlockLayoutByFrequency. It's logical, as a correctly predicted not-taken branch is sort of free.

But I'm pretty close: 0.46 vs 0.40 seconds for limit=1e9 and mask=15. Any nontrivial unpredictability would make the non-branching solution win.

An idea: what about considering all branches that depend on array loads as rather unpredictable and lowering the BlockLayoutByFrequency for them? It's just a guess, but it would allow both benchmarks to be fast, and it would be right more often than not.

Regards,
Martin.