(original) (raw)
Hi Vladimir,
On Fri, Feb 14, 2014 at 7:46 PM, Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
On Fri, Feb 14, 2014 at 7:46 PM, Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
� � Second optimization which converts if (P == Q) { X+Y } to data flow
� � only:
� � � � � � �cmp � � RDX, R9 # cadd\_cmpEQMask
� � � � � � �seteq � RDX
� � � � � � �movzb � RDX, RDX
� � � � � � �add � � RAX, RDX
The code above is for increment: if (P == Q) { X+1 } and the direction is from right to left operand.
I hope I can get used to RTL soon. It's perfectly clear now.
�
For general case it has additional instructions before add:
� � � � � � � �neg � � RDX
� � � � � � � �and � � RDX, RCX
I see. The choice of +1 rather then -1 for true is rather unfortunate. And so is the operand size for SETcc.
Equality test does not set carry flag. You code is for if (P < Q) { X+1 }I'm not sure about the above snippet. If it's counting only, then I'd
imagine doing just
cmp � � RDX, R9
adc � � $0, RAX
Yes, this was actually intentional to show the optimization I had in mind, but you're doing it already. So forget it.
�
Lea instruction could be bottleneck because it use address unit.My manually written assembly runs in 430 (it looks like we're using the
same units and my computer is slightly slower) and it looks like this:
"movl %edi, %r15d\\n" // i+0
"andl %esi, %r15d\\n" // (i+0) & mask
"addl $-1, %r15d\\n" �// carry = ((i+0) & mask) ? 1 : 0
"adcl $0, %eax\\n" // result += carry
"leal 1(%edi), %r15d\\n" // (i+1)
"andl %esi, %r15d\\n" �// (i+1) & mask
"addl $-1, %r15d\\n" // carry = ((i+1) & mask) ? 1 : 0
"adcl $0, %eax\\n" �// result += carry
Good to know. �Whatever I tried I can't beat the�BlockLayoutByFrequency.� It's logical as correctly predicted not-taken branch is sort of free.
But I'm pretty close: 0.46 vs 0.40 seconds for limit=1e9 and mask=15\. Any nontrivial unpredictability would make the non-branching solution win.
An idea: What about considering all branches dependent on array loads as rather unpredictable and lower the�BlockLayoutByFrequency for them? It's just a guess but it would allow for both benchmarks to be fast and it will be right more often than not.
Regards,
Martin.