Strange branching performance

Vladimir Kozlov vladimir.kozlov at oracle.com
Fri Feb 14 10:46:02 PST 2014


On 2/13/14 8:12 PM, Martin Grajcar wrote:

Hi Vladimir,

On Fri, Feb 14, 2014 at 2:03 AM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:

First optimization, which replaced (CmpI (AndI src mask) zero) with (TestI src mask), gave slight improvement in my test.

Second optimization which converts if (P == Q) { X+Y } to data flow only:

    cmp RDX, R9 # caddcmpEQMask
    seteq RDX
    movzb RDX, RDX
    add RAX, RDX

The code above is for the increment case, if (P == Q) { X+1 }, and the direction is from right to left operand. For the general case it has additional instructions before the add:

    neg RDX
    and RDX, RCX
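The instruction sequences above can be mirrored in Java source to show what "data flow only" means. This is a minimal sketch under assumptions, not the code HotSpot generates: the class and method names are hypothetical, and the sign-bit trick in eqBit merely stands in for the seteq/movzb pair.

```java
// Hypothetical illustration of "if (P == Q) X += Y" without a branch.
final class BranchlessAdd {
    private BranchlessAdd() {}

    // Returns 1 when p == q and 0 otherwise (the role of seteq + movzb).
    static long eqBit(long p, long q) {
        long d = p ^ q;                 // zero only when p == q
        // For any nonzero d, either d or -d has the sign bit set.
        return ((d | -d) >>> 63) ^ 1L;
    }

    // Increment case: if (p == q) x += 1, done as a plain add of 0 or 1.
    static long incrementIfEqual(long x, long p, long q) {
        return x + eqBit(p, q);
    }

    // General case: if (p == q) x += y.
    static long addIfEqual(long x, long y, long p, long q) {
        long mask = -eqBit(p, q);       // neg: 1 becomes all ones, 0 stays 0
        return x + (y & mask);          // and, then add
    }
}
```

For example, BranchlessAdd.addIfEqual(10, 5, 7, 7) yields 15, while BranchlessAdd.addIfEqual(10, 5, 7, 8) leaves the value at 10.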

gave an improvement for the JmhBranchingBenchmark test, even above the cmov code (cmov is still generated after 19%; that is a separate problem):

I'm not sure about the above snippet. If it's counting only, then I'd imagine doing just

    cmp RDX, R9
    adc $0, RAX

An equality test does not set the carry flag. Your code is for if (P < Q) { X+1 }
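The point about the carry flag can be modelled in Java. The sketch below assumes that cmp computes first operand minus second operand and sets the carry (borrow) flag exactly when the first is below the second as unsigned values. The names are hypothetical, and the ternary is only a model of the flag, not branch-free code.

```java
// Hypothetical model of "cmp a, b" followed by "adc $0, acc".
final class CarryAfterCompare {
    private CarryAfterCompare() {}

    // Carry flag after cmp a, b: set only when a < b (unsigned), never on equality.
    static int carryOfCmp(long a, long b) {
        return Long.compareUnsigned(a, b) < 0 ? 1 : 0;
    }

    // adc $0, acc adds just the carry to the accumulator.
    static long adcZero(long acc, long a, long b) {
        return acc + carryOfCmp(a, b);
    }
}
```

With equal operands the accumulator is unchanged, so the cmp/adc pair counts "below" cases rather than equalities.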

as I wrote in my last email a few minutes ago.

PERCENTAGE:   MEAN    MIN    MAX   UNIT
branchless:  8.511  8.475  8.547  ops/ms
5:           9.756  9.709  9.804  ops/ms
10:          9.709  9.709  9.709  ops/ms
15:          9.756  9.709  9.804  ops/ms
16:          9.709  9.709  9.709  ops/ms
17:          9.756  9.709  9.804  ops/ms
18:          9.756  9.709  9.804  ops/ms
19:          9.133  9.091  9.174  ops/ms
20:          9.133  9.091  9.174  ops/ms
30:          9.133  9.091  9.174  ops/ms
40:          9.133  9.091  9.174  ops/ms
50:          9.133  9.091  9.174  ops/ms

vs branches:

PERCENTAGE:   MEAN    MIN    MAX   UNIT
branchless:  8.511  8.475  8.547  ops/ms
5:           8.889  8.850  8.929  ops/ms
10:          5.716  5.618  5.814  ops/ms
15:          4.320  4.310  4.329  ops/ms
16:          4.175  4.167  4.184  ops/ms
17:          3.929  3.922  3.937  ops/ms
18:          9.133  9.091  9.174  ops/ms
19:          9.133  9.091  9.174  ops/ms
20:          9.133  9.091  9.174  ops/ms
30:          9.133  9.091  9.174  ops/ms
40:          9.133  9.091  9.174  ops/ms
50:          9.133  9.091  9.174  ops/ms

Unfortunately, for my test it gave a regression, but a smaller one than when using cmov:

    testi time: 687
vs base
    testi time: 402
vs cmov
    testi time: 785

My manually written assembly runs in 430 (it looks like we're using the same units and my computer is slightly slower) and it looks like this:

    "movl %edi, %r15d\n"    // i+0
    "andl %esi, %r15d\n"    // (i+0) & mask
    "addl $-1, %r15d\n"     // carry = ((i+0) & mask) ? 1 : 0
    "adcl $0, %eax\n"       // result += carry
    "leal 1(%edi), %r15d\n" // (i+1)
    "andl %esi, %r15d\n"    // (i+1) & mask
    "addl $-1, %r15d\n"     // carry = ((i+1) & mask) ? 1 : 0
    "adcl $0, %eax\n"       // result += carry

The lea instruction could be a bottleneck because it uses the address unit.
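The addl $-1 / adcl $0 pairs in the hand-written assembly rely on the fact that adding -1 (0xFFFFFFFF) to a 32-bit value produces a carry exactly when that value is nonzero. The following Java sketch emulates that carry with 64-bit arithmetic. The class name, the method names, and the loop shape are assumptions made for illustration; the actual benchmark loop is not shown in this message.

```java
// Hypothetical emulation of the add $-1 / adc $0 counting idiom.
final class CarryCount {
    private CarryCount() {}

    // Carry out of the 32-bit addition v + 0xFFFFFFFF: 1 when v != 0, else 0.
    static int carryOfAddMinusOne(int v) {
        long sum = (v & 0xFFFFFFFFL) + 0xFFFFFFFFL; // unsigned 32-bit add done in 64 bits
        return (int) (sum >>> 32);                  // bit 32 plays the role of the carry flag
    }

    // Counts how many of start, start+1, ..., start+n-1 have (i & mask) != 0.
    static int countMasked(int start, int n, int mask) {
        int result = 0;
        for (int k = 0; k < n; k++) {
            result += carryOfAddMinusOne((start + k) & mask); // result += carry
        }
        return result;
    }
}
```

For instance, CarryCount.countMasked(0, 8, 1) counts the odd values among 0 through 7 and returns 4.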

Unfortunately, removing the AND before TEST and the ADC optimization are mutually exclusive.

Yes.

Thanks, Vladimir

Regards, Martin.



More information about the hotspot-compiler-dev mailing list