Strange branching performance
Vladimir Kozlov vladimir.kozlov at oracle.com
Fri Feb 14 10:46:02 PST 2014
On 2/13/14 8:12 PM, Martin Grajcar wrote:
Hi Vladimir,
On Fri, Feb 14, 2014 at 2:03 AM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
First optimization, which replaced (CmpI (AndI src mask) zero) with (TestI src mask), gave slight improvement in my test. Second optimization which converts if (P == Q) { X+Y } to data flow only:

    cmp RDX, R9 # caddcmpEQMask
    seteq RDX
    movzb RDX, RDX
    add RAX, RDX
The code above is for increment: if (P == Q) { X+1 }, and the direction is from the right operand to the left operand. For the general case it has additional instructions before the add:

    neg RDX
    and RDX, RCX
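To make the data-flow conversion described above concrete, here is a minimal Java sketch that mirrors the instruction sequence step by step. The class and method names are hypothetical and chosen only for illustration; this is not the HotSpot implementation, and nothing here guarantees which instructions the JIT actually emits for it.

```java
// Illustrative sketch only; class and method names are hypothetical.
final class CondAddSketch {
    // Branching form: if (p == q) { x + y }
    static long branchy(long p, long q, long x, long y) {
        if (p == q) {
            return x + y;
        }
        return x;
    }

    // Data-flow form, following the instruction sequence one step at a time.
    static long branchless(long p, long q, long x, long y) {
        // cmp + seteq + movzb: materialize the comparison as 0 or 1.
        // Written as a ternary because Java has no boolean-to-integer conversion.
        long flag = (p == q) ? 1L : 0L;
        long mask = -flag;      // neg: 0 stays 0, 1 becomes all ones
        long addend = mask & y; // and: either 0 or y
        return x + addend;      // add
    }
}
```

In the increment case (y == 1) the neg and and steps are unnecessary, because the 0/1 flag is already the value to add.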
gave an improvement for the JmhBranchingBenchmark test, even above the cmov code (cmov is still generated after 19%; that is a separate problem):
I'm not sure about the above snippet. If it's counting only, then I'd imagine doing just

    cmp RDX, R9
    adc $0, RAX
An equality test does not set the carry flag. Your code is for if (P < Q) { X+1 }
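The point about the carry flag can be checked with a small model. The Java sketch below assumes the usual x86 rule that cmp a, b computes a - b and sets the carry flag when an unsigned borrow occurs, that is, when a is below b as an unsigned value. The class and method names are hypothetical.

```java
// Illustrative model of the flag behaviour; names are hypothetical.
final class CarryFlagSketch {
    // Carry flag after "cmp a, b" (which computes a - b):
    // set exactly when a < b, comparing both as unsigned 64-bit values.
    static boolean carryAfterCmp(long a, long b) {
        return Long.compareUnsigned(a, b) < 0;
    }

    // Models "cmp p, q" followed by "adc $0, x": x + 0 + carry.
    static long cmpThenAdc(long p, long q, long x) {
        return x + (carryAfterCmp(p, q) ? 1L : 0L);
    }
}
```

With equal operands the model leaves x unchanged, so the cmp/adc pair counts P < Q (unsigned) rather than P == Q.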
as I wrote in my last email a few minutes ago.

PERCENTAGE:      MEAN    MIN    MAX   UNIT
branchless:     8.511  8.475  8.547  ops/ms
5:              9.756  9.709  9.804  ops/ms
10:             9.709  9.709  9.709  ops/ms
15:             9.756  9.709  9.804  ops/ms
16:             9.709  9.709  9.709  ops/ms
17:             9.756  9.709  9.804  ops/ms
18:             9.756  9.709  9.804  ops/ms
19:             9.133  9.091  9.174  ops/ms
20:             9.133  9.091  9.174  ops/ms
30:             9.133  9.091  9.174  ops/ms
40:             9.133  9.091  9.174  ops/ms
50:             9.133  9.091  9.174  ops/ms

vs branches:

PERCENTAGE:      MEAN    MIN    MAX   UNIT
branchless:     8.511  8.475  8.547  ops/ms
5:              8.889  8.850  8.929  ops/ms
10:             5.716  5.618  5.814  ops/ms
15:             4.320  4.310  4.329  ops/ms
16:             4.175  4.167  4.184  ops/ms
17:             3.929  3.922  3.937  ops/ms
18:             9.133  9.091  9.174  ops/ms
19:             9.133  9.091  9.174  ops/ms
20:             9.133  9.091  9.174  ops/ms
30:             9.133  9.091  9.174  ops/ms
40:             9.133  9.091  9.174  ops/ms
50:             9.133  9.091  9.174  ops/ms

Unfortunately for my test it gave regression but smaller than when using cmov:

    testi time: 687
    vs base
    testi time: 402
    vs cmov
    testi time: 785

My manually written assembly runs in 430 (it looks like we're using the same units and my computer is slightly slower) and it looks like this:

    "movl %edi, %r15d\n"     // i+0
    "andl %esi, %r15d\n"     // (i+0) & mask
    "addl $-1, %r15d\n"      // carry = ((i+0) & mask) ? 1 : 0
    "adcl $0, %eax\n"        // result += carry
    "leal 1(%edi), %r15d\n"  // (i+1)
    "andl %esi, %r15d\n"     // (i+1) & mask
    "addl $-1, %r15d\n"      // carry = ((i+1) & mask) ? 1 : 0
    "adcl $0, %eax\n"        // result += carry
The lea instruction could be a bottleneck because it uses the address unit.
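The addl $-1 / adcl $0 pair in the hand-written listing works because adding -1 (0xFFFFFFFF) to a 32-bit value produces a carry out exactly when the value is nonzero. The Java sketch below models that arithmetic in 64 bits so the carry is visible as bit 32. It is only a model of the flag behaviour, with hypothetical names, and says nothing about performance.

```java
// Illustrative model of the add/adc counting trick; names are hypothetical.
final class NonZeroCarrySketch {
    // Carry out of the 32-bit addition "addl $-1, v": 1 if v != 0, else 0.
    static int carryOfAddMinusOne(int v) {
        // Perform the unsigned 32-bit addition in 64 bits.
        long wide = (v & 0xFFFFFFFFL) + 0xFFFFFFFFL;
        return (int) (wide >>> 32); // bit 32 holds the carry
    }

    // Models the two unrolled steps of the listing, for i and i + 1.
    static int countTwo(int i, int mask, int result) {
        result += carryOfAddMinusOne(i & mask);       // movl / andl / addl / adcl
        result += carryOfAddMinusOne((i + 1) & mask); // leal / andl / addl / adcl
        return result;
    }
}
```

For example, countTwo(1, 3, 0) adds one for each of 1 & 3 and 2 & 3, both nonzero, while countTwo(2, 1, 0) adds one only for 3 & 1.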
Unfortunately, removing the AND before the TEST and the ADC optimization are mutually exclusive.
Yes.
Thanks, Vladimir
Regards, Martin.
More information about the hotspot-compiler-dev mailing list