Strange branching performance

Vladimir Kozlov vladimir.kozlov at oracle.com
Thu Feb 13 17:03:57 PST 2014


The first optimization, which replaces (CmpI (AndI src mask) zero) with (TestI src mask), gave a slight improvement in my test.
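In source terms the matched IR pattern is just a masked-zero test. A minimal sketch (illustrative only, names hypothetical) of the source shape that produces it:

```java
// The IR pattern (CmpI (AndI src mask) zero) arises from code like this;
// the TestI form lets the JIT emit a single `test src, mask` instead of
// a separate and + cmp against zero.
public class MaskTest {
    public static boolean lowBitsClear(int src, int mask) {
        return (src & mask) == 0; // candidate for test-instruction fusion
    }

    public static void main(String[] args) {
        System.out.println(lowBitsClear(16, 15)); // prints true
        System.out.println(lowBitsClear(17, 15)); // prints false
    }
}
```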

The second optimization, which converts if (P == Q) { X+Y } into pure data flow:

     cmp     RDX, R9 # cadd_cmpEQMask
     seteq   RDX
     movzb   RDX, RDX
     add     RAX, RDX

gave an improvement for the JmhBranchingBenchmark test, even above the cmov code (cmov is still generated from 19% on - that is a separate problem):

PERCENTAGE:  MEAN   MIN    MAX    UNIT
branchless:  8.511  8.475  8.547  ops/ms
5:           9.756  9.709  9.804  ops/ms
10:          9.709  9.709  9.709  ops/ms
15:          9.756  9.709  9.804  ops/ms
16:          9.709  9.709  9.709  ops/ms
17:          9.756  9.709  9.804  ops/ms
18:          9.756  9.709  9.804  ops/ms
19:          9.133  9.091  9.174  ops/ms
20:          9.133  9.091  9.174  ops/ms
30:          9.133  9.091  9.174  ops/ms
40:          9.133  9.091  9.174  ops/ms
50:          9.133  9.091  9.174  ops/ms

vs branches:

PERCENTAGE:  MEAN   MIN    MAX    UNIT
branchless:  8.511  8.475  8.547  ops/ms
5:           8.889  8.850  8.929  ops/ms
10:          5.716  5.618  5.814  ops/ms
15:          4.320  4.310  4.329  ops/ms
16:          4.175  4.167  4.184  ops/ms
17:          3.929  3.922  3.937  ops/ms
18:          9.133  9.091  9.174  ops/ms
19:          9.133  9.091  9.174  ops/ms
20:          9.133  9.091  9.174  ops/ms
30:          9.133  9.091  9.174  ops/ms
40:          9.133  9.091  9.174  ops/ms
50:          9.133  9.091  9.174  ops/ms

Unfortunately, for my test it gave a regression, though a smaller one than when using cmov:

testi time: 687
vs base testi time: 402
vs cmov testi time: 785
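For illustration, the seteq-based data-flow form above can be mimicked in plain Java with a branch-free mask trick. This is a hypothetical sketch of the transformation's semantics, not the compiler's actual output:

```java
// Sketch: compute "(p == q) ? x + y : x" with no control flow, mirroring
// the cmp/seteq/movzb/add sequence. Names are illustrative only.
public class CondAddSketch {
    // (p ^ q) is zero iff p == q; for any int v, (v | -v) >>> 31 is
    // 0 iff v == 0, so eq ends up 1 when p == q and 0 otherwise
    // (the Java-level analogue of seteq + movzb).
    public static int condAdd(int p, int q, int x, int y) {
        int v = p ^ q;
        int eq = ((v | -v) >>> 31) ^ 1;
        return x + eq * y; // add 0 or y, no branch
    }

    public static void main(String[] args) {
        System.out.println(condAdd(3, 3, 10, 5)); // p == q: prints 15
        System.out.println(condAdd(3, 4, 10, 5)); // p != q: prints 10
    }
}
```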

Vladimir

On 2/13/14 1:52 PM, Vladimir Kozlov wrote:

On 2/13/14 12:32 AM, Martin Grajcar wrote:

> Hi Vladimir,
>
> below I'm mostly talking to myself... you know, learning by writing. It'd be nice if you could find something useful therein.
>
> On Thu, Feb 13, 2014 at 1:18 AM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
>
>> Hi Martin,
>>
>> The issue is more complicated than I thought. The code I pointed to before was added by me about 3 years ago for:
>>
>> 7097546: Optimize use of CMOVE instructions
>> https://bugs.openjdk.java.net/browse/JDK-7097546
>>
>> Changes were done to avoid a 2x performance hit with cmov for code like the next:
>>
>>     public static int test(int result, int limit, int mask) { // mask = 15
>>         for (int i = 0; i < limit; i++) {
>>             if ((i & mask) == 0) result++; // Non frequent
>>         }
>>         return result;
>>     }
>>
>> A cmov instruction has a big flaw - it requires an additional register.
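The conditional-increment example quoted above is easy to check in isolation. A minimal runnable version, with a hand-computed expected count for one input:

```java
// Runnable version of the loop discussed under JDK-7097546: increments
// result whenever (i & mask) == 0, i.e. on every (mask+1)-th iteration
// when mask+1 is a power of two, so the branch is rarely taken.
public class BranchTest {
    public static int test(int result, int limit, int mask) {
        for (int i = 0; i < limit; i++) {
            if ((i & mask) == 0) result++; // Non frequent
        }
        return result;
    }

    public static void main(String[] args) {
        // For mask = 15, hits occur at i = 0, 16, 32, 48 within [0, 64).
        System.out.println(test(0, 64, 15)); // prints 4
    }
}
```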

> I think you could save one register and two instructions. The generated code for the conditional increment seems to use BlockLayoutByFrequency and looks like
>
>     mov  %ebp,%r8d
>     and  %ebx,%r8d
>     test %r8d,%r8d   # not used anymore
>     je   0x00007fafdf77d907
>
> while simply
>
>     test %ebp,%ebx
>     je   0x00007fafdf77d907
>
> should do. I can imagine this doesn't fit into the intermediate representation used, but maybe it could be handled when mapping into instructions gets done?

We have such matching, but only if a constant or memory is used instead of a register (%ebx) in such a case:

    match(Set cr (CmpI (AndI src con) zero));

It is an oversight and very easy to fix. I will file an RFE.

> As this works for AND only, it may be too restrictive for real world examples. After switching BlockLayoutByFrequency off, the loop body is essentially this chunk of code repeated 16 times (with oldresult and newresult switching between iterations):
>
>     mov    %i, %tmp
>     add    $increment, %tmp
>     and    %mask, %tmp
>     mov    %oldresult, %newresult
>     inc    %newresult
>     test   %tmp,%tmp
>     cmovne %oldresult,%newresult
>
> Even when looking at the xxxresult only, each chunk seems to need two cycles, because of data dependencies (assuming mov+inc get fused into a single 3-operand microinstruction). This dependency could be eliminated, but there are still 7 instructions, which is a lot. Now I can see how branching out helps. By using LEA the instruction count can be reduced to 5, but without any visible speedup in my test.
>
> Can add with carry be used?
>
>     mov %i, %tmp
>     add $increment, %tmp
>     and %mask, %tmp
>     add $-1, %tmp
>     adc $0, %result
>
> This works only for counting non-zeros (or zeros after a simple transform or with SBB), but counting is pretty common. In my stupid hand-made assembly there's a nice speedup (from 0.7 ns/iteration to 0.4).

Very interesting suggestion. We are already doing something similar to that:

    // Check for simple conditional add pattern: "(P < Q) ? X+Y : X;"
    // To be profitable the control flow has to disappear; there can be no other
    // values merging here. We replace the test-and-branch with:
    // "((sgn(P-Q))&Y) + X". Basically, convert "(P < Q)" into 0 or -1 by
    // moving the carry bit from (P-Q) into a register with 'sbb EAX,EAX'.
    // Then convert Y to 0-or-Y and finally add.

We can add the (P != 0) case.

>> If the loop's body is complex, using cmov will result in register spilling - additional instructions. The performance hit could be higher than a branch misprediction.
>
> I guess there's no way to find it out early enough?

It could be done, since selection of cmov instructions happens when we know about the loop body, at least how many instructions are in it.

>> I am not sure how to proceed from here. I may do some benchmark testing to see the effects if cmov is used in more cases.
>
> I can see that CMOV is more expensive than I thought. Maybe after some benchmarking the threshold can be shifted a bit. For my benchmark, it should ideally be no more than maybe 5%; 10% would be acceptable and 18% is very bad.

20% was selected based on our benchmark runs. It affects general code block ordering and not just simple conditional increments. It is well known that you can always find a case which behaves terribly with the current VM settings. We can't satisfy everyone. Saying all that, your code optimization suggestions are very good and may help you even with the current settings.

> I guess I know what the biggest difference between my benchmark and yours is: predictability. In the snippet above, you can happily jump around as it's nearly free. In mine, you can't. I updated my benchmark with
>
>     double nextDouble = predictable ? (0.5 + sb.length() % 20) / 20 : random.nextDouble();
>     boolean shouldMatch = nextDouble < fractionMatching;
>
> and got these results:
> https://microbenchmarks.appspot.com/runs/d003427e-5a94-45b9-8a06-fb17d344ec17#r:scenario.benchmarkSpec.parameters.fractionMatching&c:scenario.benchmarkSpec.parameters.predictable
>
> Does the JVM collect any stats about branch predictability?
We only collect counters for how many times a branch bytecode is executed and how many times the branch is taken. Based on these counts we calculate the probability.

Regards,
Vladimir

> Regards, Martin.
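Both carry-flag idioms discussed above - the sbb-based "(P < Q) ? X+Y : X" rewrite and the adc-based non-zero counting - have straightforward Java-level analogues. A sketch under those assumptions (names hypothetical, not HotSpot code):

```java
// Java-level sketch of the two carry-flag idioms from the thread.
public class CarryTricks {
    // "(P < Q) ? X + Y : X" as "((sgn(P - Q)) & Y) + X": widening the
    // subtraction to long makes (p - q) >> 63 a clean 0-or-minus-1 mask
    // and avoids int overflow, playing the role of 'sbb EAX,EAX'.
    public static int condAddLess(int p, int q, int x, int y) {
        int mask = (int) (((long) p - q) >> 63); // -1 if p < q, else 0
        return x + (mask & y);
    }

    // The "add $-1, tmp; adc $0, result" idiom adds 1 exactly when
    // tmp != 0; ((tmp | -tmp) >>> 31) is that same 0-or-1 bit.
    public static int countNonZero(int result, int tmp) {
        return result + ((tmp | -tmp) >>> 31);
    }

    public static void main(String[] args) {
        System.out.println(condAddLess(1, 2, 10, 5)); // 1 < 2: prints 15
        System.out.println(condAddLess(2, 1, 10, 5)); // prints 10
        System.out.println(countNonZero(0, 7));       // prints 1
        System.out.println(countNonZero(0, 0));       // prints 0
    }
}
```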



More information about the hotspot-compiler-dev mailing list