type-recursive inlining oddity (original) (raw)

Remi Forax forax at univ-mlv.fr
Mon Sep 10 10:25:21 PDT 2012


On 09/10/2012 06:20 PM, John Rose wrote:

On Sep 10, 2012, at 5:08 AM, Aleksey Shipilev wrote:

What had happened in chain11? We compile the v-call in DelegatingIterator1.hasNext(); it appears that the receivers in i.hasNext() are either DelegatingIterator1 or EmptyIterator; Yes, I think that's the main factor. Dynamically monomorphic calls sites are cheapest, followed by dimorphic sites. There is a somewhat surprising bimorphic call in chain11. Meanwhile, chain12, though superficially more complex, is purely monomorphic. Let's lay the actual call graphs side by side: chain11: D1 -> D1 -> E chain12: D1 -> D2 -> E (I eliminated ::hasNext everywhere and abbreviated the class names.) But in terms of the profile-carrying bytecode we have to merge duplicate call graph nodes: chain11: D1 -> { E, D1 } chain12: D1 -> D2 -> E The edges D1->E and D1->D1 compete with each other, but each seems to happen 50% of the time, so the compiler will treat each edge as hot. so we devirtualize this bimorphic call, inline both; then, we proceed with processing v-calls in new inlined methods, and voila, there is the v-call for i.hasNext() in the same method again, so we devirtualize and inline further, until MaxRecursiveInlineLevel tells us to stop. The node merging eliminates profile information, so that when the inline re-splits the nodes, we have to duplicate the profile: chain11: D10 -> { E1, D11 -> { E2, D12 -> … } } chain12: D1 -> D2 -> E Dynamically, D10 -> E1 does not happen, nor does D11 -> D12, but the compiler puts in tests for both, because it cannot prove that those edges will never happen, and the profile suggests that they in fact do happen, 50% of the time. The special heuristics for recursion (MaxRecursiveInlineLevel) help cut off the code expansion, which is useful here, since there is almost no recursion happening (D10 -> D11 is the only self edge). Basically, the inliner is trying to balance the possibility of a real recursion (requiring the equivalent of loop unrolling) with spurious profile-induced recursion (requiring a quick cutoff and hoping that the hardware prefetcher hides the compiler's shame). In this case, the real situation is somewhere between. So there's profiler inaccuracy, plus some odd heuristics triggered by self-calls. The result is "one size fits all" code, which in this case is not a good fit. We could fix this particular problem by cloning the node D1 before profiling: Basically, recognize that D1 is self-recursive, and then split out a copy of D1 to be called from itself. This is a call-graph equivalent of loop peeling, which is wonderfully effective in many use cases. (Loop peeling is more robust, since it works after inlining. To be more closely equivalent to loop peeling, the cloning of D1 should be triggered whenever a cycle back to D1 is found in the call graph reachable from D1.) We have experimented with splitting profiles (not the whole bytecode), but the results have not been promising yet. It's worth tinkering with, though. Bytecode (method) cloning might be worthwhile too, if we can tune the heuristics properly.

You don't need bytecode cloning if you can have several metadata objects for one methods. BTW, I currently do bytecode cloning because there is no way to clear all mdos of one method.

About the splitting profiles, we know where we need splitting profiles, it's when a callsite is said to be polymorphic by the interpreter. In that case, c1 should generate a code with a special entry point (a third one) that takes a metadata object as parameter from the callsite. With that we have a way to create a profile tree that can be used by c2 to generate a dedicated code specialized for a callpath.

One more comment: I think what we are seeing here is an early indication of the most important optimization challenge of lambdas, which I call loop customization. I will send a separate note about this.

I fully agree and I don't see a way to be able to do that without introducing a method handle combiner for loops.

— John

Rémi



More information about the hotspot-compiler-dev mailing list