perspectives on streams performance (original) (raw)

Aggelos Biboudis biboudis at gmail.com
Tue Mar 10 18:22:05 UTC 2015


Hi John,

For this investigation someone could find useful - as a starting-point - the high-level performance assessment we recently did - http://biboudis.github.io/clashofthelambdas/ - and put a magnifying glass over the Java cases. We have 4 simple JMH’d micro-benchmarks (and their corresponding for-based ones as baselines) that could be enhanced. Afterwards, the equivalence between JIT-ted streams and hand-optimized for-loops could be evaluated. The cases that are missing are mostly type-polluted pipelines that will help us to understand the costs of deoptimization.

Kind regards, Aggelos Biboudis.

On Fri, Mar 6, 2015 at 3:01 AM, John Rose <john.r.rose at oracle.com> wrote:

In order to get the full benefit from JDK 8 streams we will need to make them optimize fully. Here are a few thoughts about that.

I think of streams as a more concise and orderly replacement of classic "for" loops. Every stream can be rewritten as one or more for-loops, at the cost of verbosity and commitment to hard-coding optimizations (like FJ). A classic "for" loop is a external iterator notation: The iteration machinery is outside of (lexically around) the data structure access. This notation is at least as old as Fortran. Streams are an internal iteration notation: The iteration machinery (crucially, the looping part of the algorithm) is inside the data structure, and only the loop body appears in the user code (as a lambda). This notation is also old, found in Lisp and Smalltalk. External iterators are easier to optimize, because their crucial iteration logic is always inlined into the specific loop request as coded by the user. (It has to be, because the user writes it explicitly.) Existing compilers, like HotSpot's, are good at optimizing "for" loops. HotSpot are less good at internal iterators. If the original point of the user request fails to inline all the way into the internal looping part of the algorithm (a hidden "for" loop), the quality of the loop will be very poor. (Exception: With micro-benchmarks, the loop quality can be partially recovered by relying on a pure, clean profile. But with real system code, the hidden "for" loop will have a polluted profile.) This is the problem I have referred to as "loop customization", even though it applies to non-looping code as well (as long as there is a template that needs expanding in order to gain performance). If streams are to perform at peak, we need to be able to connect the user request (where the user would have written a classic "for" loop, but chose a stream instead) to the expansion, customization, and optimization of the hidden loop. (Note that the hidden loop may run in a different thread! This defeats the usual forms of inlining.) Somehow the conditions of the user request need to be communicated to the code that actually does the work. The code that actually runs the loop must be customized to take into account whatever "syndrome" of template parameters (such as closure bodies or operand types) that are critical to optimizing the loop code. There are two natural scales of optimization: Per-chunk (sped up by multi-threading) and per-loop-body (sped up by vectorization and unrolling). Out of the box, the "parallel" modifier gives good multi-threading. But I am afraid that the loop body optimizations is less well behaved, for streams. I would like to encourage any interested colleagues to examine streams performance under the following conditions: 1. Head-to-head comparison of "for" loops and equivalent streams. 2. Proper vectorization of both forms of loops, at least for arraywise elemental operations, searches, and reductions. This would include issuing the best vectorizing instructions available for the platform (I'm thinking Haswell, etc.). 3. Benchmark management which operates multiple loop examples per JVM, to simulate realistically "dirty" hidden-loop profiles. 4. Artificial suppression of inlining from the request point (of a stream-based loop) to the algorithm's hidden loop, again to simulate realistically the compilation of loop kernels to run in multiple threads. All of the above examples are focused on measuring and improving per-loop-body optimizations (vectorization and other loop transforms). None of them need to be run with FJ or multiple threads. The JMH framework would be very useful for running the tests. It may be that after a head-to-head comparison we will find that the HotSpot optimizer is better than I'm giving it credit for. I have not made these studies myself. But my usual experience is that rocks like this, when you turn them over, have something "interesting" under them. None of this is urgent. I'm putting it out as a possibly interesting project for people to collaborate on. — John -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20150310/ad0c5478/attachment-0001.html>



More information about the hotspot-compiler-dev mailing list