explode (original) (raw)

Brian Goetz brian.goetz at oracle.com
Wed Feb 6 14:32:28 PST 2013


Guys, we need to close on the open Stream API items relatively soon. Maybe we're almost there on flatMap.

Of the alternatives for flatMap below, I think while Alternative B is attractive from a client code perspective, I think Alternative A is less risky with respect to stressing the compiler (and also introduces fewer new types.)

So, semi-concrete proposal:

Stream flatMapToCollection(Function<T, Collection>) Stream flatMapToArray(Function<T, U[]>) // do we even need this? Stream flatMap(Function<T, Stream>) Stream flatMap(FlatMapper<T, U>)

where

interface FlatMapper<T, U> { void explodeInto(T t, Consumer consumer); }

with specializations for primitives:

IntStream flatMap(FlatMapper.OfInt) ... etc

We can then position flatMap as the "advanced" version, so from a "graduated learning" perspective, people will find fMTC first, if that meets their needs, great, and the Javadoc for fMTC can guide them to fM for the more advanced cases.

On 2/4/2013 3:37 PM, Brian Goetz wrote:

From this, here's what I think is left to do: - More work on explode needed > ... Circling back to this. Clearly explode() is not done. Let me try and capture all the relevant info in one place. Let's start with some background. Why do we want this method at all? Well, it's really useful! It's fairly common to do things like: Stream orders = ... Stream lineItems // explicit declaration for clarity = orders.explode(... order.getLineItems() ...) and it is often desirable to do then streamy things on the stream of line items. Those who have used flatMap in Scala get used to having it quite quickly, and would be very sad if it were taken away. Ask Don how many examples in his katas use it. (Doug will also point out that if you have flatMap, you don't really need map or filter -- see examples in CHM -- since both can be layered atop flatMap, modulo performance concerns.) (It does have the potential to put some stress on the system when an element can be mapped to very large collections, because it sucks you into the problem of nested parallelism. (This is the inverse of another problem we already have, which is when filter stages have very high selectivity, and we end up with a lot of splitting overhead for the number of elements that end up at the tail of the pipeline.) But when mapping an element to a small number of other elements, as is common in a lot of use cases, there is generally no problem here.) Scala has a method flatMap on whatever they'd call Stream, which takes a function T -> Stream and produces a Stream. More generally, this shape of flatMap applies (and is supported by higher-kinded generics) to many traits in the Scala library, where you have a Foo[A] and the flatMap method takes an A -> Foo[B] and produces a Foo[B]. (Our generics can't capture this.) This is the shape for flatMap that everyone really wants. But here, we run into unfortunate reality: this works great in functional languages with cheap structural types, but that's not Java. I took it as a key design goal for flatMap/mapMulti/explode that: if an element t maps to nothing, the implementation should do as close to zero work as possible. This rules out shaping flatMap as: Stream flatMap(Function<T, Collection>) because, if you don't already have the collection lying around, the lambdas for this are nasty to write (try writing one, and you'll see what I mean), inefficient to execute, and create work for the library to iterate the result. In the limit, where t maps to the empty collection, creating an iterating an empty collection for each element is nuts. (By contrast, in languages like Haskell, wrapping elements with lists is very cheap.) However, the above shape is desirable as a convenience in the case you already do have a collection lying around. So let's put it in the bin of "nice conveniences to also deliver when we solve the main problem." It also rules out shaping flatMap as: Stream flatMap(Function<T, Stream>) because that's even worse -- creating ad-hoc streams is even more expensive than creating collections. To simplify, imagine there are two use cases we have to satisfy: - map element to generator (general case) - map element to collection (convenience case) The other cases (to array, to stream) are similar enough to the collection case. To illustrate the general "generator" case, here's an example of a lambda (using the current API) that takes a Stream of String and produces a Stream of Integer values which are the characters of that stream: (sink, element) -> { for (int i=0; i<element.length(); i++)_ _sink.send(element.charAt(i));_ _}_ _It's efficient (no input, no output) and pretty easy to see what's going_ _on. Would be nicer if we could spell "sink.send" as "yield", but oh_ _well. Here's how we'd have to write that if we didn't support the_ _generator case:_ _(element) -> { ArrayList list = new ArrayList<>(); for (int i=0; i<element.length(); i++)_ _list.add(element.charAt(i));_ _return list;_ _}_ _Bigger and less efficient. And it gets uglier if we want to try and_ _optimize away the list creation in the empty case:_ _(element) -> { if (element.length() == 0) return Collections.emptyList(); ArrayList list = new ArrayList<>(); for (int i=0; i<element.length(); i++)_ _list.add(element.charAt(i));_ _return list;_ _}_ _We're really starting to lose sight of what this lambda does. (Hopefully_ _this will put to bed the notion that all we need is the T->Collection case.) Erasure plays a role here too. Ideally, it would be nice to overload methods for flatMap(Function<T, Collection>) flatMap(Function<T, U[]) but obviously we can't do that (directly).

The original API had only: Stream flatMap(MultiFunction<T,U> mf) where MultiFunction was (T, Consumer) -> void. If users already had a Collection lying around, they had to iterate it themselves: (element, sink) -> { for (U u : findCollection(t)) sink.accept(u); } which isn't terrible but people didn't like it -- I think not because it was hard to read, but hard to figure out how to use flatMap at all. The current iteration provides a helper class with helper methods for handling collections, arrays, and streams, but you still have to wrap your head around why you're being passed two things before doing anything -- and I think its the "before doing anything" part that really messes people up. So, here's two alternatives that I hope may be better (and not run into problems with type inference). Alternative A: overloading on method names. // Map T -> Collection public StreamA explodeToCollection(Function<T, Collection> mapper); // Map T -> U[] public StreamA explodeToArray(Function<T, U[]> mapper); // Generator case -- pass a T and a Consumer public StreamA explodeToConsumer(BiConsumer<T, Consumer> mapper); // Alternate version of generator case -- with named SAM instead public StreamA altExplodeToConsumer(Exploder<T, U> mapper); interface Exploder<T, U> { void explode(T element, Consumer consumer); } Here, we have various explodeToXxx methods (naming is purely illustrative) that defeat the erasure problem. Users seeking the T->Collection version can use the appropriate versions with no problem. When said users discover that their performance sucks, they have motivation to learn to use the more efficient generator version. Usage examples: StreamA a1 = a.explodeToArray(i -> new Integer[] { i }); StreamA a2 = a.explodeToCollection(i -> Collections.singleton(i)); StreamA a3 = a.explodeToConsumer((i, sink) -> sink.accept(i)); Alternative B: overload on SAMs. This involves three SAMs: interface Exploder<T, U> { void explode(T element, Consumer consumer); } interface MapperToCollection<T, U> extends Function<T, Collection> { } interface MapperToArray<T, U> extends Function<T, U[]> { } And three overloaded explode() methods: public StreamB explode(MapperToCollection<T, U> exploder); public StreamB explode(MapperToArray<T, U> exploder); public StreamB explode(Exploder<T, U> exploder); Usage examples: StreamB b1 = b.explode(i -> new Integer[] { i }); StreamB b2 = b.explode(i -> Collections.singleton(i)); StreamB b3 = b.explode((i, sink) -> sink.accept(i)); I think the second approach is pretty decent. Users can easily understand the first two versions and use them while wrapping their head around the third.



More information about the lambda-libs-spec-observers mailing list