Allowing wrapping/transforming of the final result of Stream.collect() (original) (raw)

Brian Goetz brian.goetz at oracle.com
Tue Apr 16 06:55:42 PDT 2013

Previous message: Allowing wrapping/transforming of the final result of Stream.collect()
Next message: Allowing wrapping/transforming of the final result of Stream.collect()
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

AFAICT Stream's into() has been completely superceded by collect(), and Stream.Destination does not exist in b83.

Correct.

Does an equivalent now exist with collect()? I could not find an obvious way to do this, as both accumulator and combiner are expected to operate on elements or partial results while collecting is in progress, but are not given the chance to transform the final result.

This is correct.

If there is no equivalent, are there plans to add one? It would be a shame to miss this out, when I imagine it would be well used, not just for immutable collections, but also their cousins, the unmodifiable wrappers, as well as synchronized wrappers, and creating sets from maps (Collections.newSetFromMap, a common way to obtain a concurrent hash set[1]).

Agreed that it is a shame. We spent a lot of time investigating this, and it added more complexity than it first appeared, so we retreated to the current state.

The easy case is easy: you want to apply a transform to the final result. As in:

StringBuilder sb = stuff.map(Object::toString) .collect(toStringBuilder()); String result = sb.toString();

Here, the Collector does not return what you actually want. Of course, in this case, its easy to do it unobstrusively:

String result = stuff.map(Object::toString) .collect(toStringBuilder()) .toString();

But there are other cases where you want to do slightly more, and would like to apply a function to the result. For example, its easy to compute average with an array of two longs:

long[] raw = stuff.collect(() -> new long[2], (la, x) -> { la[0] += x; la[1]++; }, (la, lb) -> { la[0] += lb[0]; la[1] += lb[2]; }); double avg = (raw[1] > 0) ? (double) raw[0] / raw[1] : 0.0;

Here, its not quite so easy to just say .toDouble() on the result, so you have to do an extra step. Really, what you want to be able to to do is roll both into a Collector, where the internal state (array of longs) is hidden and only the result is exposed. We get that.

Where this falls apart is when you want to do this as the downstream reduction in a composed reduction, such as "group transactions by salesman and compute average sale". You want to get a Map<Salesman, Double> rather than a Map<Salesman, long[]>. Right?

But, this starts to get pretty messy. At the top level, you know when you're done -- because you're out of input. So there's an obvious and efficient time to apply the post-transform and just return that. But at the next level down, you don't know whether there are more values associated with a key coming. So you have to instantiate something like a Map<Salesman, long[]>, and then transform it into a Map<Salesman, Double>. And if you have a three-level groupBy going on, it gets worse.

This is messy. You have two choices: create a new Map (potentially hugely expensive) or create a view map (potentally hugely memory wasteful, potentially CPU-wasteful if an expensive transform has to be recomputed for multiple gets of the same key, and potentially moves a significant fraction of the computation until after the user thinks it is over.) And the library doesn't have the data with which to choose sensibly between these approaches (you need to know how expensive the "before" in-memory representation is, and how expensive the transform is, and make tradeoffs between them.)

Further, the "single post-transform" is also likely to be a sequential bottleneck. Since you don't apply the post transform to the leaves of the computation tree, but only the root, the obvious approach leads you to doing it sequentially. Which likely kills any parallelism you would have gotten.

None of these are insurmountable engineering problems, but extending the Collector API to support all these cases takes what is a mostly simple API and turns it into something much uglier. We spent a lot of time exploring this and did not come up with something that was acceptable. Given that the user has far more information on what the right choice is than the library does, it made sense to just let the user handle this, despite the regrettable loss of fluency. But making the Collector API far more complicated seemed worse.

Another pattern that I use is to take an immutable Collection and wrap it with my own type to provide methods in the domain language, e.g.:

Trades trades = new Trades(Arrays.asList(trade1, trade2, ...)); ... Date dateOfEarliest = trades.earliest().getTradedAtDate(); where the implementation of earliest() internally uses collection-like operations (e.g. min operation with a comparator) but hides them behind a method expressed in the domain language. I would like, but am currently unable, to produce a new Trades() instance at the end of the collect() method. The point of this example is that while we can talk of unmodifiable and synchronized wrappers, etc, which could be provided in the JDK, this is a use case that would be impossible for the JDK to provide.

Right. You collect() to a Collection and then wrap with a Trades. You could do it like this:

Collection c = stream...collect(); Trades t = new Trades(c);

Trades t = new Trades(stream...collect());

Stream s = stream...stuff... Trades t = new Trades(s.collect(...));

- there's already several forms of collect(), another version adding a method with another parameter to provide the final conversion would spend some of the complexity budget

It's more than that. Having a form

<R, Z> Z collect(Collector<T, R> c, Function<R,Z> f)

is not so bad -- if it had a lot of value, I'd surely consider one more form. And the implementation is obviously trivial. But where's the value? It just saves the caller from having to do:

Z z = f.apply(r);

after the collect. And it will be sequential, even on a parallel pipeline, which may be surprising to some users.

Basically:

In the cases where this works, its trivial for the caller to do it themselves;
In the cases where it's not trivial for the caller to do it themselves, it requires a great deal of API complexity to capture all the possibilities.

Previous message: Allowing wrapping/transforming of the final result of Stream.collect()
Next message: Allowing wrapping/transforming of the final result of Stream.collect()
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the lambda-dev mailing list