Fwd: Forms for reduce() -- part 2

Brian Goetz brian.goetz at oracle.com
Thu Jan 3 14:32:50 PST 2013


This got caught in the filters -- it's old and probably out of date, but it should go on the record.

-------- Original Message --------
Subject: Forms for reduce() -- part 2
Date: Mon, 10 Dec 2012 16:41:38 -0500
From: Brian Goetz <brian at briangoetz.com>
To: lambda-libs-spec-experts at openjdk.java.net

This second note is about the proposed overhaul of groupBy/reduceBy. I've already outlined why I don't like these methods -- they both tie the API to Map/Collection, while at the same time giving users relatively limited ability to handle more than a few simple cases.

We don't need groupBy/reduceBy, since groupBy is just reduceBy (with the row reducer being reduce(ArrayList::new, ArrayList::add)) and reduceBy is just reduce, with a suitably complicated implementation of the reducer functions. But we don't want users to have to build their own groupBy/reduceBy out of pure reduce -- it's too much work, too error-prone, etc. We should provide canned versions. The current versions are an example of what we might do, but I think we can do much better.
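
To see how much machinery "pure reduce" pushes onto the user, here is a hand sketch of groupBy built from a functional reduce(identity, accumulator, combiner) -- the names and the copy-per-element style are illustrative, not the draft API. A functional reduce must not mutate the shared identity value, which is exactly what makes this both verbose and expensive:

 import java.util.*;
 import java.util.function.Function;
 import java.util.stream.Stream;

 class GroupByFromReduce {
     static <T, K> Map<K, List<T>> groupBy(Stream<T> s, Function<T, K> classifier) {
         return s.reduce(
             Collections.<K, List<T>>emptyMap(),          // identity: shared, so never mutated
             (m, t) -> {                                  // accumulator: copy, then add
                 Map<K, List<T>> copy = new HashMap<>(m);
                 copy.computeIfAbsent(classifier.apply(t), k -> new ArrayList<>()).add(t);
                 return copy;
             },
             (m1, m2) -> {                                // combiner: merge two partial maps
                 Map<K, List<T>> merged = new HashMap<>(m1);
                 m2.forEach((k, v) -> merged.merge(k, v,
                     (a, b) -> { List<T> c = new ArrayList<>(a); c.addAll(b); return c; }));
                 return merged;
             });
     }
 }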

There is also the issue that I don't think we've correctly surfaced the issues surrounding when encounter order must be maintained, and when we can relax it for better performance. Currently, we guess whether a stream has a defined encounter order or not based on its source; Lists do, non-sorted Sets don't. But this is only indirectly coupled to whether the user cares about this order or not. The "unordered()" hack is helpful, but it's a hack. The second factor here is associativity. If your reducing function is merely associative -- the only property the framework can reasonably assume -- then you MUST care about order. But many reducing functions are also commutative (sum, min, max). The user knows this, but the framework currently does not, so it has to be conservative.
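
To make the associative-vs-commutative point concrete (an illustration only, using concatenation and sum):

 // String concatenation is associative but NOT commutative: a parallel reduce
 // must reassemble chunks in encounter order or the answer changes.
 String s = Stream.of("a", "b", "c").reduce("", String::concat);  // must be "abc"

 // Integer addition is commutative as well, so chunk order is irrelevant.
 int sum = Stream.of(1, 2, 3).reduce(0, Integer::sum);            // 6 in any order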

The two ways to do a mutable reduce are:

  1. A merging reduce -- each chunk of the stream is reduced into its own result container, and the partial results are then combined pairwise, respecting encounter order.

  2. A contention-based reduce -- all threads accumulate into a single shared concurrent result container, with no merge step and no order guarantee.

IF you don't care about encounter order, OR your reduction is commutative, you can use the latter approach -- which is more like a forEach than a reduce.

Currently some ops have two implementations, a merging one and a contention-based one, keyed off of the ORDERED flag. This is both complex for us to maintain and also not always what the user wants. Better to let the user just say.
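
For concreteness, here is what the merging shape looks like for a grouping reduction -- a hand sketch, not the draft API. Each chunk is reduced into its own private map, and partial maps are then merged pairwise in encounter order:

 import java.util.*;
 import java.util.function.Function;

 class MergingGroupBy {
     // reduce one chunk into a private, unshared map
     static <T, K> Map<K, List<T>> reduceChunk(List<T> chunk, Function<T, K> classifier) {
         Map<K, List<T>> m = new HashMap<>();
         for (T t : chunk)
             m.computeIfAbsent(classifier.apply(t), k -> new ArrayList<>()).add(t);
         return m;
     }

     // merge two partial results; left's rows precede right's, preserving order
     static <T, K> Map<K, List<T>> merge(Map<K, List<T>> left, Map<K, List<T>> right) {
         right.forEach((k, v) ->
             left.merge(k, v, (a, b) -> { a.addAll(b); return a; }));
         return left;
     }
 }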

Since all problems are solved by another layer of indirection, I will propose a new abstraction -- "tabulation". Tabulation is the process of creating a structured description of an input stream. Tabulations could describe:

  1. A simple grouping into buckets -- Map<K, Collection<T>> (today's groupBy).

  2. A grouped reduction -- Map<K, U> (today's reduceBy).

  3. A multi-level grouping -- Map<K1, Map<K2, Collection<T>>>, and other nested shapes.

The latter cannot be handled at all with our current tools, nor can the current tools produce a Guava Multimap instead of a HashMap, or do Don's "groupByMulti". The only benefit of the current groupBy/reduceBy tools is that they automate some of the mechanics of producing fake multimaps. And even this advantage mostly goes away with the addition of Map.{merge,compute,computeIfAbsent}.
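
For example, with computeIfAbsent the per-element bookkeeping of a fake multimap collapses to a single line (a sketch, where classify and source stand in for the user's classifier and input):

 Map<K, Collection<T>> m = new HashMap<>();
 for (T t : source)
     m.computeIfAbsent(classify(t), k -> new ArrayList<>()).add(t);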

Use cases -- tabulation

Given the following domain model, here are some use cases for tabulation:

interface Document { Author author(); Editor editor(); int pages(); }

  1. Group documents by author -- given a Stream<Document>, produce a MapLike<Author, CollectionLike<Document>>. Here, the user may want control over what kind of MapLike and what kind of CollectionLike to use.

  2. Group documents by author and editor -- given a Stream<Document>, produce a Map<Author, Map<Editor, Collection<Document>>>.

  3. Partition documents into collections of "shorter than 50 pages" and "longer than 50 pages".

  4. Count of documents by author -- produce a Map<Author, Integer>.

  5. Longest doc by author -- produce a Map<Author, Document>.

  6. Sum of pages by author -- produce a Map<Author, Integer>.

  7. Authors of documents joined with additional data -- produce a Map<Author, V> for some function Author->V.

Our current stuff can do (1), (4), (5), and (6), but not (2), (3), or (7) easily.

Let's assume we have an interface Reducer<T,R> that describes a reduction taking a set of T values and producing an R.

Then we can create canned reducers, such as the Average from the previous note:

static final Reducer<Integer,Double> AVERAGE = reducer(...)
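
One plausible shape for this -- the names here are guesses for illustration, not the draft API -- hides the intermediate accumulation type behind a factory, and AVERAGE falls out naturally:

 import java.util.function.*;

 interface Reducer<T, R> {
     R reduce(Iterable<? extends T> values);
 }

 class Reducers {
     // A is the hidden accumulation type. Shown sequentially; a parallel
     // implementation would seed one A per chunk and use the combiner.
     static <T, A, R> Reducer<T, R> reducer(Supplier<A> seed,
                                            BiConsumer<A, T> accumulator,
                                            BinaryOperator<A> combiner,
                                            Function<A, R> finisher) {
         return values -> {
             A acc = seed.get();
             for (T t : values)
                 accumulator.accept(acc, t);
             return finisher.apply(acc);
         };
     }

     // average = accumulate (sum, count), finish with sum/count
     static final Reducer<Integer, Double> AVERAGE =
         reducer(() -> new long[2],
                 (a, i) -> { a[0] += i; a[1]++; },
                 (a, b) -> { a[0] += b[0]; a[1] += b[1]; return a; },
                 a -> a[1] == 0 ? 0.0 : (double) a[0] / a[1]);
 }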

Then we define:

interface Tabulator<T,R> extends Reducer<T,R> { }

And we define combinators for tabulators like:

 // traditional groupBy
 // will want versions to default to HashMap/ArrayList
 static <T, U,
         C extends Collection<T>,
         M extends Map<U, C>>
 Tabulator<T, M>
 groupBy(Function<T,U> classifier,
         Supplier<M> mapFactory,
         Supplier<C> rowFactory) { ... }

 // nested groupBy
 static <T, U, D,
         M extends Map<U, D>>
 Tabulator<T, M>
 groupBy(Function<T,U> classifier,
         Supplier<M> mapFactory,
         Tabulator<T, D> downstreamTabulator) { ... }

 // Take a Stream<T> and a Function<T,U> and create Map<T,U>
 static <T, U, M extends Map<T,U>>
 Tabulator<T, M>
 mappedTo(Function<T,U> mapper, Supplier<M> mapFactory) { ... }

 // What we were calling reduceBy
 // Will want other reduce variants
 static <T, U, I,
        M extends Map<U,I>>
 Tabulator<T,M>
 groupedReduce(Function<T,U> classifier,
               Supplier<I> baseFactory,
               BiBlock<I, T> acceptElement,
               BiBlock<I, I> combiner) { ... }

These are easy to define and users can define their own. We'll have a dozen or two tabulator factories and combinators, not so bad, and if we forget one, no worry. Guava can easily define a groupBy that groups to a MultiMap. Etc.
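
For instance, here is a guessed implementation of the collection-valued groupBy, assuming (for the sketch only) that a mutable Tabulator exposes makeResult/accept/combine operations -- these method names are illustrative, not the draft API:

 import java.util.*;
 import java.util.function.*;

 // Illustrative shape only -- not the draft API.
 interface Tabulator<T, R> {
     R makeResult();                  // fresh mutable result container
     void accept(R result, T t);      // fold one element in
     void combine(R target, R other); // merge a partial result into target
 }

 class Tabulators {
     static <T, U, C extends Collection<T>, M extends Map<U, C>>
     Tabulator<T, M> groupBy(Function<T, U> classifier,
                             Supplier<M> mapFactory,
                             Supplier<C> rowFactory) {
         return new Tabulator<T, M>() {
             public M makeResult() { return mapFactory.get(); }
             public void accept(M map, T t) {
                 map.computeIfAbsent(classifier.apply(t), k -> rowFactory.get()).add(t);
             }
             public void combine(M target, M other) {
                 other.forEach((k, v) ->
                     target.computeIfAbsent(k, kk -> rowFactory.get()).addAll(v));
             }
         };
     }
 }

A Guava groupBy into a Multimap would be the same dozen lines with Multimap.put in accept.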

So, our use cases using these:

  1. Group documents by author

Map<Author, List<Document>> m = docs.tabulate(groupBy(Document::author, HashMap::new, ArrayList::new));

  2. Group documents by author and editor

Map<Author, Map<Editor, List<Document>>> m = docs.tabulate(groupBy(Document::author, HashMap::new, groupBy(Document::editor)));

  3. Partition documents into collections of "shorter than 50 pages" and "longer than 50 pages" (a sketch of partitionBy appears after this list)

List<List<Document>> l = docs.tabulate(partitionBy(d -> d.pages() <= 50, ArrayList::new, ArrayList::add));

  4. Count of documents by author -- produce a Map<Author, Integer>.

Map<Author, Integer> m = docs.tabulate(groupedReduce(Document::author, () -> 0, (s, d) -> s + 1, Integer::plus));

  5. Longest doc by author -- produce a Map<Author, Document>.

Map<Author, Document> m = docs.tabulate(groupedReduce(Document::author, (d1, d2) -> (d1.pages() >= d2.pages()) ? d1 : d2, (d1, d2) -> (d1.pages() >= d2.pages()) ? d1 : d2));

(This uses a seedless groupedReduce variant -- one of the "other reduce variants" mentioned above.)

  6. Sum of pages by author -- produce a Map<Author, Integer>.

Map<Author, Integer> m = docs.tabulate(groupedReduce(Document::author, () -> 0, (s, d) -> s + d.pages(), Integer::plus));

  7. Authors of documents joined with additional data -- produce a Map<Author, V> for some function Author->V.

Map<Author, V> m = docs.map(Document::author).uniqueElements().tabulate(mappedTo(a -> f(a), HashMap::new));
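
The partitionBy combinator used in use case 3 was not defined above; here is a guessed version, in the same illustrative makeResult/accept/combine Tabulator shape as the earlier groupBy sketch (bucket 0 holds the matches, bucket 1 the rest):

 static <T, C extends Collection<T>>
 Tabulator<T, List<C>> partitionBy(Predicate<T> predicate,
                                   Supplier<C> rowFactory,
                                   BiConsumer<C, T> accumulator) {
     return new Tabulator<T, List<C>>() {
         public List<C> makeResult() {
             return new ArrayList<>(Arrays.asList(rowFactory.get(), rowFactory.get()));
         }
         public void accept(List<C> result, T t) {
             accumulator.accept(result.get(predicate.test(t) ? 0 : 1), t);
         }
         public void combine(List<C> target, List<C> other) {
             target.get(0).addAll(other.get(0));
             target.get(1).addAll(other.get(1));
         }
     };
 }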

Overall, I think these forms are fairly usable. There is a pile of factories and combinators for reducers/tabulators, which can be combined to do whatever you want. Users can create new ones, or can compose them into canned tabulators.

The last bit is the separation of fold-style functional merging from contention-based "toss it in a big ConcurrentHashMap." I think the answer here is to have two forms of tabulators. Let's call them Tabulator and ConcurrentTabulator for the sake of discussion. For each canned tabulator form, there'd be a merging and a concurrent form. Now the choice is in the user's hands; if they don't care about encounter order, or have a combining function that is commutative as well as associative, they can select the latter, which is more like a forEach than a reduce.
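
For illustration, here is a sketch (not the proposed API) of what the concurrent form does under the hood: every thread tosses elements into one shared ConcurrentHashMap, so there is no merge step -- and no order guarantee:

 import java.util.*;
 import java.util.concurrent.*;
 import java.util.function.Function;

 class ConcurrentGroupBy {
     static <T, K> Map<K, Collection<T>> groupBy(Collection<T> source,
                                                 Function<T, K> classifier) {
         ConcurrentMap<K, Collection<T>> m = new ConcurrentHashMap<>();
         source.parallelStream().forEach(t ->      // forEach-style: threads contend on m
             m.computeIfAbsent(classifier.apply(t),
                               k -> new ConcurrentLinkedQueue<>()).add(t));
         return m;
     }
 }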

It sounds like a lot of new surface area, but it really isn't. We need a few forms of reduce, an abstraction for Reducer, an abstraction for Tabulator, and some factories and combinators which are individually trivial to write. It moves a lot of the "can't define your own stream ops" problems into a domain where users can define their own reducers and tabulators.


