Collectors update redux (original) (raw)

Tim Peierls tim at peierls.net
Thu Feb 7 10:41:10 PST 2013


Is three-arg collect really the target "on ramp"? I would have thought the first stop would be the combinators. OTOH ... there's a lot of stuff in there. I can think of uses for all of it, but I worry about someone faced with picking the right static factory method of Collectors. Maybe with the right class comment, users can be guided to the right combinator without having to know much.

--tim

On Wed, Feb 6, 2013 at 5:12 PM, Brian Goetz <brian.goetz at oracle.com> wrote:

Did more tweaking with Collectors.

Recall there are two basic forms of the collect method: The most basic one is the "on ramp", which doesn't require any understanding of Collector or the combinators therein; it is basically the mutable version of reduce. It looks like: collect(() -> R, (R,T) -> void, (R,R) -> void) The API shape is defined so that most invocations will work with method references: // To ArrayList collect(ArrayList::new, ArrayList::add, ArrayList::addAll) Note that this works in parallel too; we create list at the leaves with ::add, and merge them up the tree with ::addAll. // String concat collect(StringBuilder::new, StringBuilder::append, StringBuilder::append) // Turn an int stream to a BitSet with those bits set collect(BitSet::new, BitSet::set, BitSet::or) // String join with delimiter collect(() -> new StringJoiner(", "), StringJoiner::append, StringJoiner::append) Again, all these work in parallel. Digression: the various forms of reduce/etc form a ladder in terms of complexity: If you understand reduction, you can understand... ...reduce(T, BinaryOperator) If you understand the above + Optional, you can then understand... ...reduce(BinaryOperator) If you understand the above + "fold" (nonhomogeneous reduction), you can then understand... ...reduce(U, BiFunction<U, T, U> accumulator, BinaryOperator); If you understand the above + "mutable fold" (inject), you can then understand... ...collect(Supplier, (R,T) -> void, (R,R) -> void) If you understand the above + "Collector", you can then understand... ...collect(Collector) This is all supported by the principle of commensurate effort; learn a little more, can do a little more. OK, exiting digression, moving to the end of the list, those that use "canned" Collectors. collect(Collector) collectUnordered(Collector) Collectors are basically a tuple of three lambdas and a boolean indicating whether the Collector can handle concurrent insertion: Collector<T,R> = { () -> R, (R,T) -> void, (R,R) -> R, isConcurrent } Note there is a slight difference in the last argument, a BinaryOperator rather than a BiConsumer<R,R>. The BinaryOperator form is more flexible (it can support appending two Lists into a tree representation without copying the elements, whereas the (R,R) -> void form can't.) This asymmetry is a rough edge, though in each case, the shape is "locally" optimal (in the three-arg version, the void form supports method refs better; in the Collector version, the result is more flexible, and that's where we need the flexibility.) But we could make them consistent at the cost of the above uses becoming more like: collect(StringBuilder::new, StringBuilder::append, (l, r) -> { l.append(r); return l; }) Overall I think the current API yields better client code at the cost of this slightly rough edge.

The set of Collectors now includes: toCollection(Supplier<**Collection>) toList() toSet() toStringBuilder() toStringJoiner(delimiter) // mapping combinators (plus primitive specializations) mapping(T->U, Collector) // Single-level groupBy groupingBy(T->K) // groupBy with downstream Collector) groupingBy(T->K, Collector) // grouping plus reduce groupingReduce(T->K, BinaryOperator) // reduce only groupingReduce(T->K, T->U, BinaryOperator) // map+reduce // join (nee mappedTo) joiningWith(T -> U) // produces Map<T,U> // partition partitioningBy(Predicate) partitioningBy(Predicate, Collector) partitioningReduce(Predicate<**T>, BinaryOperator) partitioningReduce(Predicate<**T>, T->U, BinaryOperator) // statistics (gathers sum, count, min, max, average) toLongStatistics() toDoubleStatistics() Plus, concurrent versions of most of these (which are suitable for unordered/contended/forEach-**style execution.) Plus versions that let you offer explicit constructors for maps and collections. While these may seem like a lot, the implementations are highly compact -- all of these together, plus supporting machinery, fit in 500 LoC. These Collectors are designed around composibility. (It is vaguely frustrating that we even have to separate the "with downstream Collector" versions from the reducing versions.) So they each have a form where you can do some level of categorization and then use a downstream collector to do further computation. This is very powerful. Examples, again using the familiar problem domain of transactions: class Txn { Buyer buyer(); Seller seller(); String description(); int amount(); } Transactions by buyer: Map<Buyer, Collection> m = txns.collect(groupingBy(Txn::**buyer)); Highest-dollar transaction by buyer: Map<Buyer, Transaction> m = txns.collect( groupingReduce(Txn::buyer, Comparators.greaterOf( Comparators.comparing(Txn::**amount))); Here, comparing() takes the Txn -> amount function, and produces a Comparator; greaterOf(comparator) turns that Comparator into a BinaryOperator that corresponds to "max by comparator". We then reduce on that, yielding highest-dollar transaction per buyer. Alternately, if you want the number, not the transaction: Map<Buyer, Integer> m = txns.collect(groupingReduce(**Txn::buyer, Txn::amount, Integer::max)); Transactions by buyer, seller: Map<Buyer, Map<Seller, Collection>> m = txns.collect(groupingBy(Txn::**buyer, groupingBy(Txn::seller))); Transaction volume statistics by buyer, seller: Map<Buyer, Map<Seller, LongStatistics>> m = txns.collect(groupingBy(Txn::**buyer, groupingBy(Txn::seller, mapping(Txn::amount, toLongStatistics()))); The statistics let you get at min, max, sum, count, and average from a single pass on the data (this trick taken from ParallelArray.) We can mix and match at various levels. For example: Transactions by buyer, partitioned int "large/small" groups: Predicate isLarge = t -> t.amount() > BIG; Map<Buyer, Map<Boolean, Collection>> m = txns.collect(groupingBy(Txn::**buyer, partitioningBy(isLarge))); Or, turning it around: Map<Boolean, Map<Buyer, Collection>> m = txns.collect(partitioningBy(**isLarge, groupingBy(Txn::buyer))); Because Collector is public, Kevin can write and publish Guava-multimap-bearing versions of these -- probably in about ten minutes.



More information about the lambda-libs-spec-observers mailing list