Spliterator (original) (raw)

Brian Goetz brian.goetz at oracle.com
Fri Nov 16 08:40:31 PST 2012

Previous message: Fwd: Re: Stream.limit() - puzzler/bug/feature
Next message: Spliterator
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Following up...we've eliminated StreamAccessor from the design, so you only need a Spliterator (and maybe stream flags) now to create a stream. Here's the latest API and spec for Spliterator. Some explanations of the choices are in the original mail to which this is a reply.

/**

Provides the ability to decompose an aggregate data structure and to iterate
over the elements the elements of the aggregate.
Use of a Spliterator has two phases; splitting and traversal. The splitting
phase divides the elements of the data structures so that they may be
processed concurrently. The split phase is intended to be repeated
recursively until the number of elements per split reaches a threshold where
further splitting would only generate additional cost. Most methods of
{@code Spliterator} are applicable primarily to one phase or the other.
The traversal phase allows you to obtain access to the elements of the current
split, and begins on first call to the {@code iterator()} or {@code forEach()}
methods.
@author Brian Goetz
@author Doug Lea */ public interface Spliterator {

/**
- Return the number of calls to {@code split()} that will

naturally divide * the data structure. Each of the splits may suggest additional splitting. * The result of calling this method during the traversal phase is unspecified. *

* @return The number of splits available for this Spliterator. May be zero * indicating that no additional splits are recommended or possible. */ int getNaturalSplits();

 /**
  * Returns a Spliterator covering some portion of the elements,

guaranteed * not to overlap with those retained by this Spliterator. After invoking * this method, the current Spliterator will not cover the elements * of the returned Spliterator. The result of calling this method during the * traversal phase is unspecified. *

* @return a Spliterator covering the some portion, possibly empty, of the * data structure elements. *

* @throws IllegalStateException if this Spliterator has already commenced * traversing elements. */ Spliterator split();

 /**
  * Return the Iterator covering the remaining elements. The same

iterator * instance is returned for every invocation. This method initiates the * traversal phase. *

* @return the iterator of the remaining elements. */ Iterator iterator();

 /**
  * Provides any remaining elements into the provided sink.  This

method initiates * the traversal phase if it has not already been initiated. * * @param block The sink to which elements will be provided. */ default void forEach(Block<? super T> block) { Iterator remaining = iterator(); while (remaining.hasNext()) { block.apply(remaining.next()); } }

 /**
  * Returns the count of elements currently retained by this Stream.
  * This is the count which would be processed by {@code into()} or via
  * {@code iterator()}. If the element count cannot be computed

exactly and * cheaply then {@code -1} is returned. The result of calling this method * during the traversal phase is unspecified. *

* @return The number of remaining elements or {@code -1} if the remaining * element count is unavailable. */ default int getSizeIfKnown() { return -1; }

 /**
  * Returns an estimate, possibly inexact, of the count of elements

currently * retained by this Stream. This is the count which would be processed by * {@code into()} or via {@code iterator()}. If an element count estimate * cannot be computed cheaply then {@code -1} is returned. The result of calling * this method during the traversal phase is unspecified. *

* If {@code getSizeIfKnown()} returns a non-negative integer then * {@code estimateSize()} must return the same value. * * @return The number of remaining elements or {@code -1} if the remaining * element count is unavailable. */ default int estimateSize() { return getSizeIfKnown(); }

 /**
  * Return {@code true} if this Spliterator and all returned

Spliterators * provide non-negative size information from {@code getSizeIfKnown} and * {@code estimateSize}. The result of calling this method during the * traversal phase is unspecified. *

* @return {@code true} if size information is available for this * Spliterator and all sub-splits. */ boolean isPredictableSplits(); }

On 9/21/2012 3:07 PM, Brian Goetz wrote:

Various subsets of people have gone around various times on the interface for Spliterator (and by extension, StreamAccessor.) After playing with the implementation a bit, I think we have something decent that balances most of the issues raised.

These issues include: - Extent and mechanism for participating in the "when to split" decision; - State constraints (i.e., can you split after iterating?) - API decisions that force runtime costs. Doug suggested the following for splitting support: long estimateSize() where the implementation could return MAXVALUE to mean "I have no clue". The key constraint is that this estimate (a) never increases and (b) eventually decreases, so that eventually, if perhaps jerkily, it will converge to something that is reasonable to compare against a target chunk size. Separately, I suggested the following, also for splitting support: int getNaturalSplits() which indicates what the "natural" number of splits is for the data structure. (For various reasons, it is probably more natural for this to return one less.) So for a non-splittable data structure this is zero; for an array that we plan to split in a binary fashion, this is 1, for a 4-ary tree, this is 3, etc. It is completely advisory and the client is free to ignore it, but if the client takes it into account, will likely get a more balanced computation tree. It also costs almost nothing to support, and if you always return 1, it degenerates to the same binary splits we have now.

It is also useful to know if we know the exact size of a split, and moreover if we know we'll know exact sizes all the way down. This is true for array sources, for example, and can be used when the pipeline terminates in a toArray, because then the leaves can write their results directly into a big shared array in exactly the right spot. So, for example: list.stream().map(...).toArray() can get by with allocating only a single array for the result, partitioning the input data set, and each leaf writes the results into the right place, avoiding copying. For this, we have two methods: boolean isExactSplits() which means that "all my child spliterators will commit to knowing their size", and long getSizeIfKnown() // returns -1 if unknown where if isExactSplits() is true, getSizeIfKnown is guaranteed to not return -1. For simplicity, since both size-bearing methods (exact and estimated) have a "I don't know" value, they probably should both be -1 rather than one being -1 and one being MAXVALUE. Here's where I currently am on the API: public interface Spliterator { // For split decision support int getNaturalSplits(); int estimateSize() default { return getSizeIfKnown(); } // Exact-sizing support int getSizeIfKnown() default { return -1; } boolean isPredictableSplits() default { return false; } // Element access Spliterator split(); Iterator iterator(); void into(Sink<T, ?, ?> sink) default { sink.begin(estimateSize()); Iterator remaining = iterator(); while (remaining.hasNext()) { sink.accept(remaining.next()); } sink.end(); } } As to state constraints, I am currently thinking: - Can't call split() after calling iterator() Still thinking about: - Should sizing methods also become invalid after calling iterator()? (This would reduce the need to keep the count accurate.) (Both of the above are cheap to enforce because they only add code on the splitting path, not the per-element path.) As to cost imposition, it may look like having an explicit Iterator method (instead of extending Iterator) means creating an Iterator, but what I've found is that in all cases, I can have one class that implements StreamAccessor, Spliterator, and Iterator, so the Iterator implementation can just "return this". Also since we have to write a flag to indicated "have started iterating", with an iterator() method, we have a natural place to write that once, rather than writing it each time through hasNext/next. So when you're ready to start iterating, you usually don't have to create an extra Iterator object. I think this is a pretty good balance of all the issues we've discussed so far, and easy enough to implement efficiently.

Previous message: Fwd: Re: Stream.limit() - puzzler/bug/feature
Next message: Spliterator
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the lambda-libs-spec-observers mailing list