Alternative Spliterator for SpinedBuffer
Peter Levart peter.levart at gmail.com
Thu Apr 18 07:39:20 PDT 2013
Hi Brian,
I did some performance measurements. The most obvious difference in performance is observed when the individual operations in the stream pipeline take a relatively long time, but even with fast individual operations the benefit can be observed. For example, this test code:
public class StreamBuilderTest {

    static long test(boolean parallel) {
        StreamBuilder.OfInt builder = Streams.intBuilder();
        for (int i = 0; i < 10000000; i++) {
            builder.accept(i);
        }
        try {
            System.gc();
            Thread.sleep(500L);
            System.gc();
            Thread.sleep(500L);
            System.gc();
            Thread.sleep(500L);
        } catch (InterruptedException e) {}
        long t0 = System.nanoTime();
        double pi = (parallel ? builder.build().parallel() : builder.build().sequential())
            .mapToDouble(i -> (i & 1) == 0
                ?  4d / ((2d * i + 2d) * (2d * i + 3d) * (2d * i + 4d))
                : -4d / ((2d * i + 2d) * (2d * i + 3d) * (2d * i + 4d)))
            .sum() + 3d;
        long t = System.nanoTime() - t0;
        System.out.println("    pi=" + pi + " in " + t + " ns");
        return t;
    }

    static void testN(int n, boolean parallel) {
        System.out.println("  warm-up");
        for (int i = 0; i < 5; i++) test(parallel);
        System.out.println("  measure");
        long tSum = 0;
        for (int i = 0; i < n; i++) tSum += test(parallel);
        System.out.println("  average: " + ((double) tSum / n) + " ns");
    }

    public static void main(String[] args) {
        System.out.println("Sequential:");
        testN(10, false);
        System.out.println("Parallel:");
        testN(10, true);
    }
}
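For reference, the terms being mapped are the Nilakantha series for pi, truncated after 10^7 terms (the tiny difference between the sequential and parallel results below is just floating-point summation order):

$$\pi = 3 + \frac{4}{2\cdot3\cdot4} - \frac{4}{4\cdot5\cdot6} + \frac{4}{6\cdot7\cdot8} - \cdots = 3 + \sum_{i=0}^{\infty} \frac{(-1)^i \, 4}{(2i+2)(2i+3)(2i+4)}$$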
With the current Spliterator implementation, the test prints:
Sequential:
  warm-up
    pi=3.1415926535897913 in 115127248 ns
    pi=3.1415926535897913 in 70426262 ns
    pi=3.1415926535897913 in 74633777 ns
    pi=3.1415926535897913 in 70558263 ns
    pi=3.1415926535897913 in 68995773 ns
  measure
    pi=3.1415926535897913 in 67116015 ns
    pi=3.1415926535897913 in 72918687 ns
    pi=3.1415926535897913 in 68584475 ns
    pi=3.1415926535897913 in 67871743 ns
    pi=3.1415926535897913 in 72419475 ns
    pi=3.1415926535897913 in 67932912 ns
    pi=3.1415926535897913 in 71237839 ns
    pi=3.1415926535897913 in 69636412 ns
    pi=3.1415926535897913 in 66849404 ns
    pi=3.1415926535897913 in 72220404 ns
  average: 6.96787366E7 ns

Parallel:
  warm-up
    pi=3.141592653589793 in 27772347 ns
    pi=3.141592653589793 in 18209373 ns
    pi=3.141592653589793 in 18516459 ns
    pi=3.141592653589793 in 16655371 ns
    pi=3.141592653589793 in 19956267 ns
  measure
    pi=3.141592653589793 in 21184246 ns
    pi=3.141592653589793 in 20352169 ns
    pi=3.141592653589793 in 21195477 ns
    pi=3.141592653589793 in 21287557 ns
    pi=3.141592653589793 in 16230650 ns
    pi=3.141592653589793 in 20035680 ns
    pi=3.141592653589793 in 18355295 ns
    pi=3.141592653589793 in 21971399 ns
    pi=3.141592653589793 in 20552600 ns
    pi=3.141592653589793 in 19332368 ns
  average: 2.00497441E7 ns
And with the alternative Spliterator:
Sequential:
  warm-up
    pi=3.1415926535897913 in 111301836 ns
    pi=3.1415926535897913 in 73685822 ns
    pi=3.1415926535897913 in 68454448 ns
    pi=3.1415926535897913 in 71583982 ns
    pi=3.1415926535897913 in 68397094 ns
  measure
    pi=3.1415926535897913 in 67047559 ns
    pi=3.1415926535897913 in 67424431 ns
    pi=3.1415926535897913 in 67079863 ns
    pi=3.1415926535897913 in 73486488 ns
    pi=3.1415926535897913 in 76055108 ns
    pi=3.1415926535897913 in 74366550 ns
    pi=3.1415926535897913 in 69808517 ns
    pi=3.1415926535897913 in 65960535 ns
    pi=3.1415926535897913 in 65969289 ns
    pi=3.1415926535897913 in 70446736 ns
  average: 6.97645076E7 ns

Parallel:
  warm-up
    pi=3.1415926535897913 in 21529026 ns
    pi=3.1415926535897913 in 15567913 ns
    pi=3.1415926535897913 in 15090407 ns
    pi=3.1415926535897913 in 15071715 ns
    pi=3.1415926535897913 in 14593676 ns
  measure
    pi=3.1415926535897913 in 15227210 ns
    pi=3.1415926535897913 in 15406588 ns
    pi=3.1415926535897913 in 14631733 ns
    pi=3.1415926535897913 in 15719352 ns
    pi=3.1415926535897913 in 15201443 ns
    pi=3.1415926535897913 in 16537214 ns
    pi=3.1415926535897913 in 15365526 ns
    pi=3.1415926535897913 in 14674482 ns
    pi=3.1415926535897913 in 19898638 ns
    pi=3.1415926535897913 in 16947804 ns
  average: 1.5960999E7 ns
But don't take these measurements for granted; you can try it for yourself. I'm sure you have lots of performance tests that involve the SpinedBuffer spliterator. Here's the updated patch, which also incorporates the primitive variants:
I made it so that the original/alternative implementation can be switched with a system property, so the same test can easily be run with both implementations to compare results.
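For illustration, here is a minimal, self-contained sketch of that kind of switch. The property name and the two stand-in spliterators are hypothetical; the actual property name and wiring are in the patch, which switches between the original and alternative SpinedBuffer spliterators in the same way.

import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

public class SwitchableSpliteratorDemo {

    // Hypothetical property name, read once at class load.
    static final boolean USE_ALT = Boolean.getBoolean("spinedbuffer.altSpliterator");

    static Spliterator<Integer> spliterator(List<Integer> data) {
        // In the patch this chooses between the original and the alternative
        // SpinedBuffer spliterator; here two unrelated stand-ins are used.
        return USE_ALT
                ? data.spliterator()                                  // "alternative" stand-in
                : Arrays.spliterator(data.toArray(new Integer[0]));   // "original" stand-in
    }

    public static void main(String[] args) {
        // Run with -Dspinedbuffer.altSpliterator=true to flip the implementation used.
        spliterator(Arrays.asList(1, 2, 3)).forEachRemaining(System.out::println);
    }
}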
Regards, Peter
On 04/12/2013 07:10 PM, Brian Goetz wrote:
Thanks for taking a look at this code! Your approach does look like it would yield a better decomposition -- and this is a really good time to be looking at this, as the APIs have mostly settled.
I agree that there are improvements to be made here. Do your performance measurements show improved runtime as well as a reduced task count?

On 4/12/2013 11:10 AM, Peter Levart wrote:

On 04/11/2013 09:53 PM, brian.goetz at oracle.com wrote:
Changeset: 5e7f8797b9c1
Author:    briangoetz
Date:      2013-04-11 15:53 -0400
URL:       http://hg.openjdk.java.net/lambda/lambda/jdk/rev/5e7f8797b9c1

Spec pass on StreamBuilder

! src/share/classes/java/util/stream/StreamBuilder.java
! src/share/classes/java/util/stream/Streams.java
Hi,

I played a little with StreamBuilder and it works quite optimally using a private SpinedBuffer under the hood. The only sub-optimality is when it starts small (min. 16 elements) and then grows big, and after that it is used in a parallel computation where the maximum leaf task size is more than 16 elements. For example, this code:

public class SpinedBufferTest {

    static int dump(Spliterator<Integer> spliterator, int level, int iterateSize) {
        System.out.print("                                ".substring(0, level));
        System.out.print(spliterator.estimateSize());
        Spliterator<Integer> spl2;
        if (spliterator.estimateSize() > iterateSize && (spl2 = spliterator.trySplit()) != null) {
            System.out.println(": split " + spl2.estimateSize() + " + " + spliterator.estimateSize());
            return dump(spl2, level + 1, iterateSize) + dump(spliterator, level + 1, iterateSize);
        } else {
            Integer[] minMax = new Integer[2];
            spliterator.forEachRemaining(x -> {
                minMax[0] = minMax[0] == null ? x : Math.min(minMax[0], x);
                minMax[1] = minMax[1] == null ? x : Math.max(minMax[1], x);
            });
            System.out.println(": (" + minMax[0] + "..." + minMax[1] + ")");
            return 1;
        }
    }

    public static void main(String[] args) {
        SpinedBuffer<Integer> buf = new SpinedBuffer<>();
        for (int i = 0; i < 16384; i++) {
            buf.accept(i);
        }
        int leafs = dump(buf.spliterator(), 0, 1024);
        System.out.println("\nTotal leafs: " + leafs);
    }
}

It prints the splits that would occur for 16k elements when the max. leaf task size is 1k:

16384: split 16 + 16368
 16: (0...15)
 16368: split 16 + 16352
  16: (16...31)
  16352: split 32 + 16320
   32: (32...63)
   16320: split 64 + 16256
    64: (64...127)
    16256: split 128 + 16128
     128: (128...255)
     16128: split 256 + 15872
      256: (256...511)
      15872: split 512 + 15360
       512: (512...1023)
       15360: split 1024 + 14336
        1024: (1024...2047)
        14336: split 2048 + 12288
         2048: split 1024 + 1024
          1024: (2048...3071)
          1024: (3072...4095)
         12288: split 4096 + 8192
          4096: split 2048 + 2048
           2048: split 1024 + 1024
            1024: (4096...5119)
            1024: (5120...6143)
           2048: split 1024 + 1024
            1024: (6144...7167)
            1024: (7168...8191)
          8192: split 4096 + 4096
           4096: split 2048 + 2048
            2048: split 1024 + 1024
             1024: (8192...9215)
             1024: (9216...10239)
            2048: split 1024 + 1024
             1024: (10240...11263)
             1024: (11264...12287)
           4096: split 2048 + 2048
            2048: split 1024 + 1024
             1024: (12288...13311)
             1024: (13312...14335)
            2048: split 1024 + 1024
             1024: (14336...15359)
             1024: (15360...16383)

Total leafs: 22

7 leaf tasks are half the max. size or less.

Now SpinedBuffer has chunks whose sizes increase exponentially in powers of 2: 16, 16, 32, 64, 128, 256, 512, ... Each chunk has capacity equal to the sum of the capacities of all previous chunks. So a nearly optimal split (modulo the last, possibly non-full chunk) would be to always cut off the last chunk: the returned Spliterator contains all the chunks up to the last one, and this Spliterator keeps only the last chunk.
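A minimal, self-contained sketch of that cut-off-the-last-chunk strategy follows; the class name and the chunk representation here are illustrative only, and the actual change (the webrev below) works directly on SpinedBuffer's internal spine and tracks indices instead of copying.

import java.util.Arrays;
import java.util.Spliterator;
import java.util.function.Consumer;

class ChunkedSpliterator<T> implements Spliterator<T> {

    private final T[][] chunks;   // chunks in encounter order, e.g. sizes 16, 16, 32, 64, ...
    private int chunkIndex;       // next chunk to traverse
    private int elementIndex;     // next element within chunks[chunkIndex]

    ChunkedSpliterator(T[][] chunks) {
        this.chunks = chunks;
    }

    @Override
    public Spliterator<T> trySplit() {
        int remainingChunks = chunks.length - chunkIndex;
        if (remainingChunks > 1) {
            // Cut off the last chunk: the returned (prefix) spliterator covers all remaining
            // chunks except the last one; this spliterator keeps only the last chunk.
            ChunkedSpliterator<T> prefix =
                    new ChunkedSpliterator<>(Arrays.copyOfRange(chunks, chunkIndex, chunks.length - 1));
            prefix.elementIndex = elementIndex;
            chunkIndex = chunks.length - 1;
            elementIndex = 0;
            return prefix;
        }
        if (remainingChunks == 1) {
            // Down to a single chunk (a plain array): halve it, as an array spliterator would.
            T[] chunk = chunks[chunkIndex];
            int remaining = chunk.length - elementIndex;
            if (remaining > 1) {
                int mid = elementIndex + remaining / 2;
                Spliterator<T> prefix = Arrays.spliterator(chunk, elementIndex, mid);
                elementIndex = mid;
                return prefix;
            }
        }
        return null;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        while (chunkIndex < chunks.length) {
            T[] chunk = chunks[chunkIndex];
            if (elementIndex < chunk.length) {
                action.accept(chunk[elementIndex++]);
                return true;
            }
            chunkIndex++;
            elementIndex = 0;
        }
        return false;
    }

    @Override
    public long estimateSize() {
        long size = 0;
        for (int i = chunkIndex; i < chunks.length; i++) {
            size += chunks[i].length;
        }
        return size - elementIndex;
    }

    @Override
    public int characteristics() {
        return ORDERED | SIZED | SUBSIZED;
    }
}

Because each chunk's capacity equals the sum of the capacities of all previous chunks, cutting off the last chunk splits the remaining elements roughly in half, which is what produces the balanced split tree shown further below.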
Here's my attempt at modifying the spliterator to act like that (I only did it for the generic Object variant in the first try, but I can prepare the primitive variants too, if desired):

http://dl.dropboxusercontent.com/u/101777488/jdk8-lambda/SpinedBufferAltSpliterator/webrev.01/index.html

With this modification, the above code prints:

16384: split 8192 + 8192
 8192: split 4096 + 4096
  4096: split 2048 + 2048
   2048: split 1024 + 1024
    1024: (0...1023)
    1024: (1024...2047)
   2048: split 1024 + 1024
    1024: (2048...3071)
    1024: (3072...4095)
  4096: split 2048 + 2048
   2048: split 1024 + 1024
    1024: (4096...5119)
    1024: (5120...6143)
   2048: split 1024 + 1024
    1024: (6144...7167)
    1024: (7168...8191)
 8192: split 4096 + 4096
  4096: split 2048 + 2048
   2048: split 1024 + 1024
    1024: (8192...9215)
    1024: (9216...10239)
   2048: split 1024 + 1024
    1024: (10240...11263)
    1024: (11264...12287)
  4096: split 2048 + 2048
   2048: split 1024 + 1024
    1024: (12288...13311)
    1024: (13312...14335)
   2048: split 1024 + 1024
    1024: (14336...15359)
    1024: (15360...16383)

Total leafs: 16

Regards, Peter