[OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste (original) (raw)

Laurent Bourgès bourges.laurent at gmail.com
Wed Apr 10 08:58:39 UTC 2013

Previous message: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
Next message: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Dear Jim,

2013/4/9 Jim Graham <james.graham at oracle.com>

The allocations will always show up on a heap profiler, I don't know of any way of having them not show up if they are stack allocated, but I don't think that stack allocation is the issue here - small allocations come out of a fast generation that costs almost nothing to allocate from and nearly nothing to clean up. They are actually getting allocated and GC'd, but the process is optimized. The only way to tell is to benchmark and see which changes make a difference and which are in the noise (or, in some odd counter-intuitive cases, counter-productive)... ...jim

I advocate I like GC because it avoids in Java dealing with pointers like C/C++ does; however, I prefer GC clean real garbage (application...) than wasted memory: I prefer not count on GC when I can avoid wasting memory that gives GC more work = reduce useless garbage (save the planet) !

Moreover, GC and / or Thread local allocation (TLAB) seems to have more overhead than you think = "fast generation that costs almost nothing to allocate from and nearly nothing to clean up".

Here are my micro-benchmark results related to int[4] allocation where I mimic the AAShapePipe.fillParallelogram() method: Patch Ref Gain 5,96 8,27 138,76% 7,31 14,96 204,65% 10,65 20,4 191,55% 15,44 29,83 193,20% Test environment: Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus) JVM settings: -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m -Xmx128m

Benchmark code (using Peter Levart microbench classes): http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

My conclusion is: "nothing" > zero (allocation + cleanup) and it is very noticeable in multi threading tests.

I advocate that I use a dirty int[4] array (no cleanup) but it is not necessary : maybe the performance gain come from that reason.

Finally here is the output with -XX:+PrintTLAB flag: TLAB: gc thread: 0x00007f105813d000 [id: 4053] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 20 waste 1,2% gc: 323712B slow: 600B fast: 0B TLAB: gc thread: 0x00007f105813a800 [id: 4052] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 7 waste 7,9% gc: 745568B slow: 176B fast: 0B TLAB: gc thread: 0x00007f1058138800 [id: 4051] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 15 waste 3,1% gc: 618464B slow: 448B fast: 0B TLAB: gc thread: 0x00007f1058136800 [id: 4050] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 7 waste 0,0% gc: 0B slow: 232B fast: 0B TLAB: gc thread: 0x00007f1058009000 [id: 4037] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 1 waste 27,5% gc: 369088B slow: 0B fast: 0B TLAB totals: thrds: 5 refills: 50 max: 20 slow allocs: 0 max 0 waste: 3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B max: 0B

I would have expected that TLAB can recycle all useless int[4] arrays as fast as possible => waste = 100% ???

*Is there any bug in TLAB (core-libs) ? Should I send such issue to hotspot team ? *

Test using ThreadLocal AAShapePipeContext: { AAShapePipeContext ctx = getThreadContext(); int abox[] = ctx.abox;

// use array:
// mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y,

dx1, dy1, dx2, dy2, 0, 0, clip, abox); abox[0] = 7; abox[1] = 11; abox[2] = 13; abox[3] = 17;

// mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
devNull1.yield(abox);

if (!useThreadLocal) {
    restoreContext(ctx);
}

}

-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC

JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] #-------------------------------------------------------------

ContextGetInt4: run duration: 10 000 ms

Warm up:

4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op), Total ops

= 2889056179 [ 13,93 (717199825), 13,87 (720665624), 13,48 (741390545), 14,09 (709800185)]

4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op), Total ops

= 2811615084 [ 15,21 (658351236), 14,18 (706254551), 13,94 (718202949), 13,74 (728806348)] cleanup (explicit Full GC) ... cleanup done.

Measure:

*1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops = 1678357614 [ 5,96 (1678357614)] 2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops = 2729723450 [ 7,31 (1369694121), 7,36 (1360029329)] 3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops = 2817154340 [ 13,24 (755190111), 13,23 (755920429), 7,66 (1306043800)] **4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops = 2589897733 [ 17,05 (586353618), 19,23 (519345153), 17,88 (559401974), 10,81 *(924796988)] # << JVM END

Test using standard int[4] allocation: { int abox[] = new int[4];

  // use array:
// mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y,

dx1, dy1, dx2, dy2, 0, 0, clip, abox); abox[0] = 7; abox[1] = 11; abox[2] = 13; abox[3] = 17;

// mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
devNull1.yield(abox);

}

JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] #-------------------------------------------------------------

GetInt4: run duration: 10 000 ms

Warm up:

4 threads, Tavg = 31,07 ns/op (σ = 0,60 ns/op), Total ops

= 1287292142 [ 30,26 (330475567), 31,92 (313328449), 31,27 (319805520), 30,89 (323682606)]

4 threads, Tavg = 30,94 ns/op (σ = 0,33 ns/op), Total ops

= 1293000783 [ 30,92 (323382193), 30,61 (326730340), 31,48 (317621402), 30,74 (325266848)] cleanup (explicit Full GC) ... cleanup done.

Measure:

*1 threads, Tavg = 8,27 ns/op (σ = 0,00 ns/op), Total ops = 1209213909 [ 8,27 (1209213909)] 2 threads, Tavg = 14,96 ns/op (σ = 0,04 ns/op), Total ops = 1337024734 [ 15,00 (666659967), 14,92 (670364767)] 3 threads, Tavg = 20,40 ns/op (σ = 1,03 ns/op), Total ops = 1470560922 [ 21,21 (471592958), 19,00 (526302911), 21,16 (472665053)] **4 threads, Tavg = 29,83 ns/op (σ = 1,82 ns/op), Total ops = 1340065128 [ 31,17 (320806983), 31,58 (316358130), 26,94 (370806790), 30,11 *(332093225)] # << JVM END

Best regards, Laurent

Previous message: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
Next message: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the core-libs-dev mailing list