[OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste (original) (raw)
Laurent Bourgès bourges.laurent at gmail.com
Wed Apr 10 11:20:23 UTC 2013
- Previous message: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
- Next message: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Peter,
1/ I modified your TestRunner class to print total ops and perform explicit GC before runTests: http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/
2/ I applied your advice but it does not change much:
-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] #------------------------------------------------------------- # ContextGetInt4: run duration: 10 000 ms # # Warm up: # 4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op), Total ops = 2889056179 [ 13,93 (717199825), 13,87 (720665624), 13,48 (741390545), 14,09 (709800185)] # 4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op), Total ops = 2811615084 [ 15,21 (658351236), 14,18 (706254551), 13,94 (718202949), 13,74 (728806348)] cleanup (explicit Full GC) ... cleanup done. # Measure: 1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops = 1678357614 [ 5,96 (1678357614)] 2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops = 2729723450 [ 7,31 (1369694121), 7,36 (1360029329)] 3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops = 2817154340 [ 13,24 (755190111), 13,23 (755920429), 7,66 (1306043800)] 4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops = 2589897733 [ 17,05 (586353618), 19,23 (519345153), 17,88 (559401974), 10,81 (924796988)]
-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:-TieredCompilation -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] #------------------------------------------------------------- # GetInt4: run duration: 10 000 ms # # Warm up: # 4 threads, Tavg = 31,56 ns/op (σ = 0,43 ns/op), Total ops = 1267295706 [ 31,30 (319512554), 31,02 (322293333), 32,12 (311334550), 31,82 (314155269)] # 4 threads, Tavg = 30,75 ns/op (σ = 1,81 ns/op), Total ops = 1302123211 [ 32,21 (310949394), 32,37 (309275124), 27,87 (359125007), 31,01 (322773686)] cleanup (explicit Full GC) ... cleanup done. # Measure: 1 threads, Tavg = 8,36 ns/op (σ = 0,00 ns/op), Total ops = 1196238323 [ 8,36 (1196238323)] 2 threads, Tavg = 14,95 ns/op (σ = 0,04 ns/op), Total ops = 1337648720 [ 15,00 (666813210), 14,91 (670835510)] 3 threads, Tavg = 20,65 ns/op (σ = 0,99 ns/op), Total ops = 1453119707 [ 19,57 (511100480), 21,97 (455262170), 20,54 (486757057)] 4 threads, Tavg = 30,76 ns/op (σ = 0,54 ns/op), Total ops = 1301090278 [ 31,51 (317527231), 30,79 (324921525), 30,78 (325063322), 29,99 (333578200)] # << JVM END
3/ I tried several heap settings: without Xms/Xmx ... but it has almost no effect.
*Should I play with TLAB resize / initial size ? or different GC collector (G1 ...) ?
Does anybody can explain me what PrintTLAB mean ?*
Laurent
2013/4/10 Peter Levart <peter.levart at gmail.com>
Hi Laurent,
Could you disable tiered compilation for performance tests? Tiered compilation is usually a source of jitter in the results. Pass -XX:-TieredCompilation to the VM. Regards, Peter
On 04/10/2013 10:58 AM, Laurent Bourgès wrote: Dear Jim, 2013/4/9 Jim Graham <james.graham at oracle.com>
The allocations will always show up on a heap profiler, I don't know of any way of having them not show up if they are stack allocated, but I don't think that stack allocation is the issue here - small allocations come out of a fast generation that costs almost nothing to allocate from and nearly nothing to clean up. They are actually getting allocated and GC'd, but the process is optimized. The only way to tell is to benchmark and see which changes make a difference and which are in the noise (or, in some odd counter-intuitive cases, counter-productive)... ...jim I advocate I like GC because it avoids in Java dealing with pointers like C/C++ does; however, I prefer GC clean real garbage (application...) than wasted memory: I prefer not count on GC when I can avoid wasting memory that gives GC more work = reduce useless garbage (save the planet) ! Moreover, GC and / or Thread local allocation (TLAB) seems to have more overhead than you think = "fast generation that costs almost nothing to allocate from and nearly nothing to clean up". Here are my micro-benchmark results related to int[4] allocation where I mimic the AAShapePipe.fillParallelogram() method: Patch Ref Gain 5,96 8,27 138,76% 7,31 14,96 204,65% 10,65 20,4 191,55% 15,44 29,83 193,20% Test environment: Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus) JVM settings: -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m -Xmx128m Benchmark code (using Peter Levart microbench classes): http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/ My conclusion is: "nothing" > zero (allocation + cleanup) and it is very noticeable in multi threading tests. I advocate that I use a dirty int[4] array (no cleanup) but it is not necessary : maybe the performance gain come from that reason. Finally here is the output with -XX:+PrintTLAB flag: TLAB: gc thread: 0x00007f105813d000 [id: 4053] desiredsize: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 20 waste 1,2% gc: 323712B slow: 600B fast: 0B TLAB: gc thread: 0x00007f105813a800 [id: 4052] desiredsize: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 7 waste 7,9% gc: 745568B slow: 176B fast: 0B TLAB: gc thread: 0x00007f1058138800 [id: 4051] desiredsize: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 15 waste 3,1% gc: 618464B slow: 448B fast: 0B TLAB: gc thread: 0x00007f1058136800 [id: 4050] desiredsize: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 7 waste 0,0% gc: 0B slow: 232B fast: 0B TLAB: gc thread: 0x00007f1058009000 [id: 4037] desiredsize: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 1 waste 27,5% gc: 369088B slow: 0B fast: 0B TLAB totals: thrds: 5 refills: 50 max: 20 slow allocs: 0 max 0 waste: 3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B max: 0B I would have expected that TLAB can recycle all useless int[4] arrays as fast as possible => waste = 100% ??? *Is there any bug in TLAB (core-libs) ? Should I send such issue to hotspot team ? * Test using ThreadLocal AAShapePipeContext: { AAShapePipeContext ctx = getThreadContext(); int abox[] = ctx.abox; // use array: // mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y, dx1, dy1, dx2, dy2, 0, 0, clip, abox); abox[0] = 7; abox[1] = 11; abox[2] = 13; abox[3] = 17; // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox); devNull1.yield(abox); if (!useThreadLocal) { restoreContext(ctx); } } -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] #------------------------------------------------------------- # ContextGetInt4: run duration: 10 000 ms # # Warm up: # 4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op), Total ops = 2889056179 [ 13,93 (717199825), 13,87 (720665624), 13,48 (741390545), 14,09 (709800185)] # 4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op), Total ops = 2811615084 [ 15,21 (658351236), 14,18 (706254551), 13,94 (718202949), 13,74 (728806348)] cleanup (explicit Full GC) ... cleanup done. # Measure: *1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops = 1678357614 [ 5,96 (1678357614)] 2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops = 2729723450 [ 7,31 (1369694121), 7,36 (1360029329)] 3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops = 2817154340 [ 13,24 (755190111), 13,23 (755920429), 7,66 (1306043800)] **4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops = 2589897733 [ 17,05 (586353618), 19,23 (519345153), 17,88 (559401974), 10,81 *(924796988)] # << JVM END_ _*Test using standard int[4] allocation:*_ _{_ _int abox[] = new int[4];_ _// use array:_ _// mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y,_ _dx1, dy1, dx2, dy2, 0, 0, clip, abox);_ _abox[0] = 7;_ _abox[1] = 11;_ _abox[2] = 13;_ _abox[3] = 17;_ _// mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);_ _devNull1.yield(abox);_ _}_ _-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728_ _-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal_ _-XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC_ _>> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] #------------------------------------------------------------- # GetInt4: run duration: 10 000 ms # # Warm up: # 4 threads, Tavg = 31,07 ns/op (σ = 0,60 ns/op), Total ops = 1287292142 [ 30,26 (330475567), 31,92 (313328449), 31,27 (319805520), 30,89 (323682606)] # 4 threads, Tavg = 30,94 ns/op (σ = 0,33 ns/op), Total ops = 1293000783 [ 30,92 (323382193), 30,61 (326730340), 31,48 (317621402), 30,74 (325266848)] cleanup (explicit Full GC) ... cleanup done. # Measure: *1 threads, Tavg = 8,27 ns/op (σ = 0,00 ns/op), Total ops = 1209213909 [ 8,27 (1209213909)] 2 threads, Tavg = 14,96 ns/op (σ = 0,04 ns/op), Total ops = 1337024734 [ 15,00 (666659967), 14,92 (670364767)] 3 threads, Tavg = 20,40 ns/op (σ = 1,03 ns/op), Total ops = 1470560922 [ 21,21 (471592958), 19,00 (526302911), 21,16 (472665053)] **4 threads, Tavg = 29,83 ns/op (σ = 1,82 ns/op), Total ops = 1340065128 [ 31,17 (320806983), 31,58 (316358130), 26,94 (370806790), 30,11 *(332093225)] # << JVM END Best regards, Laurent
- Previous message: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
- Next message: [OpenJDK 2D-Dev] AAShapePipe concurrency & memory waste
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]