GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream (original) (raw)
Ian Rogers irogers at google.com
Fri Mar 16 17:19:58 UTC 2018
- Previous message: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream
- Next message: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Thanks Paul, very interesting.
On Fri, Mar 16, 2018 at 9:21 AM Paul Sandoz <paul.sandoz at oracle.com> wrote:
Hi Ian, Thomas,
Some background on the bulk copying for byte buffers after talking with Mikael who worked on these changes a few years ago. Please correct the following if needed as our memory is hazy :-) IIRC at the time we felt this was a reasonable thing to do because C2 did not strip mine the loops for the bulk copying of large Java arrays i.e. the issue was there anyway for more common cases. However, i believe that may no longer be the so in some cases after Roland implemented loop strip mining in C2 [1]. So we should go back and revisit/check the current support in buffers and Java arrays (System.arraycopy etc).
The C2 issue is a well known TTSP issue :-) Its great that there is a strip mining optimization, revisiting the bulk copies would be great!
(This is also something we need to consider if we modify buffers to support capacities larger than Integer.MAXVALUE. Also connects with Project Panama.)
If Thomas has not done so or does not plan to i can log an issue for you.
That'd be great. I wonder if identifying more TTSP issues should also be a bug. Its interesting to observe that overlooking TTSP in C2 motivated the Unsafe.copyMemory change permitting a fresh TTSP issue. If TTSP is a 1st class issue then maybe we can deprecate JNI critical regions to support that effort :-)
Thanks, Ian
Paul.
[1] https://bugs.openjdk.java.net/browse/JDK-8186027 On Mar 15, 2018, at 10:49 AM, Ian Rogers <irogers at google.com> wrote: +hotspot-gc-dev On Thu, Mar 15, 2018 at 1:25 AM Thomas Schatzl <thomas.schatzl at oracle.com> wrote: Hi, On Thu, 2018-03-15 at 01:00 +0000, Ian Rogers wrote: An old data point on how large a critical region should be comes from java.nio.Bits. In JDK 9 the code migrated into unsafe, but in JDK 8 the copies within a critical region were bound at most copying 1MB: http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/ native/java/nio/Bits.c#l88 This is inconsistent with Deflater and ObjectOutputStream which both allow unlimited arrays and thereby critical region sizes. In JDK 9 the copies starve the garbage collector in nio Bits too as there is no 1MB limit to the copy sizes: http://hg.openjdk.java.net/jdk/jdk/rev/f70e100d3195 which came from: https://bugs.openjdk.java.net/browse/JDK-8149596 Perhaps this is a regression not demonstrated due to the testing challenge. [...] It doesn't seem unreasonable to have the loops for the copies occur in 1MB chunks but JDK-8149596 lost this and so I'm confused on what the HotSpot stand point is. Please file a bug (seems to be a core-libs/java.nio regression?), preferably with some kind of regression test. Also file enhancements (I would guess) for the other cases allowing unlimited arrays. I don't have perms to file bugs there's some catch-22 scenario in getting the permissions. Happy to have a bug filed or to file were that not an issue. Happy to create a test case but can't see any others for TTSP issues. This feels like a potential use case for jmh, perhaps run the benchmark well having a separate thread run GC bench. Should there be a bug to add, in debug mode, a TTSP watcher thread whose job it is to bring "random" threads into safepoints and report on tardy ones? Should there be a bug to warn on being in a JNI critical for more than just a short period? Seems like there should be a bug on Unsafe.copyMemory and Unsafe.copySwapMemory having TTSP issues. Seems like there should be a bug on all uses of critical that don't chunk their critical region work based on some bound (like 1MB chunks for nio Bits)? How are these bounds set? A past reference that I've lost is in having the bound be the equivalent of 65535 bytecodes due to the expectation of GC work at least once in a method or on a loop backedge - I thought this was in a spec somewhere but now I can't find it. The bytecode size feels as arbitrary as 1MB, a time period would be better but that can depend on the kind of GC you want as delays with concurrent GC mean more than non-concurrent. Clearly the chunk size shouldn't just be 0, but this appears to currently be the norm in the JDK. The original reason for coming here was a 140x slow down in -Xcheck:jni in Deflater.deflate There are a few options there that its useful to enumerate: 1) rewrite in Java but there are correctness and open source related issues 2) remove underflow/overflow protection from critical arrays (revert JDK-6311046 or perhaps bound protection to arrays of a particular small size) - this removes checking and doesn't deal with TTSP 3) add a critical array slice API to JNI so that copies with -Xcheck:jni aren't unbounded (martinrb@ proposed this) - keeps checking but doesn't deal with TTSP 4) rewrite primitive array criticals with GetArrayRegion as O(n) beats the "silent killer" TTSP (effectively deprecate the critical APIs) In general (ie not just the deflate case) I think (1) is the most preferable. (2) and (3) both have TTSP issues. (4) isn't great performance wise, which motivates more use of approach (1), but I think deprecating criticals may just be the easiest and sanest way forward. I think that discussion is worth having on an e-mail thread rather than a bug. Long TTSP is a performance bug as any other. In a way criticals are better than unsafe as they may pin the memory and not starve GC, which shenandoah does. (Region based) Object pinning has its own share of problems: - only (relatively) easily implemented in region based collectors - may slow down pause a bit in presence of pinned regions/objects (for non-concurrent copying collectors) - excessive use of pinning may cause OOME and VM exit probably earlier than the gc locker. GC locker seems to provide a more gradual degradation. E.g. pinning regions typically makes these regions unavailable for allocation. I.e. you still should not use it for many, very long living objects. Of course this somewhat depends on the sophistication of the implementation. I think region based pinning would be a good addition to other collectors than Shenandoah too. It has been on our minds for a long time, but there are so many other more important issues :), so of course we are eager to see contributions in this area. ;) If you are interested on working on this, please ping us on hotspot-gc- dev for implementation ideas to get you jump-started. Thanks, Thomas I'd rather deprecate criticals than build upon the complexity, but I'm very glad this is a concern. Thanks, Ian
- Previous message: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream
- Next message: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]