GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream (original) (raw)
David Holmes david.holmes at oracle.com
Mon Mar 5 06:33:20 UTC 2018
- Previous message: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream
- Next message: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Ian,
Do you run with -Xcheck:jni in production mode because you load unknown native code libraries? It's mainly intended as a diagnostic option to turn on if you encounter a possible JNI problem.
I'll leave the debate on your actual patch proposal to others more familiar with the library code.
Thanks, David
On 5/03/2018 5:24 AM, Ian Rogers wrote:
Hi,
we've been encountering poor performance with -Xcheck:jni, for the following example the performance is 140x to 510x slower with the flag enabled:
import java.io.ByteArrayOutputStream; import java.io.IOException; import java.util.Random; import java.util.zip.DeflaterOutputStream; public final class CheckJniTest { static void deflateBytesPerformance() throws IOException { byte[] inflated = new byte[1 << 23];_ _new Random(71917367).nextBytes(inflated);_ _ByteArrayOutputStream deflated = new ByteArrayOutputStream();_ _try (DeflaterOutputStream dout = new DeflaterOutputStream(deflated)) {_ _dout.write(inflated, 0, inflated.length);_ _}_ _if (8391174 != deflated.size()) {_ _throw new AssertionError();_ _}_ _}_ _public static void main(String args[]) throws IOException {_ _int n = 5;_ _if (args.length > 0) { n = Integer.parseInt(args[0]); } for (int i = 0; i < n; i++) { long startTime = System.currentTimeMillis(); deflateBytesPerformance(); long endTime = System.currentTimeMillis(); System.out.println("Round " + i + " took " + (endTime - startTime) + "ms"); } } } <<<< The issue is in the libzip Deflater.c JNI code: http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/native/libzip/Deflater.c#l131 The test has an 8MB array to deflate/compress. The DeflaterOutputStream has an buffer of size 512 bytes: http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/classes/java/util/zip/DeflaterOutputStream.java#l128 To compress the array, 16,384 native calls are made that use the 8MB input array and the 512 byte output array. These arrays are accessed using GetPrimitiveArrayCritical that with -Xcheck:jni copies the array: http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/hotspot/share/prims/jniCheck.cpp#l1862 The copying of the arrays leads to 128GB of copying which dominates execution time. One approach to fix the problem is to rewrite libzip in Java. GNU Classpath has such an implementation: http://cvs.savannah.gnu.org/viewvc/classpath/classpath/java/util/zip/Deflater.java?view=markup#l417 A different approach is to use Get/SetByteArrayRegion (using Get/SetByteArrayElements would be no different than the current approach accept potentially causing more copies). I've included a patch and performance data for this approach below where regions of the arrays are copied onto a 512 byte buffer on the stack. The non -Xcheck:jni performance is roughly equivalent before and after the patch, the -Xcheck:jni performance is now similar to the non -Xcheck:jni performance. The choice to go from a using GetPrimitiveArrayCritical to GetByteArrayRegion is a controversial one, as engineers have many different expectations about what critical means and does. GetPrimitiveArrayCritical may have similar overhead to GetByteArrayElements if primitive arrays (possibly of a certain size) are known to be non-moving. There may be a cost to pin critical arrays or regions they exist within. There may be a global or region lock that is in play that can cause interactions with the garbage collector - such interactions may cause tail latency issues in production environments. GetByteArrayRegion loses compared to GetPrimitiveArrayCritical as it must always copy a region of the array for access. Given these issues it is hard to develop a benchmark of GetPrimitiveArrayCritical vs GetByteArrayRegion that can take into account the GC interactions. Most benchmarks will see that avoiding a copy can be good for performance. For more background, Aleksey has a good post on GetPrimitiveArrayCritical here: https://shipilev.net/jvm-anatomy-park/9-jni-critical-gclocker/ A different solution to the performance problem presented here is to change the check JNI implementation to do less copying of arrays. This would be motivated if GetPrimitiveArrayCritical were expected to be used more widely than GetByteArrayRegion in performance sensitive, commonly used code. Given the range of problems possible with GetPrimitiveArrayCritical I'd personally prefer GetByteArrayRegion to be more commonly used, as I'm yet to see a performance problem that made GetPrimitiveArrayCritical so compelling. For example, ObjectOutputStream has burnt me previously: http://hg.openjdk.java.net/jdk/jdk/file/c5eb27eed365/src/java.base/share/native/libjava/ObjectOutputStream.c#l69 and these trivial copy operations, should really be a call to fix the JIT/AOT compilers. Next steps: it'd be great to get this turned in to a bug although its not clear to me whether this is a JDK issue (as it uses GetPrimitiveArrayCritical) or a HotSpot performance issue related to -Xcheck:jni (hence the cross post). We're happy to contribute the attached patch but there should be greater consistency in libzip, were it applied, as there are uses of GetPrimitiveArrayCritical in the inflation/uncompressing code and elsewhere. Thanks, Ian Rogers - Google Current non -Xcheck:jni performance: Round 0 took 382ms Round 1 took 340ms Round 2 took 303ms Round 3 took 256ms Round 4 took 258ms Round 5 took 255ms Round 6 took 260ms Round 7 took 257ms Round 8 took 253ms Round 9 took 246ms Round 10 took 246ms Round 11 took 247ms Round 12 took 245ms Round 13 took 245ms Round 14 took 246ms Round 15 took 244ms Round 16 took 268ms Round 17 took 248ms Round 18 took 247ms Round 19 took 247ms Current -Xcheck:jni performance: Round 0 took 99223ms Round 1 took 56829ms Round 2 took 55620ms Round 3 took 55558ms Round 4 took 55893ms Round 5 took 55779ms Round 6 took 56039ms Round 7 took 54443ms Round 8 took 126490ms Round 9 took 124516ms Round 10 took 125497ms Round 11 took 125022ms Round 12 took 126630ms Round 13 took 122399ms Round 14 took 125623ms Round 15 took 125574ms Round 16 took 123621ms Round 17 took 125374ms Round 18 took 124973ms Round 19 took 125248ms Patched non -Xcheck:jni performance: Round 0 took 363ms Round 1 took 382ms Round 2 took 354ms Round 3 took 274ms Round 4 took 265ms Round 5 took 264ms Round 6 took 266ms Round 7 took 255ms Round 8 took 259ms Round 9 took 245ms Round 10 took 245ms Round 11 took 245ms Round 12 took 245ms Round 13 took 244ms Round 14 took 244ms Round 15 took 245ms Round 16 took 358ms Round 17 took 318ms Round 18 took 311ms Round 19 took 251ms Patched -Xcheck:jni performance: Round 0 took 385ms Round 1 took 373ms Round 2 took 366ms Round 3 took 297ms Round 4 took 301ms Round 5 took 303ms Round 6 took 304ms Round 7 took 308ms Round 8 took 298ms Round 9 took 287ms Round 10 took 290ms Round 11 took 396ms Round 12 took 368ms Round 13 took 328ms Round 14 took 283ms Round 15 took 282ms Round 16 took 296ms Round 17 took 283ms Round 18 took 283ms Round 19 took 281ms
- Previous message: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream
- Next message: GetPrimitiveArrayCritical vs GetByteArrayRegion: 140x slow-down using -Xcheck:jni and java.util.zip.DeflaterOutputStream
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]