RFR: JDK-8184947:,ZipCoder performance improvements
Xueming Shen xueming.shen at oracle.com
Fri Dec 8 23:09:31 UTC 2017
Hi,
Please help review the changes for j.u.z.ZipCoder/JDK-8184947 (which also includes cleanup/improvement work in java.lang.StringCoding.java to speed up general String coding performance, especially for UTF8).
issue: https://bugs.openjdk.java.net/browse/JDK-8184947
webrev: http://cr.openjdk.java.net/~sherman/8184947/webrev
jmh benchmark: http://cr.openjdk.java.net/~sherman/8184947/ZipCodingBM.java
               http://cr.openjdk.java.net/~sherman/8184947/StringCodingBM.java
Notes:
(1) StringCoding.de/encode() for new String()/String.getBytes() with default charset.
For historical reasons, the existing SC.decode(byte[], off, len)/encode(coder, val) implementation has code to handle any "possible" UnsupportedEncodingException and to fall back to the slow "charset name" version of de/encode() for the real work. Since Charset.defaultCharset() now returns UTF8 as the fallback if anything goes wrong while obtaining the default charset (we did that in jdk7 or 8?), there is actually no need to handle the UEE. This also provides the opportunity to use the fast path for stateless UTF8/8859-1/ASCII de/encode(). The benchmark data for newString_xxx/getBytes_xxx (which use the default encoding, UTF8 in this case) suggest a big speedup for ascii-only Strings.
StringCodingBM      (size)  Mode  Cnt  NEW Score     Error  OLD Score     Error  Units
getBytes_ASCII          16  avgt    5     21.155 ±   5.586     63.777 ±  54.262  ns/op
getBytes_ASCII          64  avgt    5     20.854 ±   6.237     98.988 ±  62.932  ns/op
getBytes_ASCII         256  avgt    5     38.291 ±   8.494    272.306 ±  77.951  ns/op
getBytes_Latin          16  avgt    5     80.968 ±  15.814     76.769 ±  38.512  ns/op
getBytes_Latin          64  avgt    5    163.078 ±  51.993    219.085 ±  42.665  ns/op
getBytes_Latin         256  avgt    5    759.548 ±  99.386    824.594 ± 763.735  ns/op
getBytes_Unicode        16  avgt    5     94.311 ±  22.189    124.185 ±  32.751  ns/op
getBytes_Unicode        64  avgt    5    289.603 ± 152.056    321.541 ± 103.703  ns/op
getBytes_Unicode       256  avgt    5   1253.098 ± 216.243   1201.667 ± 512.532  ns/op

newString_ASCII         16  avgt    5     33.273 ±  13.780     50.402 ±  17.574  ns/op
newString_ASCII         64  avgt    5     30.420 ±   6.207     84.989 ±  43.355  ns/op
newString_ASCII        256  avgt    5     54.391 ±  10.451    208.096 ± 102.716  ns/op
newString_Latin         16  avgt    5    115.606 ±   7.181    114.186 ±  36.310  ns/op
newString_Latin         64  avgt    5    393.710 ±  73.478    414.286 ± 176.837  ns/op
newString_Latin        256  avgt    5   1618.967 ± 289.044   1551.499 ± 487.904  ns/op
newString_Unicode       16  avgt    5    104.848 ±  32.694    127.558 ±  12.029  ns/op
newString_Unicode       64  avgt    5    377.894 ± 147.731    374.779 ±  53.028  ns/op
newString_Unicode      256  avgt    5   1557.977 ± 318.652   1457.236 ± 284.424  ns/op
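To make the UEE point in (1) concrete, here is a minimal sketch using only the public String API, not the internal StringCoding code touched by the webrev. The class name DefaultCharsetSketch and both helper methods are made up for illustration. It contrasts the "charset name" path, which must declare and handle UnsupportedEncodingException, with the Charset-object path, which cannot fail that way.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetSketch {

    // "Charset name" path: the name is looked up on every call and the
    // method declares the checked UnsupportedEncodingException.
    static byte[] encodeByName(String s, String charsetName) {
        try {
            return s.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            // Fallback chosen for this sketch only (an assumption, not
            // what the JDK does internally).
            return s.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Charset-object path: the charset is already resolved, so there is
    // no checked exception to handle.
    static byte[] encodeByCharset(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        // Charset.defaultCharset() always returns a usable Charset object.
        Charset dflt = Charset.defaultCharset();
        String name = "META-INF/MANIFEST.MF";

        byte[] viaName = encodeByName(name, dflt.name());
        byte[] viaCharset = encodeByCharset(name, dflt);
        byte[] viaDefault = name.getBytes();

        System.out.println(Arrays.equals(viaName, viaCharset));
        System.out.println(Arrays.equals(viaCharset, viaDefault));
    }
}
```

The three calls are expected to agree on the encoded bytes; the difference the patch cares about is only which internal path does the work.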
(2) Updated to use the "fast path" for UTF8/8859-1/ASCII in all de/encoding operations, which are all implemented in static/stateless methods. (A benchmark for MS932 [4] is provided to make sure there is no regression for "other" charsets.)
(3) Added a "fast path" for "ascii-only" bytes in utf8 encoding/getBytes(). The benchmark [1] suggests a big speedup for ascii-only getBytes() with limited cost to the non-ascii-only cases. (This helps a lot for (4), the ZipCoder situation, which is mainly ascii-only.)
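The idea behind the ascii-only fast path in (3) can be sketched as follows. This is not the code from the webrev; it is a standalone illustration that assumes the String content is held as a Latin-1 byte[] (as with compact strings), and the class and method names are invented for the example. If no byte has the high bit set, the UTF-8 form is byte-for-byte identical and a plain array copy is enough; otherwise each byte in 0x80..0xFF expands to two UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiFastPathSketch {

    // Encodes ISO-8859-1 (Latin-1) bytes as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1) {
        // Fast path: scan for any byte with the high bit set (negative
        // when viewed as a signed Java byte).
        boolean asciiOnly = true;
        for (byte b : latin1) {
            if (b < 0) {
                asciiOnly = false;
                break;
            }
        }
        if (asciiOnly) {
            // Pure ASCII: the UTF-8 encoding is the same bytes.
            return Arrays.copyOf(latin1, latin1.length);
        }

        // Slow path: worst case every byte becomes two UTF-8 bytes.
        byte[] out = new byte[latin1.length * 2];
        int n = 0;
        for (byte b : latin1) {
            if (b >= 0) {
                out[n++] = b;
            } else {
                int c = b & 0xFF;                      // 0x80..0xFF
                out[n++] = (byte) (0xC0 | (c >> 6));   // leading byte
                out[n++] = (byte) (0x80 | (c & 0x3F)); // continuation byte
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        String ascii = "META-INF/MANIFEST.MF";
        String latin = "caf\u00e9";

        // Compare against the JDK's own UTF-8 encoder.
        System.out.println(Arrays.equals(
                latin1ToUtf8(ascii.getBytes(StandardCharsets.ISO_8859_1)),
                ascii.getBytes(StandardCharsets.UTF_8)));
        System.out.println(Arrays.equals(
                latin1ToUtf8(latin.getBytes(StandardCharsets.ISO_8859_1)),
                latin.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The extra scan is the "limited cost" paid by the non-ascii cases; for ascii-only input it replaces per-character encoding with a single copy.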
(4) java.util.zip.ZipCoder
This is where this patch actually started. As the rfe suggests, now that we use byte[] as the internal storage for the String class, the optimization we put into ZipCoder for UTF8 (which uses the byte[]/char[] interface of our UTF8 implementation to help avoid the relatively heavy ByteBuffer/CharBuffer coding interface) no longer appears to be that "optimized". The to/from char[] copying has become a waste.
The ZipCoder implementation can't use new String()/String.getBytes() directly because of the different malformed/unmappable character handling requirement. The proposed change here is to add a pair of special new String()/String.getBytes() variants in the StringCoding class that throw IAE instead of doing silent replacement, exposed via (yet another) SharedSecrets interface. This brings us much faster de/encoding (30%-50% speedup) and much less memory usage (no more unnecessary byte[]/char[] allocation, and in default mode there is only ONE utf8 ZipCoder) on all "Jar/ZipEntry" related access operations.
ZipCodeBenchMark [latest]
* "New Score" is with the patch
* getEntry() is mainly String.getBytes(); entries()/stream() is mainly new String(bytes).

Benchmark    Mode  Cnt  New Score   Error  Old Score   Error  Units
jf_entries   avgt   20      0.582 ± 0.036      0.953 ± 0.108  ms/op
jf_getEntry  avgt   20      1.506 ± 0.158      2.052 ± 0.171  ms/op
jf_stream    avgt   20      0.698 ± 0.060      0.940 ± 0.067  ms/op
zf_entries   avgt   20      0.691 ± 0.057      0.917 ± 0.080  ms/op
zf_getEntry  avgt   20      1.459 ± 0.180      2.081 ± 0.161  ms/op
zf_stream    avgt   20      0.626 ± 0.074      0.909 ± 0.075  ms/op
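For readers not familiar with the malformed/unmappable handling difference mentioned in (4), the sketch below shows the two behaviors side by side using only public API. It does not use the SharedSecrets hook proposed in the patch; it goes through CharsetDecoder, which is exactly the ByteBuffer/CharBuffer path the patch is trying to avoid for speed, so treat it as a behavioral illustration only. The class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Sketch {

    // Strict decode: malformed or unmappable input is reported and
    // surfaced as IllegalArgumentException instead of being replaced.
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("malformed input", e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = { 'a', (byte) 0xFF, 'b' };

        // new String(bytes, charset) silently substitutes U+FFFD.
        String lenient = new String(bad, StandardCharsets.UTF_8);
        System.out.println(lenient.indexOf('\uFFFD') >= 0);

        // The strict variant rejects the same input.
        try {
            decodeStrict(bad);
            System.out.println("accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```

With the input above, the lenient path yields a string containing U+FFFD while the strict path throws, which is the behavior the zip code relies on.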
Thanks,
Sherman
[1] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.utf8
[2] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.8859_1
[3] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.ascii
[4] http://cr.openjdk.java.net/~sherman/8184947/StringCoding.ms932
[5] http://cr.openjdk.java.net/~sherman/8184947/ZipCoding.bm