(original) (raw)
5-minute review.
Looks pretty good.
Thanks for fixing old bugs found by FindXXcoderBugs.java
Thanks Ulf for your hard work on implementation and review.
Again, charset changes are very hard to review, hence scary,
but relatively easy to test in an automated fashion.
I imagine generating tests with
random input,
random charset,
random choice of direct/non-direct buffers,
and checking that charset behavior is unchanged
except for performance.
---
// it's time to generate the class file
Confusing - you're generating a source file,
and the phrase "class file" usually means
XX.class
typos:
typitcal
date structure.
correspoinding
ruondtrip
file dbcs has trailing newlines
Martin
Ulf, thanks for the review!
Now we have a bugid for this one. The webrev has been updated to address some of your
suggestions (see my below) and rename to
http://cr.openjdk.java.net/\~sherman/6843578/webrev
(old webrevs are still there with webrev.0x)
With this change,
(1)21% reduce of rt.jar size (from 2.99M-with the euc\_tw reduce to 2.35M)
(2)much faster coding performance, �a quick/not\_really\_a\_benchmark result can been found at
� http://cr.openjdk.java.net/\~sherman/6843578/codingPerm
(3)also addressed the bugids listed below.
6639450: IBM949C encoder modifies state of IBM949 encoder
6569191: Cp943 io converter returns U+0000 and U+FFFD for unconvertable character
6577466: Character encoder IBM970 throws a BufferOverflowException
5065777: CharsetEncoder canEncode() methods often incorrectly return false
The test case test/sun/nio/cs/OLD/TestIBMDB.java is used to make sure there is no mapping
regression, there are new mapping entries added (because switching to the latest IBM cdc
mapping tables for some charsets, which the b2c tables we have are not accurate/complete), but
no conflit with previous mappings.
Alan, Martin, I need you guys also help review the change, I will need a name on the putback message.
The number of file is huge, but only the files listed at beginning need attention:-)
Shermandone! good catch.
Ulf Zibis wrote:
\*\*\* Decoder-Suggestions:
(1) Unused imports in DoubleByte-X.java:
� import java.util.Arrays;
� import sun.nio.cs.StandardCharsets;
� import static sun.nio.cs.CharsetMapping.\*;
� import sun.nio.cs.ext.DoubleByte; �// or instead: static sun.nio.cs.ext.DoubleByte.\*;
A good suggestion. I tried, and it does bring in about 10%+% gain of decoding (for non-direct buffer,
(3) Modify dimension of b2c:
� � char\[\]\[\] b2c = new char\[0x100\]\[segSize\];
� so decode :
� � public char decodeDouble(int b1, int b2) {
� � � � if ((b2-=b2Min) < 0 || b2 >= segSize)
� � � � � � return UNMAPPABLE\_DECODING;
� � � � return b2c\[b1\]\[b2\];
� � }
�Benefit\[1\]: increase performance of decoder
�Benefit\[2\]: reduce memory of B2C\_UNMAPPABLE from 8192 to 512 bytes
�Benefit\[3\]: some of b2c pages could be saved (if only containing \\uFFFD)
we also have similar gian in direct buffer, but the % number is not this big). The webrev has been updated
for this one.Decided to keep it. We do have some charsets (not included this time) that care about b2max.
(4) Don't care about b2Max (it's always not far from 0xff):
�Benefit\[4\]: another performance increase of decoder (only check: (b2-=b2Min) < 0)
Changing from the b2c index from 4-bit to 8-bit (#2) should have already improved this a lot (I see
(5) Truncate String segments (there are 65 % "\\uFFFD" in IBM933):
� (fill b2c segments first with "\\uFFFD", then initialize)
�Benefit\[5\]: save up to 180 % superfluous memory and disk-footprint
a 1.5%+ improvement of overall charsets.jar)I doubt it really saves something real, since the "class" should still keep its copy somewhere...and
(6) Unload b2cStr from memory after startup:
� - outsource b2cStr to additional class file like EUC\_TW approach
� - set b2cStr = null after startup (remove final modifier)
�Benefit\[6\]: avoid 100 % superfluous memory-footprint
I will need it for c2b (now I'm "delaying" the c2b init)I tried again last night. char\[\]\[\] is much faster than the String\[\] version in both client
(7) Avoid copying b2cStr to b2c:
� (String#charAt() is fast as char\[\] access)
�Benefit\[7\]: increase startup performance for decoder
and server vm. So keep it asis. (this was actually I switched from String\[\] to char\[\]\[\])I still want to keep one map for one charset, at least for now. It might be something worth researching
(9) Share mappings (IBM930 and IBM939 are 99 % identical):
�Benefit\[9\]: save up to 99 % superfluous disk-footprint
�Benefit\[10\]: save up to 99 % superfluous memory-footprint (if both charsets are loaded)
in the future.This is not something about engineering. It's about license, policy...
(12) Get rid of sun.io package dependency:
�https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
�Benefit[13]: avoid superfluous disk-footprint
�Benefit[14]: save maintenance of sun.io converters
�Disadvantage[1]: published under JRL (waiting for launch of OpenJDK-7 project "charset-enhancement") ;-)
Not like when we dealing with singlebyte charsets. For doublebyte charsets
(17) Decoder#decodeArrayLoop: shortcut for single byte only:
� � int sr = src.remaining();
� � int sl = sp + sr;
� � int dr = dst.remaining();
� � int dl = dp + dr;
� � // single byte only loop
� � int slSB = sp + sr < dr ? sr : dr;
� � while (sp < slSB) {
� � � � char c = b2cSB\[sa\[sp\] && 0xff\];
� � � � if (c == UNMAPPABLE\_DECODING)
� � � � � � break;
� � � � da\[dp++\] = c;
� � � � sp++;
� � }
� �Same for Encoder#encodeArrayLoop
(18) Decoder\_EBCDIC: boolean singlebyteState:
� � if (singlebyteState)
� � � � ...
(19) Decoder\_EBCDIC: decode single byte first:
� � if (singlebyteState)
� � � � c = b2cSB\[b1\];
� � if (c == UNMAPPABLE\_DECODING) {
� � � � ...
� � }
�Benefit\[20\]: should be faster
the priority should be given to doublebyte codepoints, if possible. Not single
byte codepoints.In theory you can do some magic to "join" .nr into .c2b. The price might be more complicated
\*\*\* Encoder-Suggestions:
(21) join \*.nr to \*.c2b files (25->000a becomes 000a->fffd):
�Benefit\[21\]: reduce no. of files
�Benefit\[22\]: simplifies initC2B() (avoids 2 loops)
logic depends on the codepoints. You may end up doing some table lookup for each codepoint
in b2c when processing.
And big thanks for all the suggestions.