Rewrite of IBM doublebyte charsets
Ulf Zibis Ulf.Zibis at gmx.de
Thu May 21 23:41:24 UTC 2009
On 21.05.2009 00:22, Xueming Shen wrote:
> Ulf Zibis wrote:
>> (6) Unload b2cStr from memory after startup:
>> - outsource b2cStr to additional class file like EUCTW approach
>> - set b2cStr = null after startup (remove final modifier)
>> Benefit[6]: avoid 100 % superfluous memory-footprint
>
> I doubt it really saves something real, since the "class" should still keep its copy somewhere... and I will need it for c2b (now I'm "delaying" the c2b init)
I always thought that once a reference is set to null after use, the object would automatically be GCed. Am I wrong? ... But we can do the c2b init from b2c[][] instead of from b2cStr[], so why keep it?
>> (7) Avoid copying b2cStr to b2c (String#charAt() is as fast as char[] access):
>> Benefit[7]: increase startup performance for decoder
>
> I tried again last night. char[][] is much faster than the String[] version in both client and server VM. So keep it as is. (This was actually why I switched from String[] to char[][].)
I'm surprised, because I remembered from older benchmarks that char_array[index] had the same speed as str.charAt(index) after optimization by HotSpot. I also got these results here:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/branches/array_io_string/src/sun/nio/cs/SingleByteFastDecoder.java?rev=&view=markup
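For readers following point (7), the two lookup styles under discussion can be sketched as below. This is only an illustration with a made-up four-entry table; the names `LookupSketch`, `B2C_STR` and `B2C` are hypothetical and not taken from the JDK sources. It shows that both variants return the same mapping. It says nothing about their relative speed, which has to be measured with a proper benchmark on each VM.

```java
public class LookupSketch {
    // Hypothetical tiny mapping table; the real tables are generated at build time.
    static final String B2C_STR = "\u0041\u0042\u0043\uFFFD";

    // One-time copy at startup, as in the char[] variant.
    static final char[] B2C = B2C_STR.toCharArray();

    // Lookup directly in the String (no copy needed at startup).
    static char decodeViaString(int b) {
        return B2C_STR.charAt(b & 0xff);
    }

    // Lookup in the copied char[] table.
    static char decodeViaArray(int b) {
        return B2C[b & 0xff];
    }

    public static void main(String[] args) {
        for (int b = 0; b < B2C.length; b++) {
            System.out.println(b + " -> U+" + Integer.toHexString(decodeViaString(b))
                    + " / U+" + Integer.toHexString(decodeViaArray(b)));
        }
    }
}
```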
>> (12) Get rid of sun.io package dependency:
>> https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
>> Benefit[13]: avoid superfluous disk-footprint
>> Benefit[14]: save maintenance of sun.io converters
>> Disadvantage[1]: published under JRL (waiting for launch of OpenJDK-7 project "charset-enhancement") ;-)
>
> This is not something about engineering. It's about license, policy...
So hopefully we will have the OpenJDK 7 project "charset-enhancement" soon.
>> (17) Decoder#decodeArrayLoop: shortcut for single byte only:
>>     int sr = src.remaining();
>>     int sl = sp + sr;
>>     int dr = dst.remaining();
>>     int dl = dp + dr;
>>     // single byte only loop
>>     int slSB = sp + (sr < dr ? sr : dr);
>>     while (sp < slSB) {
>>         char c = b2cSB[sa[sp] & 0xff];
>>         if (c == UNMAPPABLE_DECODING)
>>             break;
>>         da[dp++] = c;
>>         sp++;
>>     }
>> Same for Encoder#encodeArrayLoop
>> (18) DecoderEBCDIC: boolean singlebyteState:
>>     if (singlebyteState) ...
>> (19) DecoderEBCDIC: decode single byte first:
>>     if (singlebyteState)
>>         c = b2cSB[b1];
>>     if (c == UNMAPPABLE_DECODING) { ... }
>> Benefit[20]: should be faster
>
> This is not like dealing with single-byte charsets. For double-byte charsets, priority should be given to double-byte code points if possible, not to single-byte code points.
- I am assuming that single-byte-only input is a common use case. Am I wrong in the case of EBCDIC?
- This hack doesn't make processing of "normal" mixed input slower once it has fallen through to the "normal" while(...) loop.
- This hack was copied from the UTF-8 coder, where ASCII-only input is a common use case.
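The shortcut from (17) can be sketched as a self-contained method. This is a minimal sketch, not the actual decoder: the class name `SingleByteFastPath`, the method `decodeSingleBytePrefix` and the ASCII-only table `B2C_SB` are assumptions made for illustration. The loop consumes leading single-byte input and stops at the first byte that has no single-byte mapping, so the caller can continue in the general loop.

```java
import java.util.Arrays;

public class SingleByteFastPath {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    // Hypothetical single-byte table: identity for ASCII, everything else
    // marked as "no single-byte mapping" so the general loop takes over.
    static final char[] B2C_SB = new char[256];
    static {
        Arrays.fill(B2C_SB, UNMAPPABLE_DECODING);
        for (int i = 0; i < 0x80; i++) {
            B2C_SB[i] = (char) i;
        }
    }

    /**
     * Decodes the leading run of single-byte input from sa[sp..sl) into
     * da[dp..dl) and returns the number of bytes consumed. It stops at the
     * first byte without a single-byte mapping or when either array is exhausted.
     */
    static int decodeSingleBytePrefix(byte[] sa, int sp, int sl,
                                      char[] da, int dp, int dl) {
        int start = sp;
        // Limit the loop by whichever of source and destination is shorter.
        int slSB = sp + Math.min(sl - sp, dl - dp);
        while (sp < slSB) {
            char c = B2C_SB[sa[sp] & 0xff];
            if (c == UNMAPPABLE_DECODING) {
                break; // leave the rest to the general (double-byte) loop
            }
            da[dp++] = c;
            sp++;
        }
        return sp - start;
    }

    public static void main(String[] args) {
        byte[] src = {'a', 'b', (byte) 0x8e, 'c'};
        char[] dst = new char[8];
        int n = decodeSingleBytePrefix(src, 0, src.length, dst, 0, dst.length);
        System.out.println(n + " bytes decoded: " + new String(dst, 0, n));
    }
}
```

For mixed input the only extra cost is the one failed table lookup that ends the fast loop; whether the shortcut pays off depends on how common single-byte-only input really is for the charset in question.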
>> *** Encoder-Suggestions:
>> (21) join *.nr to *.c2b files (25->000a becomes 000a->fffd):
>> Benefit[21]: reduce no. of files
>> Benefit[22]: simplifies initC2B() (avoids 2 loops)
>
> In theory you can do some magic to "join" .nr into .c2b. The price might be more complicated logic, depending on the codepoints. You may end up doing some table lookup for each codepoint in b2c when processing.
This "magic" should be done in GenerateDBCS.java, so the price only has to be paid once, while building the JDK. But to be honest, it could be done by hand for those few mapping pairs. See my single-byte IBMxxx mappings here:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/make/tools/CharsetMapping/ext/
... and don't forget, it avoids copying the whole b2c.
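One possible reading of suggestion (21) is sketched below, under stated assumptions: the encoder table is built by inverting b2c in a single pass, and the merged *.c2b data is then applied as explicit overrides, where an override to the "unmappable" sentinel plays the role of a former *.nr entry. The class `C2BInitSketch`, the `int[][]` override format and the sample mappings are hypothetical and do not reflect the actual initC2B() signature or file format.

```java
import java.util.Arrays;

public class C2BInitSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';
    static final int UNMAPPABLE_ENCODING = 0xFFFD;

    /**
     * Builds a char-to-byte table by inverting b2c once, then applying
     * overrides. Each override is a pair {char, byteValue}; a byteValue of
     * UNMAPPABLE_ENCODING removes the mapping for that char again.
     * b2c itself is only read, never copied or modified.
     */
    static int[] initC2B(char[] b2c, int[][] overrides) {
        int[] c2b = new int[0x10000];
        Arrays.fill(c2b, UNMAPPABLE_ENCODING);
        for (int b = 0; b < b2c.length; b++) {
            char c = b2c[b];
            if (c != UNMAPPABLE_DECODING) {
                c2b[c] = b;
            }
        }
        for (int[] o : overrides) {
            c2b[o[0]] = o[1];
        }
        return c2b;
    }

    public static void main(String[] args) {
        // Made-up sample data: byte 0x25 decodes to U+000A, byte 0x41 to 'A'.
        char[] b2c = new char[256];
        Arrays.fill(b2c, UNMAPPABLE_DECODING);
        b2c[0x25] = '\n';
        b2c[0x41] = 'A';
        // Override in the spirit of "000a->fffd": U+000A must not round-trip.
        int[][] overrides = { { 0x000a, UNMAPPABLE_ENCODING } };
        int[] c2b = initC2B(b2c, overrides);
        System.out.println("A  -> 0x" + Integer.toHexString(c2b['A']));
        System.out.println("LF -> 0x" + Integer.toHexString(c2b['\n']));
    }
}
```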
> And big thanks for all the suggestions.
Thanks for your appreciation. :-)
-Ulf