Rewrite of IBM doublebyte charsets

Ulf Zibis Ulf.Zibis at gmx.de
Sat May 9 17:50:39 UTC 2009


On 01.05.2009 08:48, Xueming Shen wrote:

Hi,

While I'm waiting for Alan's code-review result for my rewriting of EUCTW http://cr.openjdk.java.net/~sherman/68317946229811/webrev (much faster, much smaller, near 8% decrease of size of charsets.jar with one charset update. OK, it's a shame...I meant the old data structure)

EUC_TW statistics:

Plane    range       length  segments    segments-usage-ratio

0        a1a1-fdcb     5868   5d = 93    66 %
_0       a1a1-a744      434    7 =  7    65 %
_1       c2a1-fdcb     5434   3c = 60    95 %

1:8ea2   -f2c4         7650   52 = 82    98 %
2:8ea3   -e7aa         6394   47 = 71    95 %
3:8ea4   -eedc         7286   4e = 78    98 %
4:8ea5   -fcd1         8601   5c = 92    98 %
5:8ea6   -e4fa         6385   44 = 68    99 %
6:8ea7   -ebd5         6532   4b = 75    98 %
7:8eaf   -edb9         8721   4d = 77    92 %

Sum:                  55446  262 = 610

memory amount for all segments (not truncated): 610 * 95 = 57950 code points

*** Decoder-Suggestions:

(1) Increase dimension of b2c and decouple plane 0:
    String[] b2c = new String[0x10]
    String b2c_0 = ...
    Benefit[1]: save calculation of plane no. to range 0..7 (but mask by 0xa0)
    Benefit[2]: save range-check for plane (catch malformed plane by NPE)
    sophisticated (additionally save masking of plane no.):
    String[] b2c = new String[0xb0]
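A minimal sketch of suggestion (1), not the actual JDK code. It assumes the table index is the low nibble of the plane byte (so 0xa2 selects slot 0x2 and 0xaf selects slot 0xf); the class name, the decode() signature and the table contents are hypothetical placeholders.

```java
final class PlaneLookupSketch {

    // One slot per possible low nibble of the plane byte; unused planes stay null.
    static final String[] b2c = new String[0x10];

    static {
        // Placeholder data, NOT real EUC_TW mappings.
        b2c[0x2] = "AB";
        b2c[0xf] = "C";
    }

    /** Returns the mapped char, or -1 if the plane or the index is malformed. */
    static int decode(int planeByte, int index) {
        try {
            // No explicit range check: a null slot raises NullPointerException,
            // a bad index raises StringIndexOutOfBoundsException.
            return b2c[planeByte & 0x0f].charAt(index);
        } catch (RuntimeException e) {
            return -1;
        }
    }
}
```

Masking alone would also accept a plane byte such as 0xb2, which is presumably why the "sophisticated" variant indexes a 0xb0-sized array by the unmasked byte.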

(2) Save Strings in 2-dimensional array:
    String[][] b2c = new String[0x10][]
    String[] b2c_0 = new String[0x5d]
    b2c[0x2] = new String[0x52]
    b2c[0x3] = new String[0x47]
    b2c[0x4] = new String[0x4e]
    b2c[0x5] = new String[0x5c]
    b2c[0x6] = new String[0x44]
    b2c[0x7] = new String[0x4b]
    b2c[0xf] = new String[0x4d]
    sophisticated (segments a8..c1 are unused in plane 0):
    String[] b2c_0 = new String[0x07]
    String[] b2c_1 = new String[0x3c]
    Benefit[3]: save calculation of index (multiplying with dbSegSize), but add 1 indirection
    Benefit[4]: save range-check for segment index (catch malformed segment index by NPE)
    Benefit[5]: save range-check for String index (catch malformed String indexes by IndexOutOfBoundsException)
    Benefit[6]: avoid 22 % superfluous memory and disk-footprint

(3) Truncate Strings (catch unmappable String indexes by IndexOutOfBoundsException):
    Benefit[7]: save another 4 % superfluous memory and disk-footprint

Note: All exceptions can be caught at once, as they are all subclasses of RuntimeException.
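Suggestions (2) and (3) together could look roughly like the sketch below. It assumes both trailing bytes start at 0xa1 (as the ranges in the statistics suggest) and uses placeholder data; the class name, SEG_BASE and the decode() signature are hypothetical.

```java
final class SegmentLookupSketch {

    static final int SEG_BASE = 0xa1; // assumed first valid value of a trailing byte

    // First dimension: plane (low nibble), second dimension: segment.
    static final String[][] b2c = new String[0x10][];

    static {
        b2c[0x2] = new String[0x52];
        // Truncated String with placeholder data, NOT real EUC_TW mappings.
        b2c[0x2][0] = "XY";
    }

    /** Returns the mapped char, or -1 for any malformed or unmappable input. */
    static int decode(int planeByte, int b1, int b2) {
        try {
            return b2c[planeByte & 0x0f][b1 - SEG_BASE].charAt(b2 - SEG_BASE);
        } catch (RuntimeException e) {
            // NullPointerException: unused plane or empty segment
            // IndexOutOfBoundsException: bad segment index or truncated String
            return -1;
        }
    }
}
```

A single catch clause replaces the three range checks, at the price of one extra array indirection per lookup.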

(4) Save mappings in data file (modified UTF-8-saved chars need 2.97 bytes on average):
    Benefit[8]: save modified UTF-8 decoding while loading class file
    Benefit[9]: avoid another 48 % superfluous disk-footprint
    Note: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795536
    (I have already created a patch, but I'm waiting for the launch of OpenJDK-7 project "charset-enhancement".)
    Disadvantage[1]: loading data from jar-file may be slow, but ...
    - host data file outside of jar, as loading by nio.channels.FileChannel into a direct buffer should be fast
    - resolve Bugs:
      http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6818736
      http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6818737
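To illustrate suggestion (4), here is a rough sketch of reading a mapping table through a FileChannel into a direct buffer. The file format (each char as 2 big-endian bytes), the class and the method names are assumptions for illustration only; the java.nio.file API used here is also newer than this mail.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappingFileSketch {

    /** Writes each char as 2 big-endian bytes (hypothetical file format). */
    static void store(Path file, char[] chars) {
        ByteBuffer buf = ByteBuffer.allocate(chars.length * 2);
        buf.asCharBuffer().put(chars); // fills the backing array via a char view
        try {
            Files.write(file, buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file into a direct buffer and views it as chars. */
    static char[] load(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0) {
                // keep reading until the buffer is full or EOF is reached
            }
            buf.flip();
            char[] chars = new char[buf.remaining() / 2];
            buf.asCharBuffer().get(chars); // no modified UTF-8 decoding needed
            return chars;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for trying it out: store to a temp file, load again, clean up. */
    static char[] roundTrip(char[] chars) {
        try {
            Path tmp = Files.createTempFile("b2c", ".dat");
            try {
                store(tmp, chars);
                return load(tmp);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this format every char costs exactly 2 bytes on disk, compared to the 2.97 bytes on average quoted above for modified UTF-8.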

(5) Generate mappings as surrogate pairs:
    High surrogates could be saved as bytes and ANDed by 0xdc00, as they won't exceed 0xdc80
    Benefit[10]: save decoding to surrogate pairs (I guess, this would significantly increase performance)
    Benefit[11]: save b2cIsSupp[] (saves another 4 % memory and disk-footprint)
    Disadvantage[2]: memory and disk-footprint would again increase by 50 %
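A small sketch of the idea in (5), storing a high surrogate in a single byte. Note the assumption: it uses the standard UTF-16 high-surrogate base 0xd800 and restores the value with a bitwise OR, which may differ from the constants quoted above. Class and method names are hypothetical.

```java
final class SurrogateSketch {

    static final int HIGH_BASE = 0xd800; // first high surrogate in UTF-16

    /** Packs the high surrogate of a supplementary code point into one byte.
     *  Only valid while the offset from HIGH_BASE fits in 8 bits. */
    static byte packHigh(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        return (byte) (high - HIGH_BASE);
    }

    /** Restores the high surrogate from its packed byte. */
    static char unpackHigh(byte packed) {
        return (char) (HIGH_BASE | (packed & 0xff));
    }
}
```

For code points in U+20000..U+2FFFF the high surrogates lie in 0xd840..0xd87f, so the packed offset needs only 7 bits.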

(6) Change parameters of decode() method:
    static void decode(byte[] src, char[] dst, int sp, int sl, int dp, int dl, int p)
    ("beta" approach) speeds up buffer access + avoids c1, c1 buffering
    Benefit[12]: increase performance
    Disadvantage[3]: need different methods for direct buffers

(7) Provide 4-way fork from de/encodeLoop():
    See: https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/SingleByteEncoder_new.java?rev=&view=markup
    Benefit[13]: increase performance, if there is only 1 direct buffer
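The 4-way fork in (7) could be dispatched roughly as follows. This is only a sketch of the branching, not the code from the linked file; the class name, method name and path labels are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

final class ForkSketch {

    /** Picks one of four loop variants, depending on which side has an accessible array. */
    static String choosePath(ByteBuffer src, CharBuffer dst) {
        boolean srcArray = src.hasArray();
        boolean dstArray = dst.hasArray();
        if (srcArray && dstArray) return "array->array";   // fastest: plain array loop
        if (srcArray)             return "array->buffer";  // only dst is direct
        if (dstArray)             return "buffer->array";  // only src is direct
        return "buffer->buffer";                           // both direct
    }
}
```

A 2-way fork would drop to the slow buffer loop as soon as either side is direct; the two mixed cases are where the extra branches pay off.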

(8) Quit coders xBufferLoop by exception on xflow:
    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6806227
    Benefit[14]: increase performance
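As I read (8), the loop drops the per-element overflow check and lets the exception end it. A minimal sketch on plain arrays, assuming a trivial ISO-8859-1 style mapping; the class and method are hypothetical and not taken from the bug report.

```java
final class OverflowSketch {

    /** Copies bytes to chars until src is consumed or dst is full.
     *  Returns the number of bytes consumed. */
    static int decode(byte[] src, char[] dst) {
        int i = 0;
        try {
            for (; i < src.length; i++) {
                // No "is dst full?" test here: the array access itself checks the bound.
                dst[i] = (char) (src[i] & 0xff);
            }
        } catch (ArrayIndexOutOfBoundsException overflow) {
            // dst is exhausted; i still points at the first byte that did not fit
        }
        return i;
    }
}
```

The caller can compare the return value with src.length to tell overflow from normal completion.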

(9) Get rid of sun.io package dependency:
    https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
    Benefit[15]: avoid superfluous disk-footprint
    Benefit[16]: save maintenance of sun.io converters
    Disadvantage[4]: published under JRL (waiting for launch of OpenJDK-7 project "charset-enhancement") ;-)

*** Encoder-Suggestions (not complete, just some thoughts):

(11) Initialize encoder mappings lazily:
    Benefit[17]: increase startup performance for decoder
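One common way to get the lazy initialization of (11) is the holder-class idiom, sketched below. This is a general Java technique, not necessarily what the charset code would use; all names and the table size are hypothetical, and initCount exists only to make the effect visible.

```java
final class LazyEncoderSketch {

    static int initCount = 0; // for demonstration only

    // The JVM initializes Holder on first access to Holder.c2b, not earlier.
    private static final class Holder {
        static final char[] c2b = build();
    }

    private static char[] build() {
        initCount++;
        return new char[0x10000]; // placeholder for the real encoder table
    }

    /** Decoder path: never touches the encoder table. */
    static char decodeOnly(int b) {
        return (char) (b & 0xff);
    }

    /** Encoder path: the first call triggers the table build. */
    static char encode(char c) {
        return Holder.c2b[c];
    }
}
```

A program that only decodes never pays for building the encoder table.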

(12) Generate mappings for surrogate pairs:
    Benefit[18]: save encoding from surrogate pairs (I guess, this would significantly increase performance)

(13) Introduce 16-bit intermediate mapping ("beta"-thoughts: overall count of code points is < 65536):
    Benefit[19]: avoid superfluous memory and disk-footprint

-Ulf


