Rewrite of IBM doublebyte charsets

Ulf Zibis Ulf.Zibis at gmx.de
Sun May 10 22:57:26 UTC 2009


Completed ...:

*** Decoder-Suggestions:

(10) Split map data files into chunks and load them lazily.
     TW native speakers must be consulted to define reasonable chunks!
     Benefit[17]: save startup time
     Benefit[18]: save memory
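A minimal sketch of what lazy chunk loading for (10) could look like. The per-plane chunking, the class name LazyPlanes and the loader callback are assumptions made for illustration only; the real chunk boundaries would still have to be defined as described above.

```java
import java.util.Arrays;
import java.util.function.IntFunction;

// Hypothetical sketch: one decoder chunk per plane, loaded on first use.
// Not thread-safe as written; a real charset would need safe publication.
final class LazyPlanes {
    private final char[][] b2c = new char[8][];   // 8 chunks of 94 * 94 chars
    private final IntFunction<char[]> loader;     // assumed: reads one chunk from the data file
    int loads;                                    // counts loader calls, for demonstration only

    LazyPlanes(IntFunction<char[]> loader) {
        this.loader = loader;
    }

    char decode(int plane, int b1, int b2) {
        char[] table = b2c[plane];
        if (table == null) {                      // first access to this chunk
            table = loader.apply(plane);
            b2c[plane] = table;
            loads++;
        }
        return table[(b1 - 0xa1) * 94 + (b2 - 0xa1)];
    }

    public static void main(String[] args) {
        // Dummy loader: fills each chunk with a letter derived from the plane index.
        LazyPlanes p = new LazyPlanes(plane -> {
            char[] t = new char[94 * 94];
            Arrays.fill(t, (char) ('A' + plane));
            return t;
        });
        System.out.println(p.decode(2, 0xa1, 0xa1)); // C
        System.out.println(p.loads);                 // 1
    }
}
```

Chunks that are never touched cost neither load time nor memory, which is where Benefit[17] and Benefit[18] would come from.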

(11) Use java.util.BitSet for b2cIsSupp
     Benefit[19]: save memory, maybe faster
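A minimal sketch of (11), assuming b2cIsSupp is a flag table with one entry per (b1, b2) cell of a 94 * 94 grid. The class name SuppFlags and the index calculation are hypothetical.

```java
import java.util.BitSet;

// Hypothetical sketch: one bit per (b1, b2) cell instead of one byte.
final class SuppFlags {
    static final int ROWS = 94;   // b1 in 0xa1..0xfe
    static final int COLS = 94;   // b2 in 0xa1..0xfe
    private final BitSet isSupp = new BitSet(ROWS * COLS);

    private static int index(int b1, int b2) {
        return (b1 - 0xa1) * COLS + (b2 - 0xa1);
    }

    void markSupp(int b1, int b2) {
        isSupp.set(index(b1, b2));
    }

    boolean isSupp(int b1, int b2) {
        return isSupp.get(index(b1, b2));
    }

    public static void main(String[] args) {
        SuppFlags f = new SuppFlags();
        f.markSupp(0xa1, 0xa2);
        System.out.println(f.isSupp(0xa1, 0xa2)); // true
        System.out.println(f.isSupp(0xa1, 0xa3)); // false
    }
}
```

With 8,836 flags this needs roughly 1.1 KB of bit storage instead of 8,836 bytes; whether it is also faster would have to be measured.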

*** Encoder-Suggestions:

(21) Initialize encoder mappings lazily, maybe split into reasonable chunks:
     Benefit[21]: increase startup performance for de/encoder

(21) Save c2b and c2bPlane in a 2-dimensional array:
         char[][] c2b = new char[0x100][]
     and only instantiate the segments actually used:
         c2b[x] = new char[0x100]
     Benefit[22]: save lookup and calculation of index, but add 1 indirection
     Benefit[23]: save range-check for segment index (catch malformed segment index by NPE)
     Benefit[24]: save c2bIndex
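The two-dimensional layout can be sketched as follows. The class name SegmentedC2B, the put() helper and the 0xFFFD "unmappable" marker are assumptions for illustration, and the sample mapping value is not checked against the real EUC_TW table.

```java
import java.util.Arrays;

// Hypothetical sketch: segments are indexed by the high byte of the char
// and allocated only when a mapping in that segment exists.
final class SegmentedC2B {
    static final char UNMAPPABLE = (char) 0xFFFD;   // assumed marker value
    private final char[][] c2b = new char[0x100][];

    void put(char c, char bytes) {
        char[] seg = c2b[c >> 8];
        if (seg == null) {                          // instantiate on first use
            seg = new char[0x100];
            Arrays.fill(seg, UNMAPPABLE);
            c2b[c >> 8] = seg;
        }
        seg[c & 0xff] = bytes;
    }

    char encode(char c) {
        try {
            // No index table and no range check on the segment index.
            return c2b[c >> 8][c & 0xff];
        } catch (NullPointerException e) {
            return UNMAPPABLE;                      // segment was never instantiated
        }
    }

    public static void main(String[] args) {
        SegmentedC2B t = new SegmentedC2B();
        t.put((char) 0x4e00, (char) 0xc4a1);        // illustrative value only
        System.out.println(Integer.toHexString(t.encode((char) 0x4e00))); // c4a1
    }
}
```

The fast path is two array loads; the NullPointerException path is only taken for chars in segments that contain no mappings at all.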

(22) In case of surrogate code points, use the high surrogate (8 lower bits) as segment index:
         char[][] c2bSupp = new char[0x100][]
     and only instantiate the segments actually used:
         c2bSupp[x] = new char[0x400]
     Benefit[25]: save the conversion of surrogate pairs to UC4 (I guess this would significantly increase performance)
     Benefit[26]: save lookup and calculation of index, but add 1 indirection
     Benefit[27]: save range-check for segment index (catch malformed segment index by NPE)
     Benefit[28]: save c2bSuppIndex
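A sketch of the surrogate variant, assuming the caller has already validated the surrogate pair. The class name, the put() helper and the 0xFFFD marker are hypothetical. Using only the 8 lower bits restricts the table to high surrogates D800..D8FF, i.e. code points U+10000..U+4FFFF; this sketch checks that explicitly and uses a null test instead of catching the NPE.

```java
import java.util.Arrays;

// Hypothetical sketch: index by surrogate halves directly, without first
// combining them into a 32-bit code point.
final class SegmentedC2BSupp {
    static final char UNMAPPABLE = (char) 0xFFFD;   // assumed marker value
    private final char[][] c2bSupp = new char[0x100][];

    void put(char hi, char lo, char bytes) {
        int seg = hi & 0xff;                        // 8 lower bits of the high surrogate
        if (c2bSupp[seg] == null) {
            c2bSupp[seg] = new char[0x400];         // 10 bits of the low surrogate
            Arrays.fill(c2bSupp[seg], UNMAPPABLE);
        }
        c2bSupp[seg][lo & 0x3ff] = bytes;
    }

    char encode(char hi, char lo) {
        if ((hi >> 8) != 0xd8) {
            return UNMAPPABLE;                      // outside D800..D8FF
        }
        char[] seg = c2bSupp[hi & 0xff];
        return seg == null ? UNMAPPABLE : seg[lo & 0x3ff];
    }

    public static void main(String[] args) {
        SegmentedC2BSupp t = new SegmentedC2BSupp();
        char[] pair = Character.toChars(0x20000);   // D840 DC00
        t.put(pair[0], pair[1], (char) 0xa1a1);     // illustrative value only
        System.out.println(Integer.toHexString(t.encode(pair[0], pair[1]))); // a1a1
    }
}
```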

(23) Truncate c2b segments:
         c2b[x] = new char[usedLength]
     (usedLength values could be generated and saved in EUC_TWMapping or in the data file)
     Benefit[29]: avoid superfluous memory and disk footprint (I guess ~30 %)
     Benefit[30]: no range check for the in-segment index; catch an unmappable index by IndexOutOfBoundsException

(24) Additionally truncate leading unmappables in c2b segments, and store the offsets:
     Benefit[31]: avoid another superfluous memory and disk footprint (I guess ~10 %)
     Disadvantage[21]: needs storage for the offsets: 256 bytes
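Suggestions (23) and (24) combined could look roughly like this. The class name, the putSegment() helper and the 0xFFFD marker are assumptions; gaps inside a truncated segment would still need a marker entry.

```java
// Hypothetical sketch: each segment stores only the range between its first
// and last mappable entry; one offset byte per segment records where it starts.
final class TruncatedSegments {
    static final char UNMAPPABLE = (char) 0xFFFD;   // assumed marker value
    private final char[][] c2b = new char[0x100][];
    private final byte[] offset = new byte[0x100];  // the 256 bytes of Disadvantage[21]

    // Install a segment whose first entry corresponds to low byte 'first'.
    void putSegment(int hiByte, int first, char[] data) {
        c2b[hiByte] = data;
        offset[hiByte] = (byte) first;
    }

    char encode(char c) {
        int hi = c >> 8;
        try {
            // A negative or too large index means "unmappable".
            return c2b[hi][(c & 0xff) - (offset[hi] & 0xff)];
        } catch (NullPointerException | ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE;
        }
    }

    public static void main(String[] args) {
        TruncatedSegments t = new TruncatedSegments();
        // Illustrative values only: segment 0x4e holds entries for 0x4e10 and 0x4e11.
        t.putSegment(0x4e, 0x10, new char[] {0xc4a1, 0xc4a2});
        System.out.println(Integer.toHexString(t.encode((char) 0x4e11))); // c4a2
    }
}
```

Whether exception handling on the unmappable path is cheaper overall than an explicit range check depends on how often unmappable input occurs, so this would need benchmarking.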

(25) Concerning (23),(24): Find the best segment size (maybe 256 is not optimal):
     Benefit[32]: avoid another superfluous memory and disk footprint (I guess 10-20 %)

(26) Concerning (22),(23),(24): Maybe use a 3-dimensional array and find the best segment size (maybe 10 bits is not optimal):
     Benefit[33]: avoid another superfluous memory and disk footprint (I guess 10-20 %)

(27) Save Plane no. as 0x0, 0x2 .. 0x7 and 0xf:
     Benefit[34]: simplify calculation of 2nd byte, increases performance

(28) Save 2nd byte in c2bPlane directly (0xa2 .. 0xa7 and 0xaf) instead of Plane no.:
     Benefit[35]: save calculation of 2nd byte, increases performance
     Disadvantage[22]: increases c2bPlane by ~73%
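To illustrate (27) and (28): the sketch below assumes that c2bPlane currently holds a dense internal index 1..7 (as in the first column of the statistics table), so that index 7 needs special handling to produce 0xaf. That assumption, the class name and the method names are hypothetical.

```java
// Hypothetical sketch of how the 2nd byte of a 4-byte EUC_TW sequence
// (0x8e, 2nd byte, b1, b2) could be derived from the stored plane value.
final class PlaneByte {
    // Assumed current scheme: dense index 1..7 maps to 0xa2..0xa7, 0xaf.
    static int secondByteFromIndex(int idx) {
        return idx == 7 ? 0xaf : 0xa1 + idx;
    }

    // Suggestion (27): store 0x2..0x7 or 0xf, so a single OR is enough.
    static int secondByteFromPlane(int plane) {
        return 0xa0 | plane;
    }

    // Suggestion (28): store the 2nd byte itself; nothing left to calculate.
    static int secondByteStored(byte stored) {
        return stored & 0xff;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(secondByteFromIndex(7)));   // af
        System.out.println(Integer.toHexString(secondByteFromPlane(0xf))); // af
        System.out.println(Integer.toHexString(secondByteStored((byte) 0xaf))); // af
    }
}
```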

-Ulf

EUC_TW statistics (updated):

Plane    range       length   segments    segments-usage-ratio

0        a1a1-fdcb     5868   5d = 93     66 %
  _0     a1a1-a744      434    7 =  7     65 %
  _1     c2a1-fdcb     5434   3c = 60     95 %

1:8ea2       -f2c4     7650   52 = 82     98 %
2:8ea3       -e7aa     6394   47 = 71     95 %
3:8ea4       -eedc     7286   4e = 78     98 %
4:8ea5       -fcd1     8601   5c = 92     98 %
5:8ea6       -e4fa     6385   44 = 68     99 %
6:8ea7       -ebd5     6532   4b = 75     98 %
7:8eaf       -edb9     8721   4d = 77     92 %

Sum:                  55446   262 = 610

max b1 range: 5d = 93
max b2 range: 5e = 94

memory amount for all segments (not truncated):
    610 * 94                        =  57,340 code points
    truncated -4 %                  : ~55,000 code points
    decoder surrogate mapping (*3)  : ~165,000 bytes

disk-footprint of EUC_TWMapping (1. Approach from Sherman):
    b2c          : 8 * 94 * 94 * 2.97 = 209,943 Bytes
    b2cIsSuppStr :     94 * 94 * 1.48 =  13,077
    c2bIndex     :            256 * 7 =   1,792
    c2bSuppIndex :            256 * 7 =   1,792
    Sum                                ~227,000 Bytes

memory of EUC_TW (1. Approach from Sherman):
    b2c          : 8 * 94 * 94 * 2 = 141,376 Bytes
    b2cIsSupp    :         94 * 94 =   8,836
    decoder sum  :                   150,212

    c2b          :       31744 * 2 =  63,488
    c2bIndex     :         256 * 2 =     512
    c2bSupp      :       43520 * 2 =  87,040
    c2bSuppIndex :         256 * 2 =     512
    c2bPlane     :       43520 * 1 =  43,520
    encoder sum  :                   195,072
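The memory sums above can be re-derived with a few lines of arithmetic. The class and method names are hypothetical; all factors are taken from the figures in this mail.

```java
// Re-computes the memory figures quoted in this mail from their factors.
final class FootprintCheck {
    static int decoderBytes() {
        int b2c = 8 * 94 * 94 * 2;        // 8 tables of 94 * 94 chars
        int b2cIsSupp = 94 * 94;          // one byte per cell
        return b2c + b2cIsSupp;
    }

    static int encoderBytes() {
        int c2b = 31744 * 2;
        int c2bIndex = 256 * 2;
        int c2bSupp = 43520 * 2;
        int c2bSuppIndex = 256 * 2;
        int c2bPlane = 43520;             // one byte per entry
        return c2b + c2bIndex + c2bSupp + c2bSuppIndex + c2bPlane;
    }

    static int segmentedCodePoints() {
        return 610 * 94;                  // all used segments, not truncated
    }

    public static void main(String[] args) {
        System.out.println(decoderBytes());        // 150212
        System.out.println(encoderBytes());        // 195072
        System.out.println(segmentedCodePoints()); // 57340
    }
}
```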



More information about the core-libs-dev mailing list