Rewrite of IBM doublebyte charsets
Xueming Shen Xueming.Shen at Sun.COM
Tue May 12 06:30:53 UTC 2009
Ulf, thanks for taking the time to review the changes.
It seems most of the suggestions are based on two approaches you adopted in your charset-improvement project:
(1) extract the mapping data into a binary data file and load it in as a resource file during initialization of the charset decoder/encoder
(2) rely entirely on catching IndexOutOfBoundsException, avoiding explicit sanity checks on array indices when accessing the mapping data
For (1), as I explained last time, the extra 30% size saving is attractive, but the magnitude of the charset startup/initialization regression (reading data from a resource input stream is much slower than loading the data as a constant String in a class file) is not acceptable. I do have an implementation that takes that approach; I even sent the webrev to Alan and Masayoshi for review, then "recalled" the review request after realizing the startup regression. That was 6 months ago (I uploaded the webrev at http://cr.openjdk.java.net/~sherman/Old_newEUC_TW). So extracting the data out is not an option for me for now.
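For reference, the String-constant approach looks roughly like this (class and field names below are made up for illustration, not the actual EUCTWMapping layout):

    // Sketch only: the mapping String is generated at build time and lives
    // in the class file's constant pool, so it is loaded together with the
    // class itself -- no separate resource stream to open and parse.
    class MappingData {
        static final String B2C_STR = "\u4E00\u4E59\uFFFD";   // truncated for the sketch
    }

    class Decoder {
        private static final char[] b2c = MappingData.B2C_STR.toCharArray();

        char decode(int index) {
            return b2c[index];   // plain array lookup on the hot path
        }
    }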
For (2), I'm not convinced this approach is appropriate for a charset as complicated as EUC_TW. Given the number of arrays it carries, the recovery work (tracing back to what went wrong and then returning the appropriate CoderResult) would be complicated and redundant. It might save the range checks (I don't have any data showing how much we would gain from that, only a guess), but given that almost all segments are nearly "full", I don't see a benefit on the footprint side either. We would need hard data to support this approach, which I don't have for now, so I'll leave this one to you for further optimization in your project.
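To make the trade-off concrete, here is a rough sketch of the two lookup styles (hypothetical names, not code from either webrev):

    // (a) Explicit sanity check: the failure path is local and it is easy to
    //     report the right CoderResult for the offending input.
    static char lookupChecked(char[] table, int index) {
        if (index < 0 || index >= table.length)
            return '\uFFFD';   // caller reports the malformed/unmappable input
        return table[index];
    }

    // (b) Exception-driven: skip the check on the hot path and let the VM's
    //     implicit bounds check raise IndexOutOfBoundsException. With several
    //     tables in play, the catch block has to reconstruct which lookup
    //     failed before it can return the appropriate CoderResult.
    static char lookupUnchecked(char[] table, int index) {
        try {
            return table[index];
        } catch (IndexOutOfBoundsException ioobe) {
            return '\uFFFD';
        }
    }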
I have updated the webrev to address some of your other optimization suggestions:
(1) simplify the plane-number byte check in the decoder by adding a new static array cnspToIndex[16] (sketched below)
(2) only do the isLegalDB check in the decoder when the lookup fails (assuming most bytes fed in are legal)
(3) store the plane number as 0x0, 0x2..0x7 and 0xf in the encoder (this helps save the plane-number check during encoding)
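Roughly, the cnspToIndex idea in (1) looks like this (the exact index values are illustrative):

    // Map the CNS plane byte that follows SS2 (0x8e) straight to a table
    // index; -1 marks planes with no mapping data. One array load replaces
    // a chain of range/equality tests on the plane byte.
    static final byte[] cnspToIndex = new byte[16];
    static {
        java.util.Arrays.fill(cnspToIndex, (byte) -1);
        cnspToIndex[0x2] = 1;   // byte 0xa2
        cnspToIndex[0x3] = 2;   // byte 0xa3
        cnspToIndex[0x4] = 3;
        cnspToIndex[0x5] = 4;
        cnspToIndex[0x6] = 5;
        cnspToIndex[0x7] = 6;
        cnspToIndex[0xf] = 7;   // byte 0xaf
    }

    // planeByte is the byte after 0x8e; a bad high nibble still has to be
    // rejected by the deferred isLegalDB check when the lookup fails.
    static int planeIndex(int planeByte) {
        return cnspToIndex[planeByte & 0xf];
    }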
Some things I don't think I'm going to do:
(1) No, I don't think we want to store the supplementary characters as surrogate pairs; that is what I'm trying to fix. We don't care much about the performance of surrogates: those code points are RARELY used, 99%+ of encoding/decoding happens in the BMP, and we did not even have the supplementary characters for the first couple of years. (OK, I'm a native speaker, and I don't think I can even read those characters.)
(2) The initialization of the c2b data for the encoder is already "lazy": it does not happen until the Encoder class gets loaded.
(3) I doubt using a BitSet for b2cIsSupp would save any memory or be faster; b2cIsSupp already uses every bit of its storage (sketched below).
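For (3), b2cIsSupp is essentially a hand-packed bit array, along these lines (illustrative sketch, not the exact field layout in the decoder):

    // One bit per 94*94 mapping slot: set when the mapped character is
    // really a supplementary code point stored in a side table.
    class SuppFlags {
        private final byte[] b2cIsSupp = new byte[(94 * 94 + 7) / 8];

        void set(int index) {
            b2cIsSupp[index >> 3] |= (1 << (index & 7));
        }

        boolean isSupp(int index) {
            return (b2cIsSupp[index >> 3] & (1 << (index & 7))) != 0;
        }
    }

A java.util.BitSet would wrap the same bits in a long[] plus an object header and method calls, so it cannot get smaller or meaningfully faster than this.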
I gave the latest version a "quick" run, and I'm happy with the performance gain, especially for encoding.
-client
Encoding TimeRatio EUC_TW_OLD/x-EUC-TW: 84927,10151 :8.366368d
Decoding TimeRatio EUC_TW_OLD/x-EUC-TW: 13162,9237 :1.424922d
Encoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 94829,19950 :4.753333d
Decoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 21884,17992 :1.216318d
-server
Encoding TimeRatio EUC_TW_OLD/x-EUC-TW: 74651,12218 :6.109920d
Decoding TimeRatio EUC_TW_OLD/x-EUC-TW: 11686,10364 :1.127557d
Encoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 79058,12187 :6.487076d
Decoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 13518,10198 :1.325554d
The latest webrev is at
http://cr.openjdk.java.net/~sherman/6831794_6229811/webrev/
Thanks, Sherman
Ulf Zibis wrote:
Completed ...:
*** Decoder-Suggestions:
(10) Split map data files into chunks and load lazy. TW native speakers must be consulted, to define reasonable chunks!
     Benefit[17]: save startup time
     Benefit[18]: save memory
(11) Use java.util.BitSet for b2cIsSupp
     Benefit[19]: save memory, maybe faster

*** Encoder-Suggestions:
(21) Initialize encoder mappings lazy, maybe split into reasonable chunks:
     Benefit[21]: increase startup performance for de/encoder
(21) Save c2b and c2bPlane in 2-dimensional array: char[][] c2b = new char[0x100][]
     only instantiate actually used segments: c2b[x] = new char[0x100]
     Benefit[22]: save lookup and calculation of index, but add 1 indirection
     Benefit[23]: save range-check for segment index (catch malformed segment index by NPE)
     Benefit[24]: save c2bIndex
(22) In case of surrogate code points, use high surrogate (8 lower bits) as segment index: char[][] c2bSupp = new char[0x100][]
     only instantiate actually used segments: c2bSupp[x] = new char[0x400]
     Benefit[25]: save encoding to UC4 from surrogate pairs (I guess, this would significantly increase performance)
     Benefit[26]: save lookup and calculation of index, but add 1 indirection
     Benefit[27]: save range-check for segment index (catch malformed segment index by NPE)
     Benefit[28]: save c2bSuppIndex
(23) Truncate c2b segments: c2b[x] = new char[usedLength] (usedLength values could be generated and saved in EUCTWMapping or data file)
     Benefit[29]: avoid superfluous memory and disk-footprint (I guess ~30 %)
     Benefit[30]: don't range-check in-segment index, catch unmappable index by IndexOutOfBoundsException
(24) Additionally truncate leading unmappables in c2b segments, and host offsets:
     Benefit[31]: avoid another superfluous memory and disk-footprint (I guess ~10 %)
     Disadvantage[21]: needs hosting of offsets: 256 bytes
(25) Concerning (23),(24): Check out best segment size (maybe 256 is not optimal):
     Benefit[32]: avoid another superfluous memory and disk-footprint (I guess 10-20 %)
(26) Concerning (22),(23),(24): maybe use 3-dim array and check out best segment size (maybe 10 bit is not optimal):
     Benefit[33]: avoid another superfluous memory and disk-footprint (I guess 10-20 %)
(27) Save Plane no. as 0x0, 0x2 .. 0x7 and 0xf:
     Benefit[34]: simplify calculation of 2nd byte, increases performance
(28) Save 2nd byte in c2bPlane directly (0xa2 .. 0xa7 and 0xaf) instead of Plane no.:
     Benefit[35]: save calculation of 2nd byte, increases performance
     Disadvantage[22]: increases c2bPlane by ~73%

-Ulf

EUCTW statistics (updated):
Plane    range       length   segments    segments-usage-ratio
0        a1a1-fdcb    5868    5d =  93    66 %
0        a1a1-a744     434     7 =   7    65 %
1        c2a1-fdcb    5434    3c =  60    95 %
1:8ea2       -f2c4    7650    52 =  82    98 %
2:8ea3       -e7aa    6394    47 =  71    95 %
3:8ea4       -eedc    7286    4e =  78    98 %
4:8ea5       -fcd1    8601    5c =  92    98 %
5:8ea6       -e4fa    6385    44 =  68    99 %
6:8ea7       -ebd5    6532    4b =  75    98 %
7:8eaf       -edb9    8721    4d =  77    92 %
Sum:                 55446   262 = 610

max b1 range: 5d = 93
max b2 range: 5e = 94

memory amount for all segments (not truncated): 610 * 94 = 57,340 code points
truncated -4 % : ~55,000 code points
decoder surrogate mapping (*3): ~165,000 bytes

disk-footprint of EUCTWMapping (1. Approach from Sherman):
b2c          : 8 * 94 * 94 * 2.97 = 209,943 Bytes
b2cIsSuppStr :     94 * 94 * 1.48 =  13,077
c2bIndex     :          256 * 7   =   1,792
c2bSuppIndex :          256 * 7   =   1,792
Sum                               ~ 227,000 Bytes

memory of EUCTW (1. Approach from Sherman):
b2c          : 8 * 94 * 94 * 2 = 141,376 Bytes
b2cIsSupp    :     94 * 94     =   8,836
decoder sum  :                   150,212
c2b          :   31744 * 2     =  63,488
c2bIndex     :     256 * 2     =     512
c2bSupp      :   43520 * 2     =  87,040
c2bSuppIndex :     256 * 2     =     512
c2bPlane     :   43520 * 1     =  43,520
encoder sum  :                   195,072
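(For context, the segmented c2b layout Ulf suggests in (21) would look roughly like this; the names are hypothetical, and Ulf's variant would drop the null check and catch the NPE instead.)

    // Two-level encoder table: first dimension indexed by the high byte of
    // the char, second by the low byte; segments are created only when a
    // mapping inside them is actually added.
    class SegmentedC2B {
        private final char[][] c2b = new char[0x100][];

        void put(char c, char dbBytes) {            // dbBytes: the two EUC bytes packed in a char
            int seg = c >> 8;
            if (c2b[seg] == null) {
                c2b[seg] = new char[0x100];
                java.util.Arrays.fill(c2b[seg], '\uFFFD');   // unmappable marker
            }
            c2b[seg][c & 0xff] = dbBytes;
        }

        char get(char c) {
            char[] seg = c2b[c >> 8];
            return seg == null ? '\uFFFD' : seg[c & 0xff];
        }
    }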