Rewrite of IBM doublebyte charsets

Ulf Zibis Ulf.Zibis at gmx.de
Thu May 14 20:14:30 UTC 2009

Now I have time to answer in more detail ...

On 12.05.2009 08:30, Xueming Shen wrote:

For (2), I'm not convinced that this approach is appropriate for a complicated charset like EUCTW. Given the number of arrays it carries, the recovery work (tracing back to what went wrong and then returning the appropriate CoderResult) would be complicated and redundant.

Well, checking the range twice is also redundant (the JVM additionally checks it behind the scenes on every array access).
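To make the two variants under discussion concrete, here is a minimal sketch, not the code from the webrev. The class name, the tiny one-segment table and the UNMAPPABLE marker are all hypothetical. The first method checks the range explicitly before indexing; the second relies only on the bounds check the JVM performs anyway and recovers in a catch block.

```java
final class RangeCheckSketch {
    // Hypothetical marker for bytes that have no mapping.
    static final char UNMAPPABLE = '\uFFFD';

    // Hypothetical one-segment table covering bytes 0x40..0x47.
    static final int LOW = 0x40;
    static final char[] TABLE = "ABCDEFGH".toCharArray();

    // Variant 1: explicit range check, followed by the JVM's own check.
    static char decodeExplicit(int b) {
        int i = b - LOW;
        if (i < 0 || i >= TABLE.length) {
            return UNMAPPABLE;
        }
        return TABLE[i];
    }

    // Variant 2: rely on the JVM's bounds check and recover afterwards.
    static char decodeImplicit(int b) {
        try {
            return TABLE[b - LOW];
        } catch (ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE;
        }
    }
}
```

With a single table the recovery is trivial. With many arrays, as in EUCTW, the catch block would have to work out which access failed, which is the complication raised above.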

This might have the benefit of saving the range check (I don't have any data to show how much we could gain from doing this, only a guess), but given that almost all segments are nearly "full", I don't see a benefit on the footprint side. We need some hard data to support this approach, which I don't have for now. I would leave this one to you for further optimization in your project.

Yes, that's a good idea. I would be happy if it were launched in the near future ...

I have updated the webrev to address some of your other optimization suggestions

Happy to see that. :-)

(1) No, I don't think we want to store the supplementary characters as surrogate pairs; this is what I'm trying to fix. We don't care about the performance of surrogates: those code points are rarely used, 99%+ of encoding/decoding happens in the BMP, and we did not have the supplementary characters for the first couple of years. (OK, I'm a native, I don't think I can even read those characters.)

This is what I didn't know. My assumption was that those supplementary characters would be used regularly, as their number is 137 % of the BMP character count. But if they are so rarely used, wouldn't it be reasonable to split the mapping into 2 or even 3 chunks, with a base chunk of about 10 % of the BMP? Your native status would help to discover those ~10 %. ;-) Well, such an optimization would ideally be placed in the mentioned project.
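A rough sketch of what I mean by chunks, with everything in it assumed for illustration: the class name, the chunk boundaries and the byte codes are made up and are not real EUC-TW data. A small "hot" chunk is always present, while the rarely needed chunk is only built when a code point from its range first shows up.

```java
final class ChunkedMappingSketch {
    static final int UNMAPPABLE = -1;

    // Hypothetical hot chunk: small, always loaded. Placeholder byte codes.
    static final int HOT_BASE = 0x4E00;
    static final int[] HOT = { 0x0101, 0x0102, 0x0103 };

    // Hypothetical rare chunk: built only on first use.
    static final int RARE_BASE = 0x20000;
    static final int RARE_LENGTH = 2;
    private static int[] rare;

    // Counts how often the rare chunk was built; only here to show laziness.
    static int rareLoads = 0;

    private static synchronized int[] rare() {
        if (rare == null) {
            rare = new int[] { 0x0201, 0x0202 }; // placeholder byte codes
            rareLoads++;
        }
        return rare;
    }

    // Looks up a code point: hot chunk first, rare chunk only if needed.
    static int encode(int codePoint) {
        int i = codePoint - HOT_BASE;
        if (i >= 0 && i < HOT.length) {
            return HOT[i];
        }
        i = codePoint - RARE_BASE;
        if (i >= 0 && i < RARE_LENGTH) {
            return rare()[i];
        }
        return UNMAPPABLE;
    }
}
```

Whether this pays off depends on how well the hot chunk matches real text, which is exactly the hard data still missing.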

(2) The initialization of the c2b data for the encoder is already deferred ("lazied") until the Encoder class gets loaded.

Oops, I overlooked this fact. ;-)
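For readers following along, the mechanism can be sketched as below. This is a simplified illustration, not the code from the webrev; the class names, the three-entry table and the c2bBuilt flag are hypothetical. It relies only on the fact that Java initializes a class on first use, so the static initializer of the nested Encoder class does not run while only decoding takes place.

```java
final class LazyC2bSketch {
    // Hypothetical byte-to-char table, available as soon as the outer class is used.
    static final char[] B2C = { 'A', 'B', 'C' };

    // Only here to make the moment of initialization observable.
    static boolean c2bBuilt = false;

    static char decode(int b) {
        return B2C[b];
    }

    static final class Encoder {
        // Char-to-byte table, derived from B2C when Encoder is first used.
        static final int[] C2B = new int[128];

        static {
            java.util.Arrays.fill(C2B, -1);
            for (int b = 0; b < B2C.length; b++) {
                C2B[B2C[b]] = b;
            }
            c2bBuilt = true;
        }

        // Returns the byte value for c, or -1 if c has no mapping.
        static int encode(char c) {
            return c < C2B.length ? C2B[c] : -1;
        }
    }
}
```

Calling only decode leaves c2bBuilt false; the first call to Encoder.encode builds the table.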

-Ulf

More information about the core-libs-dev mailing list