Codereview request for 6653797: Reimplement JDK charset repository charsets.jar
Xueming Shen xueming.shen at oracle.com
Mon Jul 16 16:59:13 UTC 2012
On 7/16/2012 9:30 AM, Ulf Zibis wrote:
Hi Sherman,
as I just said for 7183053, I can't look into the details right now, as I do not have a suitable environment installed. Just one comment: I think it should be possible to share part of the mapping data across charsets, so that charsets.jar would shrink even further?
Yes, it might be desirable to share some of the mappings, especially among the variants. But as I suggested at the very beginning of the project, the priority for now is to move all the charsets to the new mapping-based, build-time-generated implementation. That opens the door for new optimizations: any improvement to the base classes and the "generator" tools (sharing the mappings, for example) will then be picked up by all the sub-classes/classes.

While it would be ideal to achieve all the goals in one shot, our resource constraints really do not allow me to spend most of my time on this (regenerating the mappings takes time, and I have to test from various angles to make sure it does not break anything or miss any corner case). This is more of a side project for now (sure, I do have a JEP for it, but...), and I just found two "spare" weeks to push these two RFEs out. I might have more time for charsets later, around the end of the JDK8 development stage.
-Sherman
-Ulf
On 16.07.2012 00:12, Xueming Shen wrote: Hi
This changeset migrates our JIS0201/0208/0212-based single-/double-byte charsets to the new mapping-based implementation. It is the leftover of the effort we put into JDK7 to re-implement most of the charsets in charsets.jar for (1) better performance, (2) a smaller storage footprint and (3) a lighter maintenance burden.

http://cr.openjdk.java.net/~sherman/6653797/webrev/

Notes on the implementation:

(1) jis0201/0208/0212 and their variants are now generated from the mapping tables at build time. (See the new *.map, *.nr and *.c2b tables.)

(2) EUCJP/LINUXOPEN, SJIS, PCK, ISO2022JP and its variants now use these new jis0201/0208/0212 charsets.

(3) The files shown in red in the webrev are the old implementation. Since no charset uses them anymore, I removed them from the repository.

(4) There are two approaches for PCK and SJIS. PCK.javav1 and SJIS.javav1 follow the old implementation, which decodes/encodes based on the jis0201/0208 (and variants) mappings via Ken's algorithm. This is known to be slow and buggy (the algorithm does not handle illegal sjis code points, see #6653797 and http://cr.openjdk.java.net/~sherman/6653797/Sjis2Jis.java). So in this changeset I generated direct mapping tables for sjis and pck and use the "general" DoubleByte base class for these two charsets. This results in much faster decoding/encoding and correct mapping for all code points. The downside of this approach is that it adds about 50K (uncompressed) to charsets.jar. But given that this change removes about 300K from rt.jar, we still get a net saving of about 250K, so I decided to go with this approach for better performance.

The webrev appears to contain a lot of files (80+), but that number includes the removed old implementation and the tests I added to guarantee identical decoding/encoding results from the old and new implementations (the OLD... test cases); the change is actually not that big :-) So please help review.

I can then put this multi-year effort to rest. -Sherman
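
For readers unfamiliar with the problem described in note (4), here is a minimal sketch of why an arithmetic Shift_JIS to JIS X 0208 conversion can mis-handle illegal byte pairs. It assumes that "Ken's algorithm" is the widely published arithmetic conversion; the class and method names (SjisToJisSketch, toJis0208, isLegalPair) are hypothetical and are not taken from the webrev or from the old JDK sources.

```java
// Illustrative sketch only; names are hypothetical, not from the JDK sources.
final class SjisToJisSketch {

    // Arithmetic Shift_JIS -> JIS X 0208 conversion with NO validation.
    // Returns (row byte << 8) | cell byte.
    static int toJis0208(int lead, int trail) {
        int adjust = trail < 0x9F ? 1 : 0;           // first or second row of the pair
        int rowOffset = lead < 0xA0 ? 0x70 : 0xB0;   // lead block 0x81-0x9F vs 0xE0-0xEF
        int cellOffset = adjust == 1 ? (trail > 0x7F ? 0x20 : 0x1F) : 0x7E;
        int row = ((lead - rowOffset) << 1) - adjust;
        int cell = trail - cellOffset;
        return (row << 8) | cell;
    }

    // Range check for the JIS X 0208 part of Shift_JIS:
    // lead 0x81-0x9F or 0xE0-0xEF, trail 0x40-0x7E or 0x80-0xFC.
    static boolean isLegalPair(int lead, int trail) {
        boolean leadOk = (lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xEF);
        boolean trailOk = (trail >= 0x40 && trail <= 0x7E) || (trail >= 0x80 && trail <= 0xFC);
        return leadOk && trailOk;
    }

    public static void main(String[] args) {
        // Legal pair 0x82 0xA0 (HIRAGANA LETTER A) maps to JIS 0x2422.
        System.out.printf("%04X%n", toJis0208(0x82, 0xA0)); // prints 2422
        // 0x7F is not a legal trail byte, yet the arithmetic maps it to the
        // same JIS code as the legal pair 0x81 0x80.
        System.out.printf("%04X%n", toJis0208(0x81, 0x7F)); // prints 2160
        System.out.printf("%04X%n", toJis0208(0x81, 0x80)); // prints 2160
        System.out.println(isLegalPair(0x81, 0x7F));        // prints false
    }
}
```

A direct mapping table sidesteps this: a byte pair with no table entry simply has no mapping, so no separate range check is needed to reject it.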