Issue 713820: iconv_codec NG - Python tracker
This new implementation of iconv_codec resolves the following problems in the current implementation:
- Reentrancy is unsafe: the encoder and/or decoder can be entered at several levels at the same time, because a PEP 293 codec error callback can itself start another iconv encoding session. Every encode/decode session must therefore open its own iconv session, but the current implementation shares one iconv session for the whole lifetime of the codec.
- StreamReader can't work correctly: because iconv keeps its conversion state private, StreamReader cannot work properly on top of the encode/decode functions alone. Also, the handling of EINVAL and the passing of pending characters from previous data to the error callback are very weak in the current implementation.
- Emitting a bare '?' as the replacement character is not safe for many encodings: for stateful encodings and non-byte-stream encodings, even the replacement character has to be encoded through iconv.
- Encoding names containing - or uppercase letters can't be used: because the codec subsystem changes - to _ and uppercase letters to lowercase, such names can't be passed to the iconv_codec module without loss. For example, the following aliases are needed to use CJK encodings with Sun iconv:
    # simplified chinese
    "euc_cn": "zh_CN.euc", "iso_2022_zh": "zh_CN.iso2022-CN", "gbk": "zh_CN.gbk", "cp935": "zh_CN-cp935",
    # traditional chinese
    "euc_tw": "zh_TW.euc", "iso_2022_tw": "zh_TW.iso2022-7", "big5": "zh_TW.big5", "cp937": "zh_TW.cp937",
    # japanese
    "iso_2022_jp": "ISO-2022-JP", "euc_jp": "eucJP", "shift_jis": "PCK",
    # korean
    "euc_kr": "ko_KR.euc", "iso_2022_kr": "ISO-2022-KR", "johab": "ko_KR.johap", "cp932": "ko_KR.cp932", "cp949": "ko_KR.cp949",
- Can't try multiple Unicode encodings or methods: on some iconv implementations, such as those of HP-UX or Solaris, UCS2 -> ISO-8859-1 is available but UCS2 -> euc-kr isn't available; only UTF-8 -> euc-kr is.
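The name-mangling problem in the list above can be observed from Python itself. The following is only a minimal sketch: `spy`, `seen`, `ICONV_ALIASES` and `iconv_name` are hypothetical names, not part of iconv_codec. The exact normalization differs between Python versions, so the sketch only relies on the original spelling not surviving the lookup.

```python
import codecs

seen = []  # hypothetical: names exactly as the registry hands them to a search function

def spy(name):
    """Record the name the codec registry passes in, then decline the lookup."""
    seen.append(name)
    return None

codecs.register(spy)
try:
    codecs.lookup("zh_CN.iso2022-CN")
except LookupError:
    pass  # expected: no codec claims this name

# The registry has already folded the case before any codec module sees the name.
print(seen[-1])

# Hypothetical alias table: Python-side name -> exact name the platform iconv expects.
ICONV_ALIASES = {
    "iso_2022_zh": "zh_CN.iso2022-CN",
    "shift_jis": "PCK",
}

def iconv_name(python_name):
    """Map a Python-style encoding name back to the spelling iconv needs."""
    key = python_name.lower().replace("-", "_")
    return ICONV_ALIASES.get(key, python_name)

print(iconv_name("ISO-2022-ZH"))
```

An explicit table like this is one way to recover the case and punctuation that the codec subsystem discards; names without an entry are passed through unchanged.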
In addition, many multibyte codecs, such as the CJK and iconv codecs, would otherwise duplicate the code for processing error callbacks and handling streams, so I split that code out into a separate source file, multibytecodec.c. The CJK and iconv codecs can share it at the source level by putting multibytecodec.c in Modules/ and linking the file into each of the codecs. Alternatively, if multibytecodec.c goes into Python/ and is linked into the main Python library, the codecs can be compiled and loaded on their own. multibytecodec.c is a common multibyte codec framework that can be used for any ordinary multibyte encoding. With it, a codec writer can create a codec for a multibyte encoding without having to handle error callbacks or implement the StreamReader structure. I wrote the CJK codecs using it and will submit them as a patch in a separate patch report.
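The two jobs that such a shared framework takes over, error callbacks and streaming, can be illustrated at the Python level. This is a rough sketch only: it uses Python's own `ascii` and `euc_kr` codecs rather than iconv, and the handler name `sketch_escape` and the function `escape_unencodable` are made up for the example. The callback deliberately starts a nested encode, which is the reentrant situation described in the first problem above.

```python
import codecs

def escape_unencodable(exc):
    """Hypothetical PEP 293 handler: replace unencodable text with an escape."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Nested encode: a second encoding session starts while the outer
    # ascii encode is still in progress.
    replacement = bad.encode("unicode_escape").decode("ascii")
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end

codecs.register_error("sketch_escape", escape_unencodable)
escaped = "a\u20acb".encode("ascii", "sketch_escape")
print(escaped)

# Streaming: a multibyte character split across two chunks has to be kept
# as pending input until the rest of its bytes arrive.
data = "\ud55c\uad6d".encode("euc_kr")
decoder = codecs.getincrementaldecoder("euc_kr")()
first = decoder.decode(data[:1])              # only the lead byte so far
rest = decoder.decode(data[1:], final=True)   # remaining bytes complete the text
print(repr(first), repr(rest))
```

If the codec kept a single shared conversion state, the nested encode inside the callback could disturb the outer session, which is why each session needs its own context.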