[Python-Dev] unicode_internal codec and the PEP 393

Victor Stinner victor.stinner at haypocalc.com
Wed Nov 9 22:49:35 CET 2011


On Wednesday 9 November 2011 at 22:03:52, you wrote:

> > Should we:
> >  * Drop this codec (public and documented, but I don't know if it is used)
> >  * Use wchar_t* (Py_UNICODE*) to provide a result similar to Python 3.2,
> >    and so fix the decoder to handle surrogate pairs
> >  * Use the real representation (Py_UCS1*, Py_UCS2* or Py_UCS4* string)
>
> It's described as "Return the internal representation of the operand".
> That would suggest that the last choice (i.e. return the real internal
> representation) would be best, except that this doesn't round-trip.
> Adding a prefix byte indicating the kind (and perhaps also the ASCII
> flag) would then be closest to the real representation.
>
> As that is likely not very useful, and might break some applications
> of the encoding (if there are any at all) which might expect to pass
> unicode-internal strings across Python versions, I would then also
> deprecate the encoding.
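
To illustrate the round-trip problem raised in the quoted reply, here is a minimal Python sketch. It is only an emulation: kind_of, encode_with_kind and decode_with_kind are hypothetical names, not CPython APIs, and the latin-1 / utf-16-le / utf-32-le codecs merely stand in for the 1-, 2- and 4-byte PEP 393 storage. Lone surrogates are not handled.

```python
# Hypothetical emulation of a "kind-prefixed" internal representation.
# None of these helpers exist in CPython; the names are illustrative only.

# Map the PEP 393 kind (bytes per code point) to a fixed-width stand-in codec.
_KIND_CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}


def kind_of(text):
    """Return 1, 2 or 4: the narrowest width that holds every code point."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def encode_with_kind(text):
    """Prefix the encoded payload with one byte giving the kind."""
    kind = kind_of(text)
    return bytes([kind]) + text.encode(_KIND_CODECS[kind])


def decode_with_kind(data):
    """Read the kind byte, then decode the payload with the matching codec."""
    return data[1:].decode(_KIND_CODECS[data[0]])
```

Without the prefix byte, the two bytes b"ab" could be two 1-byte characters ("ab") or a single 2-byte character (U+6261), which is why the raw representation alone does not round-trip.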

After a quick search on Google codesearch (before it disappears!), I don't think that "encoding" a Unicode string to its internal PEP 393 representation would satisfy any program. wchar_t* looks like a better candidate: programs probably use unicode_internal to decode strings coming from libraries that use wchar_t* (and not PyUnicodeObject).
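
If the real use case is decoding wchar_t* data, the same result can probably be obtained without unicode_internal. A minimal sketch, assuming the bytes hold native-endian wchar_t units with no BOM and no trailing NUL; decode_wchar_bytes is a hypothetical helper name:

```python
import ctypes
import sys

# Size of wchar_t on this platform (commonly 2 on Windows, 4 on Linux).
WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)


def decode_wchar_bytes(data):
    """Decode bytes made of native-endian wchar_t units (hypothetical helper)."""
    suffix = "le" if sys.byteorder == "little" else "be"
    # A 2-byte wchar_t is treated as UTF-16 (surrogate pairs are combined),
    # a 4-byte wchar_t as UTF-32.
    codec = {2: "utf-16-", 4: "utf-32-"}[WCHAR_SIZE] + suffix
    return data.decode(codec)


# Usage: take the raw memory of a wchar_t buffer and drop the NUL terminator.
buf = ctypes.create_unicode_buffer("h\u00e9llo")
raw = ctypes.string_at(buf, ctypes.sizeof(buf))
decoded = decode_wchar_bytes(raw[:-WCHAR_SIZE])
```

Naming the byte order explicitly avoids relying on how the generic utf-16 / utf-32 codecs behave when no BOM is present.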

taskcoach, drag & drop code using wxPython:

 data = self.__thunderbirdMailDataObject.GetData()
 # We expect the data to be encoded with 'unicode_internal',
 # but on Fedora it can also be 'utf-16', be prepared:
 try:
     data = data.decode('unicode_internal')
 except UnicodeDecodeError:
     data = data.decode('utf-16')

=> the result of __thunderbirdMailDataObject.GetData() should be a Unicode string, not bytes

hydrat, tokenizer:

 def bytes(str):
     return filter(lambda x: x != '\x00', str.encode('unicode_internal'))

=> this algorithm is really strange...
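
To see why this looks strange, the helper can be emulated in Python 3. This is a sketch, not the hydrat code: strip_nul_bytes is a hypothetical name, and utf-32-le / utf-16-le are used as stand-ins for the internal form of a wide and a narrow build on a little-endian machine.

```python
def strip_nul_bytes(text, codec="utf-32-le"):
    """Encode text with a fixed-width stand-in codec and drop every NUL byte."""
    return bytes(b for b in text.encode(codec) if b != 0)


# ASCII text comes out as its plain ASCII bytes...
ascii_result = strip_nul_bytes("abc")
# ...but U+0100 and U+0001 collapse to the same single byte.
collision = strip_nul_bytes("\u0100") == strip_nul_bytes("\u0001")
# Non-BMP text gives different bytes depending on the build width.
wide = strip_nul_bytes("\U0001F600", "utf-32-le")
narrow = strip_nul_bytes("\U0001F600", "utf-16-le")
```

For pure ASCII input the result equals the ASCII bytes, but other characters lose information, and the output differs between narrow and wide builds.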

djebel, fscache/rst.py:

 class RstDocument(object):
     ...
     def __init__(self, path, options={}):
         opts = {'input_encoding': 'euc-jp',
                 'output_encoding': 'unicode_internal',
                 'doctitle_xform': True,
                 'file_insertion_enabled': True}
         ...
         doctree = core.publish_doctree(source=file(path, 'rb').read(),
                                        ...,
                                        settings_overrides=opts)
         ...
         content = parts['html_body'] or u''
         if not isinstance(content, unicode):
             content = unicode(content, 'unicode_internal')
         if not isinstance(title, unicode):
             title = unicode(title, 'unicode_internal')
         ...

=> I don't understand this code

Victor


