[Python-3000] BOM handling

Antoine Pitrou solipsis at pitrou.net
Wed Sep 13 22:33:22 CEST 2006


On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:

> And is generally ignored, as per unicode spec; it's a "zero width non-breaking space" - an invisible character with no effect on wrapping or otherwise.

Well, it would be better if Py3K (with all strings being unicode) made things easy for the programmer and abstracted away those "invisible characters with no textual meaning". Currently that's not the case:

>>> import codecs
>>> a = "hello".decode("utf-8")
>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False

>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False
>>> a
u'hello'
>>> b
u'\ufeffhello'
>>> print a
hello
>>> print b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
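For comparison, here is a minimal sketch of how the same situation might be handled at decode time. It assumes a current Python 3 interpreter rather than the Python 2.4 used above, and relies on the standard "utf-8-sig" and "utf-16" codecs. The strip_bom helper is hypothetical, written only for illustration.

```python
import codecs

# Byte strings that start with a BOM, as in the sessions above.
raw_utf8 = codecs.BOM_UTF8 + "hello".encode("utf-8")
raw_utf16 = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character,
# while "utf-8-sig" drops it if present.
kept_utf8 = raw_utf8.decode("utf-8")
stripped_utf8 = raw_utf8.decode("utf-8-sig")

# An explicit-endianness codec keeps the BOM; the generic "utf-16" codec
# consumes it to detect the byte order.
kept_utf16 = raw_utf16.decode("utf-16-le")
stripped_utf16 = raw_utf16.decode("utf-16")


def strip_bom(text):
    """Hypothetical helper: drop one leading U+FEFF from decoded text."""
    return text[1:] if text.startswith("\ufeff") else text


print(len(kept_utf8), len(stripped_utf8))
print(len(kept_utf16), len(stripped_utf16))
print(strip_bom(kept_utf16) == stripped_utf16)
```

With these codecs the choice of whether the BOM survives is made by the codec name, so the decoded strings compare equal only when a BOM-aware codec (or a helper such as strip_bom) is used.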

Regards

Antoine.


