[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Tue Sep 18 06:56:37 CEST 2007


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

> > When a codec encounters something it can't handle, whether it's a
> > valid character in a legacy encoding, a private use character in a
> > UTF, or an invalid sequence of code units, it throws an exception
> > specifying the character or code unit and the current coded
> > character set,

> Does this mean that this:
>     $ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650")
> would no longer print e650 in a UTF-8 locale?

What do you mean "no longer"? Look:

chibi:MacPorts steve$ export LC_ALL=en_US.UTF-8
chibi:MacPorts steve$ python -c 'import sys; print("%s" % sys.argv[1])' $(printf "\ue650")
\ue650
chibi:MacPorts steve$ python -c 'import sys; print("%x" % ord(sys.argv[1]))' $(printf "\ue650")
Traceback (most recent call last):
  File "<string>", line 1, in ?
TypeError: ord() expected a character, but string of length 6 found
chibi:MacPorts steve$
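
The traceback follows from what the shell handed over, not from any decoding step: this printf passed the escape through unexpanded, so Python received six characters rather than one. A minimal sketch of that reading, assuming a Python 3 interpreter; the names `arg` and `expanded` are hypothetical stand-ins for sys.argv[1] and are not part of the session above:

```python
# Stand-in for sys.argv[1] when printf does not expand \u escapes:
# a backslash followed by "ue650", six characters in all.
arg = "\\ue650"

print(len(arg))  # six characters, not one

try:
    ord(arg)
except TypeError as exc:
    # ord() accepts only a single character.
    print("TypeError:", exc)

# What the command would have received had the escape been expanded.
expanded = "\ue650"
print("%x" % ord(expanded))
```

Only the second case, where the argument really is the single character U+E650, gives the e650 that the question expects.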

Note that some people are currently arguing that sys.argv should be an array of bytes objects, and Guido has not yet said "no". In that case, all of the current proposals should have exactly this result.
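
A sketch of why a bytes argv would lead to the same TypeError, assuming Python 3 bytes semantics and a UTF-8 locale; `raw` is a hypothetical stand-in for one element of a bytes sys.argv, not an API that exists:

```python
# U+E650 as a UTF-8 locale would deliver it on the command line.
raw = "\ue650".encode("utf-8")

print(len(raw))  # three bytes, so ord() cannot take it directly

try:
    ord(raw)
except TypeError as exc:
    print("TypeError:", exc)

# Decoding explicitly yields a single character again.
text = raw.decode("utf-8")
print("%x" % ord(text))
```

Under that design the caller, not the interpreter, chooses the encoding used to turn the argument back into text.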

My position is that if you do something that depends on the internal representation of implementation-dependent objects, you deserve whatever results you get.
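
For comparison, the strict behaviour quoted at the top resembles what Python 3's existing codecs already do for an invalid sequence of code units. A minimal sketch, assuming Python 3; the byte string is made up for illustration, and this shows the existing UnicodeDecodeError, not the proposed treatment of private use characters:

```python
bad = b"abc\xffdef"  # 0xff can never appear in well-formed UTF-8

details = None
try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    # The exception records the codec name and the offending code units.
    details = (exc.encoding, exc.start, exc.end, exc.object[exc.start:exc.end])
    print(exc.reason)

print(details)
```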


