[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Mon Aug 7 16:57:15 CEST 2006


Michael Foord wrote:
> Martin v. Löwis wrote:
>> [snip..] Expanding this view to Unicode should mean that a unicode
>> string U equals a byte string B if U.encode(systemencoding) == B or
>> B.decode(systemencoding) == U, and that they don't equal otherwise
>> (e.g. if the conversion fails with a "not convertible" exception).
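
Spelled out as code, the rule being proposed amounts to something like
the following sketch (proposed_eq is an illustrative name of mine, not
anything CPython implements):

    def proposed_eq(u, b, encoding):
        """Martin's proposed rule: a unicode string u equals a byte
        string b if either conversion makes them match; a failed
        conversion means they are simply unequal."""
        try:
            if u.encode(encoding) == b:
                return True
        except UnicodeEncodeError:
            pass
        try:
            return b.decode(encoding) == u
        except UnicodeDecodeError:
            return False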

I disagree. Unicode strings should always be considered distinct from non-ASCII byte strings. Implicitly encoding or decoding in order to perform a comparison is a bad idea; it is expensive and will often do the wrong thing.

The programmer should explicitly encode the Unicode string or decode the byte string before comparison (which one of these is correct is application-dependent).
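
For example, under the explicit style (compare_explicitly is a
hypothetical helper; an application could just as well encode the
Unicode string instead, depending on which domain is authoritative
for it):

    def compare_explicitly(u, b, encoding):
        """Decode the byte string with the encoding the application
        knows it uses, then compare entirely in the unicode domain."""
        try:
            return b.decode(encoding) == u
        except UnicodeDecodeError:
            return False  # undecodable bytes: treat as unequal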

>> Which of the two conversions is selected is arbitrary; [...]

It would not be arbitrary. In the common case where the byte encoding uses "precomposed" characters, using "U.encode(systemencoding) == B" will tend to succeed in more cases than "B.decode(systemencoding) == U", because alternative representations of the same abstract character in Unicode will be mapped to the same precomposed character.

(Whether these are cases in which the comparison should succeed is, as I said above, application-dependent.)
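
The precomposed-character point can be made concrete with combining
characters (a small Python 2 sketch using the standard unicodedata
module; note that CPython's codecs do not normalize, so an application
that wants equivalent sequences to match must normalize explicitly):

    import unicodedata

    precomposed = u'\u00e9'   # e-acute as a single code point
    decomposed = u'e\u0301'   # 'e' followed by a combining acute accent

    # The same abstract character, but unequal code point by code point:
    assert precomposed != decomposed
    assert unicodedata.normalize('NFC', decomposed) == precomposed

    # Decoding the Latin-1 byte '\xe9' yields only the precomposed form,
    # so B.decode(...) == U fails for the decomposed U:
    b = '\xe9'
    assert b.decode('latin-1') == precomposed
    assert b.decode('latin-1') != decomposed

    # And the codec will not fold the decomposed form on encoding either;
    # the combining accent has no Latin-1 equivalent:
    try:
        decomposed.encode('latin-1')
    except UnicodeEncodeError:
        pass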

The special case of considering US-ASCII strings to compare equal to the corresponding Unicode string is more reasonable than this would be for a general byte encoding, because conversion between an ASCII byte string and the corresponding Unicode string is lossless in both directions, and each ASCII string has exactly one Unicode representation (no precomposed or combining characters are involved).

>> we should, of course, continue to use the one we always used (for
>> "ascii", there is no difference between the two).
>
> +1 This seems the most (only ?) logical solution.

No; always considering Unicode and non-ASCII byte strings to be distinct is just as logical.
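
For reference, this is roughly what CPython 2.x already does: a mixed
str/unicode comparison coerces the byte string to unicode using the
ASCII default encoding, so pure-ASCII values compare equal, while a
failed coercion leaves the operands unequal (as of Python 2.5 this is
signalled with a UnicodeWarning rather than a UnicodeDecodeError). A
quick check:

    import warnings

    # ASCII-only values coerce cleanly and compare equal:
    assert u'spam' == 'spam'

    # Non-ASCII bytes cannot be coerced; the values compare unequal, and
    # a UnicodeWarning flags the silent mismatch (raise it to see it):
    warnings.simplefilter('error', UnicodeWarning)
    try:
        u'\u00e9' == '\xe9'
    except UnicodeWarning:
        print('mixed comparison with non-ASCII bytes refused')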

-- David Hopwood <david.nospam.hopwood at blueyonder.co.uk>


