[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Mon Aug 7 16:57:15 CEST 2006


Michael Foord wrote:

Martin v. Löwis wrote:

[snip..] Expanding this view to Unicode should mean that a unicode string U equals a byte string B if U.encode(system_encode) == B or B.decode(system_encoding) == U, and that they don't equal otherwise (e.g. if the conversion fails with a "not convertible" exception).

I disagree. Unicode strings should always be considered distinct from non-ASCII byte strings. Implicitly encoding or decoding in order to perform a comparison is a bad idea; it is expensive and will often do the wrong thing.

The programmer should explicitly encode the Unicode string or decode the byte string before comparison (which one of these is correct is application-dependent).
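A minimal sketch of the two explicit conversions, written for Python 3, where `str` is the Unicode string type and `bytes` is the byte string type (the thread itself concerns Python 2's `unicode` and `str`). The helper names `equal_as_text` and `equal_as_bytes` are hypothetical and exist only for this illustration; the point is that the caller names the encoding and the direction instead of leaving either to the interpreter.

```python
# Python 3 sketch: the caller picks the encoding and the direction explicitly.
text = "caf\u00e9"             # Unicode string with precomposed é (U+00E9)
data = text.encode("utf-8")    # byte string, as it might arrive from a file


def equal_as_text(u, b, encoding):
    """Decode the byte string, then compare Unicode with Unicode."""
    return b.decode(encoding) == u


def equal_as_bytes(u, b, encoding):
    """Encode the Unicode string, then compare bytes with bytes."""
    return u.encode(encoding) == b


same_text = equal_as_text(text, data, "utf-8")
same_bytes = equal_as_bytes(text, data, "utf-8")

# Guessing the wrong encoding does not raise here (latin-1 can decode any
# byte sequence); it just silently yields a different string.
wrong_guess = equal_as_text(text, data, "latin-1")
```

With a wrong encoding the comparison fails without any error, which is one way an implicit conversion can "do the wrong thing" unnoticed.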

Which of the two conversions is selected is arbitrary; [...]

It would not be arbitrary. In the common case where the byte encoding uses "precomposed" characters, using "U.encode(system_encoding) == B" will tend to succeed in more cases than "B.decode(system_encoding) == U", because alternative representations of the same abstract character in Unicode will be mapped to the same precomposed character.

(Whether these are cases in which the comparison should succeed is, as I said above, application-dependent.)
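The asymmetry can be sketched in Python 3 with Latin-1 as the precomposed byte encoding. One assumption needs stating loudly: Python's own `latin-1` codec does not fold alternative representations (encoding "e" followed by U+0301 raises UnicodeEncodeError), so the hypothetical helper `encode_normalized` below applies NFC normalization first to stand in for an encoder that does map them to the precomposed character.

```python
import unicodedata

precomposed = "\u00e9"    # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# Two representations of the same abstract character, unequal as strings.
distinct_code_points = precomposed != decomposed

target = precomposed.encode("latin-1")    # the byte string b"\xe9"

# Decode direction: the bytes decode to the precomposed form only, so a
# decomposed Unicode string never matches.
decode_matches_decomposed = target.decode("latin-1") == decomposed


def encode_normalized(u, encoding):
    """Assumed encoder behaviour: compose (NFC) first, then encode."""
    return unicodedata.normalize("NFC", u).encode(encoding)


# Encode direction: both representations land on the same byte.
encode_matches_precomposed = encode_normalized(precomposed, "latin-1") == target
encode_matches_decomposed = encode_normalized(decomposed, "latin-1") == target
```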

The special case of considering US-ASCII strings to compare equal to the corresponding Unicode string, is more reasonable than this would be for a general byte encoding, because:

- it can be done with no (or only a trivial) conversion,
- US-ASCII has no precomposed characters or combining marks, so it does not have multiple encodings for the same abstract character,
- Unicode has a US-ASCII subset that uses exactly the same encoding model as US-ASCII (whereas in general, a byte encoding might use an arbitrarily different encoding model to Unicode, as for example is the case for ISCII).
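These properties of US-ASCII can be checked directly. The following is a small Python 3 sketch using only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

ascii_bytes = bytes(range(128))            # every US-ASCII byte value
ascii_text = ascii_bytes.decode("ascii")   # the corresponding Unicode string

# Trivial conversion: each byte value is the Unicode code point number.
same_numbers = [ord(ch) for ch in ascii_text] == list(ascii_bytes)

# No combining marks: every character has canonical combining class 0.
no_combining = all(unicodedata.combining(ch) == 0 for ch in ascii_text)

# No alternative representations: composing or decomposing changes nothing.
stable_under_normalization = all(
    unicodedata.normalize(form, ascii_text) == ascii_text
    for form in ("NFC", "NFD")
)
```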

we should, of course, continue to use the one we always used (for "ascii", there is no difference between the two).

+1 This seems the most (only ?) logical solution.

No; always considering Unicode and non-ASCII byte strings to be distinct is just as logical.
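For the dictionary-key case in the subject line, the "always distinct" rule looks like the sketch below. It relies on Python 3 semantics, where `str` and `bytes` never compare equal; note that Python 3 applies this even to pure-ASCII content, which goes further than the ASCII special case discussed above.

```python
# Python 3 sketch: a Unicode key and a byte-string key are separate entries,
# and no implicit encoding or decoding happens during lookup.
table = {}
table["abc"] = "unicode key"
table[b"abc"] = "bytes key"

two_entries = len(table) == 2
never_equal = ("abc" == b"abc") is False

# A lookup only succeeds after an explicit conversion chosen by the caller.
raw = b"abc"
value = table[raw.decode("ascii")]
```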

-- David Hopwood <david.nospam.hopwood at blueyonder.co.uk>
