[Python-Dev] unicode hell/mixing str and unicode as dictionary keys

Bob Ippolito bob at redivi.com
Thu Aug 3 19:03:08 CEST 2006


On Aug 3, 2006, at 9:51 AM, M.-A. Lemburg wrote:

Ralf Schmitt wrote:

Ralf Schmitt wrote:

Still trying to port our software. here's another thing I noticed:

    d = {}
    d[u'm\xe1s'] = 1
    d['m\xe1s'] = 1
    print d

With python 2.4 I can add those two keys to the dictionary and get:

    $ python2.4 t2.py
    {u'm\xe1s': 1, 'm\xe1s': 1}

With python 2.5 I get:

    $ python2.5 t2.py
    Traceback (most recent call last):
      File "t2.py", line 3, in <module>
        d['m\xe1s'] = 1
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: ordinal not in range(128)

Is this intended behaviour? I guess this might break lots of programs and the way python 2.4 works looks right to me. I think it should be possible to mix str/unicode keys in dicts and let non-ascii strings compare not-equal to any unicode string.

Also this behaviour makes your programs break randomly, that is, it will break when the string you add hashes to the same value that the unicode string has (at least that's what I guess..)

This is because Unicode and 8-bit string keys only work in the same way if and only if they are plain ASCII.

The reason lies in the hash function used by Unicode: it is crafted to make hash(u) == hash(s) for all ASCII s, such that s == u.

For non-ASCII strings, there are no guarantees as to the hash value of the strings or whether they match or not.

This has been like that since Unicode was introduced, so it's not new in Python 2.5.

What is new is that the exception raised on "u == s" after hash
collision is no longer silently swallowed.
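
To make the hash-collision point concrete, here is a rough sketch in Python 3. It does not reproduce the str/unicode case itself; `Key` and `DecodeError` are hypothetical stand-ins whose `==` always raises, the way comparing `u'm\xe1s'` with `'m\xe1s'` does above. It assumes CPython's dict behaviour of only calling `==` on keys whose hashes are equal.

```python
class DecodeError(Exception):
    """Hypothetical stand-in for the UnicodeDecodeError in the report."""


class Key:
    """Hypothetical key with a fixed hash whose == always raises."""

    def __init__(self, label, hash_value):
        self.label = label
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        # Mimics a comparison that cannot be carried out at all.
        raise DecodeError("cannot compare %s with %s" % (self.label, other.label))


def try_insert(d, key):
    """Insert key into d and report whether the comparison error surfaced."""
    try:
        d[key] = 1
        return "ok"
    except DecodeError:
        return "raised"


d = {}
d[Key("unicode", 1)] = 1                      # empty dict: nothing to compare against
no_collision = try_insert(d, Key("str", 2))   # different hash: == is never called
collision = try_insert(d, Key("str", 1))      # equal hash: == is called and raises
print(no_collision, collision)
```

Only the insert whose hash equals that of an existing key reaches `==`, so only that one fails. That matches the "breaks randomly" observation: whether the error shows up depends on the hash values, not on the keys being of mixed type as such.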

-bob


