[Python-Dev] Dicts are broken Was: unicode hell/mixing str and unicode as dictionary keys

"Martin v. Löwis" martin at v.loewis.de
Tue Aug 8 09:56:53 CEST 2006


M.-A. Lemburg wrote:

Hiding programmer errors is not making life easier in the long run, so I'm -1 on having the equality comparison return False.

There is no error to hide here. The objects are unequal, period.

Instead we should generate a warning in Python 2.5 and introduce the exception in Python 2.6.

A warning about what? That you can't put byte strings and Unicode strings into the same dictionary (as keys)? Next, will we stop allowing numbers and strings in the same dictionary because there is no conversion defined between them?
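To make the numbers-and-strings comparison concrete, here is a minimal sketch. It is written for Python 3 (the thread itself concerns Python 2.x), and the dictionary name `mixed` is only illustrative.

```python
# Keys of unrelated types can live side by side in one dictionary.
# Lookup only needs hash() and ==, and == between an int and a str
# returns False instead of raising.
mixed = {3: "int key", "3": "str key"}

print(len(mixed))   # 2 distinct keys
print(mixed[3])     # int key
print(mixed["3"])   # str key
print("3" == 3)     # False
```

No conversion between `int` and `str` is attempted here; the two keys are simply treated as unequal.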

In the above example, you clearly know that the two are unequal because of the relationship between integers and complex numbers that have an imaginary part.

Right. And so I do when the byte string does not convert to Unicode.

However, this is not the case for 8-bit string vs. Unicode, since you cannot use such extra knowledge if you find that the ASCII encoding assumption obviously doesn't match the string in question.

The question is not "Could there be a conversion under which they are equal?" If you ask that question, then

py> "3"==3
False

should raise an exception, because there exists a conversion under which these objects are equal:

py> int("3")==3
True

It's just that, under the conversion Python applies, the byte string and the Unicode string are not equal.

Note that Python always coerces to the "bigger" type. As a result, the second option is what is actually implemented in Python. [which is decode-to-unicode]

It might be debatable which of the types is the "bigger" type. It's not that byte strings are a true subset of Unicode strings, under some conversion, since there are byte strings which have no Unicode equivalent (because they are not characters, and don't convert under the encoding), and there are Unicode strings that have no byte string equivalent.

For example, if the system encoding is UTF-8, then byte string is the bigger type (all Unicode strings convert to byte strings, but not all byte strings convert to Unicode strings).
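The asymmetry between the two types can be checked directly. The following is a rough sketch for Python 3, where `bytes` and `str` stand in for the 2.x byte string and Unicode string; the helper names `can_decode` and `can_encode` are hypothetical. Note that Python 3's strict UTF-8 codec rejects lone surrogates, so "all Unicode strings convert" holds there only for well-formed text.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if the byte string has a character equivalent."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def can_encode(text: str, encoding: str) -> bool:
    """Return True if the character string has a byte equivalent."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


euro = "\u20ac"    # EURO SIGN, outside ASCII
stray = b"\xff"    # a byte that is valid in neither ASCII nor UTF-8

# Under ASCII, neither type is a subset of the other.
print(can_decode(stray, "ascii"))   # False
print(can_encode(euro, "ascii"))    # False

# Under UTF-8, the text encodes, but the stray byte still fails to decode.
print(can_encode(euro, "utf-8"))    # True
print(can_decode(stray, "utf-8"))   # False
```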

However, this is a red herring: Python has, for whatever reason, chosen to convert byte->unicode, and nobody is questioning that choice.

I disagree with this part.

Failure to decode a string doesn't imply inequality.

If the failure is "these bytes don't have a meaningful character interpretation", then the bytes are clearly not equal to some character string.

It implies that the programmer needs to step in and correct the problem by making an explicit and conscious decision.

There is no problem to correct. The strings are unequal.

The alternative would be to decide that equal comparisons should never be allowed to raise exceptions and instead have the equal comparison return False.

There are many reasons why comparison could raise an exception. It could be out of memory, it could be that there is an internal/programming error in the codec being used, it could be that the codec is not found (likewise for other comparisons).

However, if the codec is working properly, and clearly determines that the byte string has no character string equivalent, then it can't be equal to some character (unicode) string.
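The semantics argued for here can be sketched as a small function. This is an illustration in Python 3, not the interpreter's actual comparison code; the name `bytes_equal_text` and its `encoding` parameter are assumptions made for the example.

```python
def bytes_equal_text(data: bytes, text: str, encoding: str = "ascii") -> bool:
    """Compare a byte string with a character string by decoding first.

    Hypothetical helper: a decoding failure means "no character
    equivalent", which is reported as inequality rather than an error.
    """
    try:
        decoded = data.decode(encoding)
    except UnicodeDecodeError:
        # The codec worked properly and found no character equivalent,
        # so the two objects cannot be equal.
        return False
    # Other failures (for example LookupError for an unknown codec, or
    # MemoryError) are deliberately not caught and still propagate.
    return decoded == text


print(bytes_equal_text(b"abc", "abc"))     # True
print(bytes_equal_text(b"\xff", "\xff"))   # False
```

Only the "does not decode" case is mapped to `False`; a missing or broken codec still surfaces as an exception, matching the distinction drawn above.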

Regards, Martin


