[Python-Dev] Dicts are broken Was: unicode hell/mixing str and unicode as dictionary keys
mal mal at lemburg.com
Tue Aug 8 09:22:01 CEST 2006
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>> Python just doesn't know the encoding of the 8-bit string, so can't make any assumptions on it. As a result, it raises an exception to inform the programmer.
>
> Oh, Python does make an assumption what the encoding is: it assumes it is the system encoding (i.e. "ascii"). Then invoking the ascii codec raises an exception, because the string clearly isn't ascii.
Right, and as a consequence, Python raises an exception to let the programmer correct the problem.
The subsequent solution to the problem may result in the string being decoded into Unicode and the two resulting Unicode objects being unequal, or it may also result in them being equal. Python doesn't have this knowledge, so always returning false is clearly wrong.
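As a rough illustration of this point, the sketch below uses Python 3 `bytes` as a stand-in for the 8-bit `str` discussed here (an assumption, since the thread is about Python 2). The byte value and the two codecs are arbitrary choices for the example: the ASCII assumption fails, and the outcome of the comparison then depends entirely on which encoding the programmer picks.

```python
# Python 3 sketch; bytes stands in for the 8-bit str of Python 2.
raw = b"\xe4"      # one non-ASCII byte whose encoding is unknown
text = "\u00e4"    # the character LATIN SMALL LETTER A WITH DIAERESIS


def decodes_as_ascii(data):
    """Return True if data can be decoded with the ASCII codec."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# The default assumption (ASCII) does not hold for this byte string.
ascii_ok = decodes_as_ascii(raw)

# Two explicit decisions by the programmer give two different answers.
equal_as_latin1 = raw.decode("latin-1") == text
equal_as_cyrillic = raw.decode("iso8859-5") == text

print(ascii_ok, equal_as_latin1, equal_as_cyrillic)
```

Only the explicit decoding step carries the knowledge needed to decide equality; the comparison itself cannot supply it.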
Hiding programmer errors is not making life easier in the long run, so I'm -1 on having the equality comparison return False.
Instead we should generate a warning in Python 2.5 and introduce the exception in Python 2.6.
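A warn-first transition of this kind can be sketched in user code with the `warnings` module. The helper `text_equals` below is hypothetical and is not how the interpreter implements comparisons; it only shows how one call site can move from "warn and return False" to "raise" by changing the warning filter. It is written for Python 3, with `bytes` standing in for the 8-bit string.

```python
import warnings


def text_equals(u, b, encoding="ascii"):
    """Hypothetical helper: compare text u with bytes b.

    If b cannot be decoded, emit a UnicodeWarning and report inequality.
    """
    try:
        return b.decode(encoding) == u
    except UnicodeDecodeError:
        warnings.warn(
            "could not decode bytes for comparison; treating as unequal",
            UnicodeWarning,
            stacklevel=2,
        )
        return False


# Stage 1: the comparison warns but still returns a value.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = text_equals("\u00e4", b"\xe4")

# Stage 2: the same warning is escalated to an exception by a filter.
with warnings.catch_warnings():
    warnings.simplefilter("error", UnicodeWarning)
    try:
        text_equals("\u00e4", b"\xe4")
        raised = False
    except UnicodeWarning:
        raised = True

print(result, len(caught), raised)
```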
>> Note that you do have to interpret the string as characters
>> if you compare it to Unicode and there's nothing wrong with
>> that.
>
> Consider this:
>
> py> int(3+4j)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: can't convert complex to int; use int(abs(z))
> py> 3 == 3+4j
> False
>
> So even though the conversion raises an exception, the values are determined to be not equal. Again, because int is a nearly true subset of complex, the conversion goes the other way, but if it would use the complex->int conversion, then the TypeError should be taken as a guarantee that the objects don't compare equal.

In the above example, you clearly know that the two are unequal due to the relationship between complex numbers having an imaginary part and integers.
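The complex/int case can be checked with a short script. This is a sketch for Python 3, where the exact TypeError message differs from the one quoted above, so only the exception type is tested.

```python
z = 3 + 4j

# Narrowing complex -> int is refused outright.
try:
    int(z)
    conversion_failed = False
except TypeError:
    conversion_failed = True

# For the comparison the int is widened to complex instead,
# so no exception occurs and the result is well defined.
values_equal = (3 == z)

# With a zero imaginary part the widened values do compare equal.
same_when_imag_zero = (3 == 3 + 0j)

print(conversion_failed, values_equal, same_when_imag_zero)
```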
The same is true for the overflow case:
>>> 2**10000 == 1.23
False
>>> float(2**10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: long int too large to convert to float
(Note that in Python 2.3 this used to raise an exception as well.)
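The overflow case behaves the same way in Python 3 and can be reproduced as follows. This is a minimal sketch; the variable names are chosen only for the example.

```python
big = 2 ** 10000

# The explicit conversion cannot be represented as a float.
try:
    float(big)
    overflowed = False
except OverflowError:
    overflowed = True

# The comparison does not need the conversion to succeed: the
# magnitudes alone show that the two numbers differ.
compares_equal = (big == 1.23)

print(overflowed, compares_equal)
```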
However, this is not the case for 8-bit string vs. Unicode, since you cannot use such extra knowledge if you find that the ASCII encoding assumption obviously doesn't match the string in question.
> Expanding this view to Unicode should mean that a unicode string U equals a byte string B if U.encode(systemencoding) == B or B.decode(systemencoding) == U, and that they don't equal otherwise
Agreed.
Note that Python always coerces to the "bigger" type. As a result, the second option is what is actually implemented in Python.
> (e.g. if the conversion fails with a "not convertible" exception).
I disagree with this part.
Failure to decode a string doesn't imply inequality. It implies that the programmer needs to step in and correct the problem by making an explicit and conscious decision.
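The two positions can be put side by side as a pair of hypothetical helper functions. Neither is an actual interpreter API; `equal_lenient` follows the "not convertible means not equal" rule, while `equal_strict` lets the decoding error reach the programmer. Both decode the byte string, matching the coercion to the "bigger" type described above, and use Python 3 `bytes` as a stand-in for the 8-bit string.

```python
def equal_lenient(u, b, encoding="ascii"):
    """Treat a failed decode as inequality (hypothetical helper)."""
    try:
        return b.decode(encoding) == u
    except UnicodeDecodeError:
        return False


def equal_strict(u, b, encoding="ascii"):
    """Let a failed decode propagate to the caller (hypothetical helper)."""
    return b.decode(encoding) == u


# For pure ASCII data the two policies agree.
agree_on_ascii = equal_lenient("abc", b"abc") == equal_strict("abc", b"abc")

# For non-ASCII bytes they diverge: one answers, the other raises.
lenient_answer = equal_lenient("\u00e4", b"\xe4")
try:
    equal_strict("\u00e4", b"\xe4")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

print(agree_on_ascii, lenient_answer, strict_raised)
```

Under the strict policy the caller has to pass an explicit `encoding`, which is the conscious decision argued for here.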
The alternative would be to decide that equal comparisons should never be allowed to raise exceptions and instead have the equal comparison return False. In which case, we'd have to revert the dict patch altogether and instead silence all exceptions that are generated during the equal comparison (not only in the dict implementation), replacing them with a False return value.
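The effect on dictionary lookups can be sketched with a key type whose equality test fails. `StrictKey` and `SilentKey` are hypothetical classes written for this illustration, and the constant hash is only there to force the dict to call `__eq__`. In current CPython the exception raised by `__eq__` propagates out of the lookup; the second class shows what silencing it would look like.

```python
class StrictKey:
    """Hypothetical key type whose equality test can fail."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 0  # same hash for every key, so lookups must call __eq__

    def __eq__(self, other):
        raise ValueError("cannot compare keys")


class SilentKey(StrictKey):
    """Variant that turns a failed comparison into inequality."""

    __hash__ = StrictKey.__hash__  # overriding __eq__ would otherwise unset it

    def __eq__(self, other):
        try:
            return super().__eq__(other)
        except ValueError:
            return False


# Lookup with the strict key: the comparison error reaches the caller.
strict_table = {StrictKey("a"): 1}
try:
    strict_table[StrictKey("b")]
    outcome = "found"
except ValueError:
    outcome = "comparison error propagated"
except KeyError:
    outcome = "treated as missing"

# Lookup with the silent key: the error is hidden and the key looks absent.
silent_table = {SilentKey("a"): 1}
silent_result = silent_table.get(SilentKey("b"))

print(outcome, silent_result)
```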
-- Marc-Andre Lemburg eGenix.com