[Python-3000] bytes and dicts (was: PEP 3137: Immutable Bytes and Mutable Buffer)

Phillip J. Eby pje at telecommunity.com
Sat Sep 29 17:14:04 CEST 2007


At 07:33 AM 9/29/2007 -0700, Guido van Rossum wrote:

Until just before 3.0a1, they were unequal. We decided to raise TypeError because we noticed many bugs in code that was doing things like

    data = f.read(4096)
    if data == "": break
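For illustration only, here is a minimal sketch of the same end-of-file test written so that it does not depend on the stream type. It assumes the behaviour of the io objects in released Python 3, where read() returns an empty str or an empty bytes object at EOF. read_all is a hypothetical helper name, not something from this thread.

```python
import io

def read_all(f, chunk_size=4096):
    """Read a file-like object to exhaustion, one chunk at a time."""
    chunks = []
    while True:
        data = f.read(chunk_size)
        if not data:  # true for both "" and b"", so no str/bytes comparison
            break
        chunks.append(data)
    return chunks

# The same loop works on a text stream and on a binary stream.
text_chunks = read_all(io.StringIO("hello"), 2)
byte_chunks = read_all(io.BytesIO(b"hello"), 2)
```

The truthiness test sidesteps the comparison entirely, but it also hides whether the caller ever decided which kind of stream it has.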

Thought experiment: what if read() always returned strings, and to read bytes, you had to use something like 'f.readinto(ob, 4096)', where 'ob' is a mutable bytes instance or memory view?
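A rough sketch of what reading into a caller-supplied buffer looks like, using io.BytesIO from released Python 3 as a stand-in for a binary file. Note one assumption: the readinto() that actually shipped takes only the buffer and fills at most len(buffer) bytes, rather than the two-argument form imagined above.

```python
import io

f = io.BytesIO(b"\x00\x01\x02\x03\x04")  # stand-in for a binary file

ob = bytearray(4)        # mutable buffer supplied by the caller
n = f.readinto(ob)       # fills ob in place and returns the byte count
first = bytes(ob[:n])

n2 = f.readinto(ob)      # only one byte is left in the stream
rest = bytes(ob[:n2])    # slice by the count; ob still holds stale bytes past n2
```

Because the caller has to construct the buffer, the decision to treat the stream as bytes is made explicitly at the point of I/O.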

In Python 2.x, there's only one read() method because (prior to unicode), there was only one type of reading to do.

But as the above example makes clear, in 3.x you simply can't write code that works correctly with an arbitrary file that might be binary or text, at least not without typechecking the return value from read(). (In which case, you might as well inspect the file object.) So, the above problem could be fixed by having .read() raise an error (or simply not exist) on a binary file object.

In this way, the problem is fixed at the point where it really occurs: i.e., at the point of not having decided whether the stream is bytes or text.

This also seems to fit better (IMO) with the best practice of enforcing str/unicode/encoding distinctions at the point where data enters the program, rather than delaying the error to later.

I thought about using a warning too, but since nobody wants warnings, that would be pretty much the same as raising TypeError except for the most dedicated individuals (and if I were really dedicated I'd just write my own eq() function anyway).

The use case I'm concerned about is code that's not type-specific getting a TypeError by comparing arbitrary objects. For example, if you write Python code to create a Python code object (e.g. the compiler package or my own BytecodeAssembler), you need to create a list of constants as you generate the code, and you need to be able to search the list for an equal constant. Since strings and bytes can both be constants, a simple list.index() test could now raise a TypeError, as could "item in list".

So raising an error to make bad code fail sooner will also take down unsuspecting code that isn't really broken, and force the writing of special comparison code -- which won't be usable with things like list.remove and the "in" operator.
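To make the cost concrete, here is a sketch of the kind of special comparison code a constant table would need if str/bytes comparison raised TypeError. find_constant is a hypothetical helper, not taken from the compiler package or BytecodeAssembler. It checks types before values so the two kinds of string are never compared with each other.

```python
def find_constant(consts, value):
    """Return the index of value in consts, appending it if absent.

    Types are compared first, so a str is never compared to a bytes
    object and an error-raising == can never be reached.
    """
    for i, existing in enumerate(consts):
        if type(existing) is type(value) and existing == value:
            return i
    consts.append(value)
    return len(consts) - 1

consts = []
a = find_constant(consts, "spam")    # new str constant
b = find_constant(consts, b"spam")   # new bytes constant, distinct from the str
c = find_constant(consts, "spam")    # found again at its original index
```

The explicit loop replaces what would otherwise be a single consts.index(value) call, and nothing comparable can be plugged into list.remove or "in".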

In comparison, forcing code to be bytes vs. text aware at the point of I/O directs attention to the place where you can best decide what to do about it. (After all, the comparison that raises the TypeError might occur deep in a library that's expecting to work with text.)

And the warning would do nothing about the issue brought up by Jim Jewett, the unpredictable behavior of a dict with both bytes and strings as keys.

I've looked at all of Jim's messages for September, but I don't see this. I do see where raising TypeError for comparisons causes a problem with dictionaries, but I don't see how an unequal comparison creates "unpredictable" behavior (as opposed to predictable failure to match).
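As a small illustration of the "predictable failure to match" case, here is how a dict with both kinds of key behaves when the comparison simply returns unequal. This assumes the behaviour of released Python 3 run with default interpreter flags, which is not necessarily what was in the tree at the time of this message.

```python
# str and bytes keys with the same characters are distinct entries.
d = {"key": "text value", b"key": "bytes value"}

same = ("key" == b"key")       # unequal, no exception raised
text_hit = d["key"]            # matches only the str key
bytes_hit = d[b"key"]          # matches only the bytes key
missing = d.get(b"other")      # a failed lookup is just a miss
```

Each lookup finds only the key of its own type, so the outcome does not depend on insertion order or hash collisions.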


