[Python-Dev] What to do for bytes in 2.6?

glyph at divmod.com
Sun Jan 20 08:49:56 CET 2008


On 04:26 am, guido at python.org wrote:

On Jan 19, 2008 5:54 PM, <glyph at divmod.com> wrote:

On 19 Jan, 07:32 pm, guido at python.org wrote:

Starting with the most relevant bit before getting off into digressions that may not interest most people:

Why can't we get that warning in -3 mode just the same from something read from a socket and a b"" literal?

If you really want this, please think through all the consequences, and report back here. While I have a hunch that it'll end up giving too many false positives and at the same time too many false negatives, perhaps I haven't thought it through enough. But if you really think this'll be important for you, I hope you'll be willing to do at least some of the thinking.

While I stand by my statement that unicode is the Right Way to do text in python, this particular feature isn't really that important, and I can see there are cases where it might cause problems or make life more difficult. I suspect that I won't really know whether I want the warning anyway before I've actually tried to port any nuanced, real text-processing code to 3.0, and it looks like it's going to be a little while before that happens. I suspect that if I do want the warning, it would be a feature for 2.7, not 2.6, so I don't want to waste a lot of everyone's time advocating for it.

Now for a nearly irrelevant digression (please feel free to stop reading here):

Now, ad-hoc code with a fast and loose definition of "text" can still read arrays of bytes off a socket without specifying an encoding and get away with it, but that's because Python's unicode implementation has thus far been very forgiving, not because the data is cleanly text yet.

I would say that depends on the application, and on arrangements that client and server may have made off-line about the encoding.

I can see your point. I think it probably holds better on files and streams than on sockets, though - please forgive me if I don't think that server applications which require environment-dependent out-of-band arrangements about locale are correct :).
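To make the encoding point concrete, here is a minimal sketch. It assumes Python 3, where bytes stands in for the 2.x str that a socket read would return, and the sample payload is made up for illustration. The same bytes yield different text depending on which encoding the two ends agreed on:

```python
# Python 3 sketch; `raw` is an illustrative payload, not data from a real socket.
raw = b"caf\xc3\xa9"

# Interpreted under two different agreed-upon encodings.
as_utf8 = raw.decode("utf-8")      # 'café'
as_latin1 = raw.decode("latin-1")  # 'cafÃ©'

print(as_utf8 == "caf\u00e9")             # True
print(as_latin1 == "caf\u00c3\u00a9")     # True
print(len(as_utf8), len(as_latin1))       # 4 5
```

Nothing in the bytes themselves says which reading is the right one; that information has to come from the protocol or from an arrangement outside it.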

In 2.x, text can legitimately be represented as str -- there's even the locale module to further specify how it is to be interpreted as characters.

I'm aware that this specific example is kind of a ridiculous stretch, but it's the first one that came to mind. Consider len(u'é'.encode('utf-8').rjust(5).decode('utf-8')). Of course unicode.rjust() won't do the right thing in the case of surrogate pairs, not to mention RTL text, but it still handles a lot more cases than str.rjust(), since code points behave a lot more like characters than code units do.
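Spelled out, the rjust() example behaves as follows. This is a sketch in Python 3 rather than 2.x, so bytes plays the role of str and str the role of unicode; the variable names are only for illustration:

```python
# Python 3 sketch: bytes ~ 2.x str, str ~ 2.x unicode.
text = "\u00e9"  # 'é', a single code point

# Padding the UTF-8 encoded form counts bytes (code units): 'é' is 2 bytes,
# so only 3 spaces are added to reach a width of 5.
padded_bytes = text.encode("utf-8").rjust(5)
via_bytes = padded_bytes.decode("utf-8")

# Padding the text itself counts code points, so 4 spaces are added.
via_text = text.rjust(5)

print(len(via_bytes))  # 4
print(len(via_text))   # 5
```

The byte-level padding comes out one character short of the requested width, which is the discrepancy the expression above is meant to show.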

Sure, this doesn't work for full unicode, and it doesn't work for all protocols used with sockets, but claiming that only fast and loose code ever uses str to represent text is quite far from reality -- this would be saying that the locale module is only for quick and dirty code, which just ain't so.

It would definitely be overreaching to say all code that uses str is quick and dirty. But I do think that it fits into one of two categories: quick and dirty, or legacy. locale is an example of a legacy case for which there is no replacement (that I'm aware of). Even if I were writing a totally unicode-clean application, as far as I'm aware, there's no common replacement for e.g. locale.currency().

Still, locale is limiting. It's ... uncomfortable to call locale.currency() in a multi-user server process. It would be nice if there were a replacement that completely separated encoding issues from localization issues.

I believe that a constraint should be that by default (without -3 or a future import) str and bytes should be the same thing. Or, another way of looking at this, reads from binary files and reads from sockets (and other similar things, like ctypes and mmap and the struct module, for example) should return str instances, not instances of a str subclass by default -- IMO returning a subclass is bound to break too much code. (Remember that there is still lots of code out there that uses "type(x) is types.StringType" rather than "isinstance(x, str)", and while I'd be happy to warn about that in -3 mode if we could, I think it's unacceptable to break that in the default environment -- let it break in 3.0 instead.)
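The breakage from returning a subclass can be sketched like this. The example assumes Python 3, with bytes standing in for 2.x str, and SocketBytes is a hypothetical subclass invented for the illustration, not anything proposed in this thread:

```python
# Python 3 sketch; SocketBytes is a hypothetical name, not a real type.
class SocketBytes(bytes):
    """Stand-in for a distinct 'bytes read from I/O' subclass."""

data = SocketBytes(b"GET / HTTP/1.0")

# The exact-type test, analogous to "type(x) is types.StringType" in 2.x.
exact_check = type(data) is bytes

# The subclass-aware test, analogous to "isinstance(x, str)" in 2.x.
subclass_check = isinstance(data, bytes)

print(exact_check)     # False
print(subclass_check)  # True
```

Code written with the exact-type test would start rejecting perfectly good data as soon as I/O calls returned the subclass, even though the value still behaves like the base type everywhere else.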

I agree. But it's precisely because this is so subtle that it would be nice to have tools which would report warnings to help fix it. Certainly by default, everywhere that's "str" in 2.5 should be "str" in 2.6. Probably even in -3 mode, if the goal there is "warnings only". However, the feature still strikes me as potentially useful while porting. If I were going to advocate for it, though, it would be as a separate option, e.g. "--separate-bytes-type". I see this as separate from just trying to run the code on 3.0 to see what happens, because it seems like the most subtle and difficult aspect of the port to get right; it would be nice to be able to tweak it individually, without the other issues related to 3.0. For example, some of the code I work on has a big stack of dependencies. Some of those are in C, and most of them don't process text at all. Most of them aren't going to port to 3.0 very early, but it would be good to start running in as 3.0-like an environment as possible earlier than that, so that the hard stuff is done by the time the full stack has been migrated.

I've written lots of code that aggressively rejects str() instances as text, as well as unicode instances as bytes, and that's in code that still supports 2.3 ;).

Yeah, well, but remember, while keeping you happy is high on my list

Thanks, good to hear :)

of priorities, it's not the only priority. :-)

I don't think it's even my fiancée's only priority, and I think it should stay higher on her list than yours ;-).


