[Python-Dev] bytes / unicode

Stephen J. Turnbull stephen at xemacs.org
Wed Jun 23 09:07:50 CEST 2010

James Y Knight writes:

> The surrogateescape method is a nice workaround for this, but I can't
> help thinking that it might've been better to just treat stuff as
> possibly-invalid-but-probably-utf8 byte-strings from input, through
> processing, to output.

This is the world we already have, modulo s/utf8/ascii + random GR charset/. It doesn't work, and it can't, in Japan or China or Korea, and probably not in Russia or Kazakhstan, for some time yet.

That's not to say that byte-oriented processing doesn't have its place. And in many cases it's reasonable (but not secure or bulletproof!) to assume ASCII compatibility of the byte stream, passing through syntactically unimportant bytes verbatim. Syntactic analysis of such streams will surely have a lot in common with that for text streams, so the same tools should be available. (That's the point of Guido's endorsement of polymorphism, AIUI.)
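To make the "reasonable but not bulletproof" caveat concrete, here is a minimal sketch in Python 3. The helper name split_header and the sample header are made up for illustration; they are not taken from any stdlib module. It splits on an ASCII delimiter and passes the rest of the bytes through verbatim, then shows an encoding where the ASCII-compatibility assumption breaks.

```python
def split_header(line):
    """Split b'Name: value' at the first ASCII colon, leaving other bytes untouched."""
    name, sep, value = line.partition(b":")
    if not sep:
        raise ValueError("no colon in header line")
    # bytes.strip() only removes ASCII whitespace, so high-bit bytes survive.
    return name.strip(), value.strip()

# EUC-JP keeps every byte of a multibyte character at 0x80 or above,
# so an ASCII colon in the stream can only be a real colon.
raw = "Subject: 日本語".encode("euc-jp")
name, value = split_header(raw)

# Shift_JIS allows ASCII-range trail bytes: the second byte of "表" is 0x5C,
# the same byte as a backslash, so splitting on b"\\" would cut the
# character in half.
sjis = "表".encode("shift_jis")
false_backslash = b"\\" in sjis
```

The first case works only because the encoding happens to keep ASCII bytes unambiguous; the second shows why that cannot be assumed for an arbitrary byte stream.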

But it's just not reasonable to assume that will work in a context where text streams from various sources are mixed with byte streams. In that case, the byte streams need to be converted to text before mixing. (You can't do it the other way around because there is no guarantee that the text is compatible with the current encoding of the byte stream, nor that all the byte streams have the same encoding.)
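A small sketch of that asymmetry, in Python 3. The sample strings and the choice of Latin-1 and EUC-JP are assumptions picked for illustration. Decoding each byte stream with its own encoding always gives text that can be mixed, while encoding existing text into one stream's encoding can simply fail.

```python
text = "naïve café"                           # already text (str)
latin1_bytes = "Grüße".encode("latin-1")      # byte stream in Latin-1
eucjp_bytes = "日本語".encode("euc-jp")        # byte stream in EUC-JP

# Bytes -> text: decode each stream with its own encoding, then mix freely.
mixed = " / ".join([
    text,
    latin1_bytes.decode("latin-1"),
    eucjp_bytes.decode("euc-jp"),
])

# Text -> bytes: there is no guarantee the text fits the stream's encoding.
try:
    "日本語".encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
```

Even when the encode step succeeds, concatenating the two byte streams directly would yield bytes that no single codec describes, which is the second half of the objection above.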

We do need str-based implementations of modules like urllib.
