[Python-Dev] Multilingual programming article on the Red Hat Developer blog

R. David Murray rdmurray at bitdance.com
Tue Sep 16 21:29:30 CEST 2014


On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico <rosuav at gmail.com> wrote:

On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray <rdmurray at bitdance.com> wrote:
>> You can't treat them as characters, so while you have them in your
>> string, you can't treat it as a pure Unicode string - it's a Unicode
>> string with smuggled bytes.
>
> Well, except that I do. The email header parsing algorithms all work
> fine if I treat the surrogate escaped bytes as 'unknown junk' and just
> parse based on the valid unicode. (Unless the header is so garbled that
> it can't be parsed, of course, at which point it becomes an invalid
> header).

Do what, exactly? As I understand you, you treat the unknown bytes as completely opaque, not representing any characters at all. Which is what I'm saying: those are not characters.

Yes. I thought you were saying that one could not treat the string with smuggled bytes as if it were a string. (It's a string that can't be encoded unless you use the surrogateescape error handler, but it is still a string from Python's POV, which is the point of the error handler).
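For readers following along, here is a minimal sketch of what "a string that can't be encoded unless you use the surrogateescape error handler" looks like in practice. The sample header bytes are made up for illustration; only the standard `bytes.decode` and `str.encode` calls are used.

```python
# Illustrative input: 0xE9 is not valid ASCII, so it cannot be decoded strictly.
raw = b"Subject: caf\xe9 menu"

# With surrogateescape, the undecodable byte 0xE9 becomes the lone surrogate
# U+DCE9. The result is an ordinary str as far as Python is concerned.
text = raw.decode("ascii", errors="surrogateescape")

# A strict encode refuses the lone surrogate...
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the same codec and error handler used for decoding gives back
# the original bytes unchanged.
round_tripped = text.encode("ascii", errors="surrogateescape")
```

Each smuggled byte occupies exactly one position in the string, so `len(text) == len(raw)` here.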

Or, to put it another way, your implication was that there were no string operations that could be usefully applied to a string containing smuggled bytes, but that is not the case. (I may well have read an implication that was not there; if so I apologize and you can ignore the rest of this :)

Basically, we are pretending that each smuggled byte is a single character for string parsing purposes...but they don't match any of our parsing constants. They are all "any character" matches in the regexes and what have you. Of course, this only works in contexts where we can ignore or "carry along" the smuggled bytes as being components of "arbitrary text" portions of the syntax, and we must take care to either replace them with valid unicode error glyphs or turn the string of which they are a part into binary using the same codec and error handler as we used to ingest them to begin with before emitting them. And, of course, we can't modify the sections containing the smuggled bytes, only the syntax-matched sections that surround them; and things like line wrapping are just an invitation to ugliness and bugs even if you kept the smuggled bytes sections internally intact.
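A rough sketch of that idea, assuming a much-simplified header pattern (the regex and the `for_display` helper are hypothetical and are not the email package's actual parser): the smuggled byte falls inside an "arbitrary text" portion, matches `.` like any other character, and is either replaced with U+FFFD for display or re-encoded with the same codec and error handler on output.

```python
import re

# Illustrative input: the display name contains one undecodable byte (0xF6).
raw = b"From: J\xf6rg <jorg@example.com>"
header = raw.decode("ascii", errors="surrogateescape")

# Hypothetical, simplified pattern. The display name is "arbitrary text",
# so the smuggled byte is carried along by the "any character" match.
pattern = re.compile(r"From: (?P<name>.*?) <(?P<addr>[^<>@\s]+@[^<>\s]+)>$")
match = pattern.match(header)
name, addr = match.group("name"), match.group("addr")

def for_display(s):
    """Replace smuggled bytes (U+DC80..U+DCFF) with the replacement character."""
    return "".join("\ufffd" if "\udc80" <= ch <= "\udcff" else ch for ch in s)

# Option 1: show an error glyph instead of the unknown byte.
shown = for_display(name)

# Option 2: turn the string back into bytes with the same codec and handler.
emitted = header.encode("ascii", errors="surrogateescape")
```

The syntax-matched part (`addr`) is plain ASCII and safe to manipulate; the `name` part is only carried along.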

Finally, to explain what I meant by "except that I do": when I added back binary support to the email package in Python3, initially I did not change the parsing algorithms in the code. I just smuggled the bytes, and then dealt with the encoding/decoding at the API boundaries. This is the same principle used when dealing with filenames in the API of Python itself. Except at that boundary, I do not need to worry about whether a particular string contains smuggled bytes or not.[*]
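The boundary principle might be sketched like this. The function names `ingest`, `emit`, `header_name`, and `has_smuggled_bytes` are invented for illustration and do not correspond to anything in the email package; the point is only that the str-based code in the middle never needs to know whether smuggled bytes are present.

```python
def ingest(data):
    """API boundary (input): bytes come in, undecodable ones are smuggled."""
    return data.decode("ascii", errors="surrogateescape")

def emit(text):
    """API boundary (output): smuggled bytes are turned back into real bytes."""
    return text.encode("ascii", errors="surrogateescape")

def header_name(text):
    """Internal str-only code, written with no knowledge of smuggling."""
    return text.split(":", 1)[0].strip().lower()

def has_smuggled_bytes(text):
    """Check a boundary can use to decide how a string must be emitted."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)

clean = ingest(b"Subject: hello")
dirty = ingest(b"Subject: h\xe9llo")

# The same internal code handles both inputs identically.
names = (header_name(clean), header_name(dirty))
```

This sketch assumes ASCII-plus-junk input; a string that also held genuine non-ASCII text would need a different codec at `emit`.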

If you, instead, represented the header as a list with some str elements and some bytes, it would be just as valid (though much harder to work with); all your manipulations are done on the str parts, and the bytes just tag along for the ride.

Quite a bit harder, which is why I don't do that.

> You are right about the wrapping, though. If a header with invalid
> bytes (and in this scenario we are talking about errors) needs to
> be wrapped, we have to first decode the smuggled bytes and turn it
> into an 'unknown-8bit' encoded word before we can wrap the header.

Yeah, and that's going to be a bit messy. If you get 60 characters followed by 30 unknown bytes, where do you wrap it? Dare you wrap in the middle of the smuggled section?

The point of RFC2047 encoded words is that they are an ASCII representation of binary data, so once the bytes are "properly" Content Transfer Encoded (as being in an unknown charset) the string contains no smuggled bytes and can be wrapped.
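As a simplified illustration (this is not the email package's own code path, and `to_unknown_8bit_word` is a hypothetical helper), the conversion amounts to recovering the original bytes and base64-encoding them inside an RFC 2047 encoded word labelled with the `unknown-8bit` charset. The result is pure ASCII with no surrogates left in it.

```python
import base64

def to_unknown_8bit_word(text):
    """Turn a str with smuggled bytes into one ASCII-only encoded word.

    Simplified: assumes everything other than the smuggled bytes is ASCII,
    and ignores the RFC 2047 length limit on a single encoded word.
    """
    raw = text.encode("ascii", errors="surrogateescape")
    payload = base64.b64encode(raw).decode("ascii")
    return "=?unknown-8bit?b?" + payload + "?="

smuggled = b"caf\xe9".decode("ascii", errors="surrogateescape")
word = to_unknown_8bit_word(smuggled)

# The encoded word contains only ASCII, so a strict encode now succeeds.
ascii_bytes = word.encode("ascii")
```

A real implementation would also have to split long runs across several encoded words so that each stays within the RFC 2047 length limit, which is where the wrapping happens.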

--David

[*] I worried a lot that this was re-introducing the bytes/string problem from python2. The difference is that if the smuggled bytes escape from the email API, that's a bug in the email package. So user code using the library is not in danger of getting mysterious encoding errors when one day the input is international where before it was all ASCII. (Absent bugs in the library.)


