[Python-Dev] Patch making the current email package (mostly) support bytes

Stephen J. Turnbull stephen at xemacs.org
Tue Oct 5 07:41:12 CEST 2010


R. David Murray writes:

On Mon, 04 Oct 2010 12:32:26 -0400, Scott Dial <scott+python-dev at scottdial.com> wrote:

On 10/2/2010 7:00 PM, R. David Murray wrote:

The clever hack (thanks ultimately to Martin) is to accept 8bit data by decoding it using the ASCII codec and the surrogateescape error handler.

I've seen this idea pop up in a number of threads. I worry that you are all inventing a new kind of dual that is a direct parallel to Python 2.x strings.

Yes, that is exactly my worry.

I don't worry about this. Strings generated by decoding with surrogate-escape are different from other strings: they contain invalid code units (the naked surrogates). These cannot be encoded except by passing the surrogateescape error handler to .encode(), and a sane developer won't do that unless she knows precisely what she's doing. This is not true of Python 2 strings, where all bytes are valid.
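To make the difference concrete, here is a minimal sketch in Python 3. The sample header bytes are made up for illustration and do not come from the patch under discussion.

```python
# Sample 8-bit data; the byte 0xE9 is not valid ASCII (made-up example).
raw = b"Subject: caf\xe9"

# Decoding with surrogateescape never fails: each undecodable byte 0xNN
# is mapped to the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")

# A plain encode refuses the lone surrogate...
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# ...while encoding with the same error handler restores the original bytes.
round_tripped = text.encode("ascii", errors="surrogateescape")
```

The string cannot leak into ordinary output by accident: an encode without the handler raises, and only an explicit request for surrogateescape turns it back into the original bytes.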

Any reasonable 2.x code has to guard on str/unicode and it would seem in 3.x, if this idiom spreads, reasonable code will have to guard on surrogate escapes (which actually seems like a more expensive test).

Right, I mentioned that concern in my post.

Again, I don't worry about this. It is not an extra cost. Those messages are already broken; they will crash the email module if you fail to guard against them. Decoding them to surrogates actually makes it easier to guard, because you know that even if broken encodings are present, the parser will still work. Broken encodings can no longer crash the parser. That is a Very Good Thing IMHO.
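For what such a guard might look like in application code, here is one possible sketch. The helper name `has_escaped_bytes` is hypothetical and is not part of the email package; it relies only on the fact that surrogateescape maps bytes 0x80-0xFF to U+DC80-U+DCFF.

```python
def has_escaped_bytes(s):
    """Return True if s contains lone surrogates of the kind surrogateescape produces."""
    # A linear scan over the string; surrogateescape only ever generates
    # code points in the range U+DC80..U+DCFF.
    return any("\udc80" <= ch <= "\udcff" for ch in s)


# Made-up sample data: one clean header, one with an invalid 8-bit byte.
clean = b"Subject: hello".decode("ascii", errors="surrogateescape")
broken = b"Subject: caf\xe9".decode("ascii", errors="surrogateescape")
```

Here `has_escaped_bytes(clean)` is false and `has_escaped_bytes(broken)` is true, and the check runs on an already parsed string rather than in an exception handler around the parser.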

Only if the email package contains a coding error would the surrogates escape and cause problems for user code.

I don't think it is reasonable to internalize surrogates that way; some applications will want to look at them and do something useful with them (delete them or replace them with U+FFFD or ...). However, I argue below that the presence of surrogates already means the user code is under fire, and this puts the problem in a canonical form so the user code can prepare for it (if that is desirable).
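As an illustration of the "delete them or replace them with U+FFFD" option, an application could post-process the parsed text roughly as follows. This is only a sketch; `replace_escaped` is a hypothetical helper name, not an API of the email package.

```python
import re

# Lone surrogates in the range used by surrogateescape (U+DC80..U+DCFF).
_ESCAPED = re.compile("[\udc80-\udcff]")


def replace_escaped(s, replacement="\ufffd"):
    """Replace each escaped byte with `replacement` (pass "" to delete it)."""
    return _ESCAPED.sub(replacement, s)


# Made-up sample: a header value containing one invalid 8-bit byte.
value = b"caf\xe9 au lait".decode("ascii", errors="surrogateescape")
replaced = replace_escaped(value)      # escaped byte becomes U+FFFD
deleted = replace_escaped(value, "")   # escaped byte is dropped
```

Either result contains no surrogates, so it can be encoded to UTF-8 normally afterwards.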

It seems like this hack is about making the 3.x unicode type more like the 2.x string type,

Not at all. It's about letting the parser be a parser, and letting the application handle broken content, or discard it, or whatever. Modularity is improved. This has been a major PITA for Mailman support over the years: every time the spammers and virus writers come up with a new idea, there's a chance it will leak out and the email parser will explode, stopping the show. These kinds of errors are a FAQ on the Mailman lists (although much less so in recent years).

How will developers not have to ask themselves whether a given string is a "real" string or a byte sequence masquerading as a string? Am I missing something here?

There are two things to say, actually. First, you're in a war zone. All email is byte sequences masquerading as text, and if you're not wearing armor, you're going to get burned. The idea here is to have the email package provide the armor and enough instrumentation so you can do bomb detection yourself (or perhaps just let it blow, if you're hacking up a quick and dirty script).

Second, there are developers who will not care whether strings are "real" or "byte sequences in drag", because they're writing MTAs and the like. Those people get really upset, and rightly so, when the parser pukes on broken headers; it is not their app's job at all to deal with that breakage.

I think this question is something that needs to be considered any time using surrogates is proposed.

I don't agree. The presence of naked surrogates is always (assuming sane programmers) an indication of invalid input. The question is, should the parser signal invalidity, or should it allow the application to decide? The email module doesn't have enough information to decide whether the invalid input is a "real" problem, or how to handle it (cf. the example of an MTA app). Note that a completely naive app doesn't care -- it will crash either way because it doesn't handle the exception, whether it's raised by the parser or by a codec when the app tries to do I/O. A robust app does care: if the parser raises, then the app must provide an alternative parser good enough to find and fix the invalid bytes. Clearly it's much better to pass invalid (but fully parsed) text back to the app in this case.

Note that if the app really wants the parser to raise rather than pass on the input, that should be easy to implement at fairly low cost; you just provide a variable rather than hardcoding the surrogate-escape flag.
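A rough sketch of what "provide a variable" could mean follows. The function `decode_raw` and its `errors` parameter are hypothetical and only illustrate the idea; this is not the email package's actual interface.

```python
def decode_raw(raw, errors="surrogateescape"):
    """Hypothetical helper: decode raw message bytes as ASCII.

    errors="surrogateescape" passes invalid bytes through as lone surrogates;
    errors="strict" raises UnicodeDecodeError instead.
    """
    return raw.decode("ascii", errors=errors)


# Made-up sample data with one invalid 8-bit byte.
raw = b"Subject: caf\xe9"

# Default: the invalid byte survives as a surrogate for the app to handle.
lenient = decode_raw(raw)

# An app that prefers an exception asks for strict decoding.
try:
    decode_raw(raw, errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

The default keeps the parser from blowing up, while an application that wants the old behavior opts in with a single argument.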


