[Python-Dev] Patch making the current email package (mostly) support bytes

R. David Murray rdmurray at bitdance.com
Thu Oct 7 02:46:08 CEST 2010


Stephen J. Turnbull <stephen xemacs.org> writes:

R. David Murray writes:
> We're (in the current patch) not punting on handling non-conforming
> email, we're punting on handling non-conforming bytes *if the headers
> that contain them need to be modified*. The headers can still be
> modified, you just (currently) lose the non-ASCII bytes in the process.

Modified or examined. I can't think of any important applications offhand that need to examine the non-ASCII bytes (in particular, Mailman doesn't need to do that). Verbatim copying of the bytes themselves is almost always the desired usage.

Mmm. Yes, or examined. If we allow escaped bytes to be returned, perhaps we also should provide a helper that "unescapes" the bytes and returns the byte string (yes, this is just a call to encode, but by wrapping it we continue to hide the surrogateescape implementation detail.)
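A minimal sketch of what such a helper might look like, assuming the escaped header value is a str produced by decoding with the `surrogateescape` error handler. The name `unescape_header_bytes` is hypothetical and not part of the patch or of the email package:

```python
def unescape_header_bytes(value: str) -> bytes:
    """Hypothetical helper: recover the original bytes from a header value
    that was decoded as ASCII with the surrogateescape error handler."""
    # Each lone surrogate U+DC80..U+DCFF maps back to the byte 0x80..0xFF.
    return value.encode("ascii", "surrogateescape")


# Round trip: non-ASCII bytes survive decoding and unescaping unchanged.
raw = b"p\xc3\xb6stal"
escaped = raw.decode("ascii", "surrogateescape")
assert unescape_header_bytes(escaped) == raw
```

Wrapping the `encode` call this way means callers never have to name the error handler themselves.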

> And robustness is not the issue, only extended-beyond-the-RFCs handling
> of non-conforming bytes would be an issue.

And with that, I'm certain that Jon Postel is really dead.

A goal for email6 is to be at least as Postel compliant as email4. The goal for my patch is to make email5.1 more Postel compliant than email5.0 is :)

> > (Surely you are not saying that Generator.flatten can't DTRT with
> > non-ASCII content at all?)
>
> Yes, that is exactly what I am saying:
>
> >>> m = email.message_from_string("""
> ... From: pöstal
> ...
> ... """)
> >>> str(m)
> Traceback (most recent call last):
> ....
> UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128)

But that's not interesting; you did that with Python 3. We want to

Of course I did it with Python3. It's the Python3 email codebase I'm working with (and have to work around).

know what people porting from Python 2 will expect. So, in 2.5.5 or 2.6.6 on Mac, with email v4.0.2, it doesn't raise, it returns

wideload:~ 4:14$ python
Python 2.5.5 (r255:77872, Jul 13 2010, 03:03:57)
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> m=email.message_from_string('From: pöstal\n\n')
>>> str(m)
'From nobody Thu Oct 7 04:18:25 2010\nFrom: p\xc3\xb6stal\n\n'
>>> m['From']
'p\xc3\xb6stal'
>>>

That's hardly helpful! Surely we can and should do better than that now, especially since UTF-8 (with a proper CTE) is now almost universally acceptable to MUAs. When would it be a problem for that to return

'From nobody Thu Oct 7 04:18:25 2010\nFrom: =?UTF-8?Q?p=C3=B6stal?=\n\n'

What's wrong with that is that when we parse the bytes of the message we don't know that b'\xc3\xb6' == '=?UTF-8?Q?=C3=B6?='. It isn't even all that likely to be true, since I would guess that latin1 is still more common than utf-8 (but you might know better).
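To illustrate the ambiguity, here is a small sketch using only `email.header` from the standard library. The same two bytes are valid in both charsets, so the parser has no basis for choosing which RFC 2047 encoded word to emit; the variable names are just for illustration:

```python
from email.header import Header, decode_header

raw = b"\xc3\xb6"

# The same bytes are valid in both charsets but mean different text.
as_utf8 = raw.decode("utf-8")       # one character: o with diaeresis
as_latin1 = raw.decode("latin-1")   # two characters: A with tilde, pilcrow

# Each reading yields a different, equally well-formed encoded word.
utf8_word = Header(as_utf8, charset="utf-8").encode()
latin1_word = Header(as_latin1, charset="iso-8859-1").encode()

# Both encoded words carry the same payload bytes; only the declared
# charset differs, and the raw input never told us which one is right.
assert decode_header(utf8_word)[0][0] == raw
assert decode_header(latin1_word)[0][0] == raw
assert utf8_word != latin1_word
```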

> Remember, email5 is a direct translation of email4, and email4 only
> handled ASCII and
> oh-by-the-way-if-there-are-bytes-along-for-the-ride-fine-we'll-pass-them-along.
> So if you want to put non-ASCII data into a message you have to encode
> it properly to ASCII in exactly the same way that you did in email4:

But if you do it right, then it will still work in a version that just encodes non-ASCII characters in UTF-8 with the appropriate CTE. Since you'll never be passing it non-ASCII characters, it's already ASCII and UTF-8, and no CTE will be needed.

So you are suggesting that I should use U+FFFD encoded as UTF-8 rather than '?' as the substitution character? But earlier you said that people would probably rather not be forced to deal with Unicode just because there are invalid bytes in the message. So that's probably not what you meant.

Presumably you are suggesting that email5 be smart enough to turn my example into properly UTF-8/CTE encoded text. But that problem is what email6 is trying to address. It just doesn't look practical to address it directly in the email5 code base, because the email4 codebase that email5 inherits does not provide the correct distinction between bytes and text. email5 is parsing the input stream as if it were ASCII-only CTE text. I'm trying to extend it to also handle non-ASCII bytes gracefully. Extending it to actually handle unicode input is a whole different kettle of sushi[*].

> Yes, exactly. I need to fix the patch to recode using, say,
> quoted-printable in that case.

It really should check for proportions of non-ASCII. QP would be horrible for Japanese or Chinese.
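One way such a check could look, as a rough sketch only. The function name `choose_cte` and the one-sixth threshold are assumptions of this example, not anything in the patch. The threshold comes from a size estimate: quoted-printable turns each non-ASCII byte into three characters, while base64 grows everything by about a third, so QP stops being smaller once roughly one byte in six is non-ASCII. The sketch also ignores line-length and control-character constraints:

```python
def choose_cte(payload: bytes, threshold: float = 1 / 6) -> str:
    """Hypothetical heuristic: pick a Content-Transfer-Encoding from the
    proportion of non-ASCII bytes in the payload."""
    non_ascii = sum(1 for byte in payload if byte > 127)
    if non_ascii == 0:
        # Pure ASCII needs no transfer encoding (line lengths aside).
        return "7bit"
    ratio = non_ascii / len(payload)
    # Mostly-ASCII text stays readable and compact in quoted-printable;
    # text dominated by non-ASCII bytes is smaller in base64.
    return "quoted-printable" if ratio < threshold else "base64"


# Mostly-ASCII text with one accented character versus Japanese text.
print(choose_cte("The name is p\u00f6stal, plainly.".encode("utf-8")))
print(choose_cte("\u65e5\u672c\u8a9e\u306e\u30e1\u30fc\u30eb".encode("utf-8")))
```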

Noted.

> DecodedGenerator could still produce the unicode, though, which is
> what I believe we want. (Although that raises the question of
> whether DecodedGenerator should also decode the RFC2047 encoded
> headers....but that raises a backward compatibility issue).

Can't really help you there. While I would want the RFC 2047 headers decoded if I were writing new code (which is generally the case for me), I haven't really wrapped my head around the issues of porting old code using Python2 str to Python3 str here. My intuition says "no problem" (there won't be any MIME-words so the app won't try to decode them), but I'm not real sure of that.

Thinking about this further, I think it is unlikely that an application using DecodedGenerator would be further processing the headers generated by it, so I think this is probably a safe enough change, given that there are few if any Python3 email handling applications at this point. If anyone knows of a Python2 application that does post-process DecodedGenerator headers, please let me know.
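For reference, decoding an RFC 2047 header value with the existing `email.header` API looks roughly like this. The wrapper name `decode_rfc2047` is hypothetical, and this is not the DecodedGenerator change itself, only a sketch of the decoding step it would perform:

```python
from email.header import decode_header, make_header


def decode_rfc2047(value: str) -> str:
    """Hypothetical wrapper: turn a header value containing RFC 2047
    encoded words into the unicode text it represents."""
    # decode_header splits the value into (chunk, charset) pairs;
    # make_header reassembles them into a Header that str() renders as text.
    return str(make_header(decode_header(value)))


# An encoded word is decoded; a plain ASCII value passes through unchanged.
assert decode_rfc2047("=?UTF-8?Q?p=C3=B6stal?=") == "p\u00f6stal"
assert decode_rfc2047("postal") == "postal"
```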

--David

[*] And I've had an argument with someone who thinks email should not be extended to handle unicode messages with non-ASCII content, on the grounds that they aren't really email.


