[Python-Dev] Patch making the current email package (mostly) support bytes

Stephen J. Turnbull stephen at xemacs.org
Wed Oct 6 21:40:03 CEST 2010


R. David Murray writes:

So the only parsing issue is if Mailman cares about the non-ASCII bytes in the headers it cares about. If it has to modify headers that contain non-ASCII bytes (for example, addresses and Subject) and cares about preserving the non-ASCII bytes, then there is indeed an issue; see previous email for a possible way around that.

OK.

I thought mailman no longer distributed its own version of email?

I believe so; the point is that it could do so again.

And the email API currently promises not to raise during parsing, which is a contract my patch does not change.

Which is a contract that has historically been broken frequently. Unhandled UnicodeErrors have been one of the most common causes of queue stoppage in Mailman (exceeded only by configuration errors AFAICS). I haven't seen any reports for a while, but with the email package being reengineered from the ground up, the possibility of regression can't be ignored.

Granted, there should be no regression problem in the current model for Email5, AIUI.

We're (in the current patch) not punting on handling non-conforming email, we're punting on handling non-conforming bytes if the headers that contain them need to be modified. The headers can still be modified, you just (currently) lose the non-ASCII bytes in the process.

Modified or examined. I can't think of any important applications offhand that need to examine the non-ASCII bytes (in particular, Mailman doesn't need to do that). Verbatim copying of the bytes themselves is almost always the desired usage.
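A minimal sketch of the verbatim-copy case, assuming a later Python 3 release in which the email package provides `email.message_from_bytes` and `Message.as_bytes` (neither is part of the patch as discussed in this thread). The sample message is invented for illustration.

```python
import email

# Hypothetical raw message: the From header carries unencoded UTF-8 bytes,
# which is non-conforming but common in the wild.
raw = b"From: p\xc3\xb6stal\nSubject: test\n\nbody\n"

# Parsing is expected not to raise, even on non-ASCII header bytes.
msg = email.message_from_bytes(raw)

# Flatten again without modifying any header; the non-ASCII bytes
# should come back out exactly as they went in.
copied = msg.as_bytes()

print(copied == raw)
```

The point of the sketch is only that an application which neither examines nor modifies the offending header can pass it through unchanged.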

And robustness is not the issue, only extended-beyond-the-RFCs handling of non-conforming bytes would be an issue.

And with that, I'm certain that Jon Postel is really dead. :-(

(Surely you are not saying that Generator.flatten can't DTRT with non-ASCII content at all?)

Yes, that is exactly what I am saying:

    >>> m = email.message_from_string("""
    ... From: pöstal
    ...
    ... """)
    >>> str(m)
    Traceback (most recent call last):
      ....
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128)

But that's not interesting; you did that with Python 3. We want to know what people porting from Python 2 will expect. So, in 2.5.5 or 2.6.6 on Mac, with email v4.0.2, it doesn't raise, it returns

    wideload:~ 4:14$ python
    Python 2.5.5 (r255:77872, Jul 13 2010, 03:03:57)
    [GCC 4.0.1 (Apple Inc. build 5490)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import email
    >>> m=email.message_from_string('From: pöstal\n\n')
    >>> str(m)
    'From nobody Thu Oct 7 04:18:25 2010\nFrom: p\xc3\xb6stal\n\n'
    >>> m['From']
    'p\xc3\xb6stal'

That's hardly helpful! Surely we can and should do better than that now, especially since UTF-8 (with a proper CTE) is now almost universally acceptable to MUAs. When would it be a problem for that to return

    'From nobody Thu Oct 7 04:18:25 2010\nFrom: =?UTF-8?Q?p=C3=B6stal?=\n\n'
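For reference, a header value of that shape is an RFC 2047 encoded word. A small sketch, assuming Python 3 and the existing `email.header` API, of producing one from a non-ASCII string and decoding it back; whether the library picks the Q or the B encoding for a given string is left to its defaults here.

```python
from email.header import Header, decode_header, make_header

name = "pöstal"

# Encode the non-ASCII value as an RFC 2047 encoded word in UTF-8.
encoded = Header(name, charset="utf-8").encode()

# The encoded form is pure ASCII, so it is safe in a message header.
encoded.encode("ascii")

# Decoding the encoded word recovers the original text.
decoded = str(make_header(decode_header(encoded)))

print(encoded)
print(decoded == name)
```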

Remember, email5 is a direct translation of email4, and email4 only handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the-ride-fine-we'll-pass-them-along. So if you want to put non-ASCII data into a message you have to encode it properly to ASCII in exactly the same way that you did in email4:

But if you do it right, then it will still work in a version that just encodes non-ASCII characters in UTF-8 with the appropriate CTE. Since you'll never be passing it non-ASCII characters, it's already ASCII and UTF-8, and no CTE will be needed.

Yes, exactly. I need to fix the patch to recode using, say, quoted-printable in that case.

It really should check for proportions of non-ASCII. QP would be horrible for Japanese or Chinese.
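A rough sketch of such a check, in Python 3. The function name `choose_cte` and the one-in-six threshold are assumptions made for illustration, not anything in the patch: quoted-printable spends three output bytes per non-ASCII byte while base64 spends four per three input bytes, so the break-even is at roughly one non-ASCII byte in six. The estimate ignores line folding and the other characters QP must escape.

```python
def choose_cte(payload: bytes, threshold: float = 1 / 6) -> str:
    """Pick a Content-Transfer-Encoding from the share of non-ASCII bytes.

    Hypothetical helper: the threshold is an approximation, not a rule
    taken from the email package.
    """
    non_ascii = sum(1 for b in payload if b >= 0x80)
    if non_ascii == 0:
        # Nothing to protect; plain ASCII needs no encoding here.
        return "7bit"
    ratio = non_ascii / len(payload)
    # Mostly-ASCII text stays readable and compact in quoted-printable;
    # text dominated by multibyte characters is smaller in base64.
    return "quoted-printable" if ratio <= threshold else "base64"


print(choose_cte("From pöstal with best regards".encode("utf-8")))
print(choose_cte("こんにちは".encode("utf-8")))
```

With these inputs the mostly-ASCII European text falls under the threshold and the Japanese text, being entirely multibyte, falls over it.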

DecodedGenerator could still produce the unicode, though, which is what I believe we want. (Although that raises the question of whether DecodedGenerator should also decode the RFC2047 encoded headers....but that raises a backward compatibility issue).

Can't really help you there. While I would want the RFC 2047 headers decoded if I were writing new code (which is generally the case for me), I haven't really wrapped my head around the issues of porting old code using Python2 str to Python3 str here. My intuition says "no problem" (there won't be any MIME-words so the app won't try to decode them), but I'm not real sure of that. ;-)


