[Python-Dev] Patch making the current email package (mostly) support bytes

R. David Murray rdmurray at bitdance.com
Wed Oct 6 23:39:14 CEST 2010


On Thu, 07 Oct 2010 03:31:34 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:

R. David Murray writes:

> 5. Return the content, with non-ASCII bytes replaced with ?
> characters.

That hadn't occurred to me (and it makes me sick to contemplate it). That said, this is probably good enough for Mailman-like apps to limp along for "most" users. It's certainly good enough for the "might kick your wife and elope with your dog" alpha ports of Mailman to Python 3 (well, as certain as I can be; of course in the end Barry decides). Assuming reasonable backward compatibility of the API, of course!

Yeah, "good enough" is pretty much the goal here.
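As a rough illustration of option 5 (not code from the patch), replacing non-ASCII bytes with ? can be sketched with the standard bytes.decode error handlers. The helper name sanitize_to_ascii is hypothetical and is not part of the email package.

```python
def sanitize_to_ascii(raw: bytes) -> str:
    """Decode as ASCII, turning every non-ASCII byte into '?'.

    Hypothetical helper for illustration; not part of the email package.
    """
    # errors="replace" maps each undecodable byte to U+FFFD;
    # swap that marker for a plain '?'.
    return raw.decode("ascii", errors="replace").replace("\ufffd", "?")


print(sanitize_to_ascii(b"Subject: caf\xe9 au lait"))
```

The original byte is lost once it has been replaced, which is why this only counts as "good enough".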

> In other words, my proposed patch only makes email5 1/8 to 1/4 broken, instead of half broken as it is now. But not un-broken enough for Mailman, it sounds like.

IMO, not in the long run. But realistically, in the applications I know of, most desired traffic is conformant, and since there aren't any Python 3 email apps yet, this isn't even a regression. :-/ I do think that it's important that the parsed object be able to tell you what fields are there (except if the field name itself is invalid) and return field bodies parsed as far as possible.

Well, email doesn't currently parse the bodies any further by itself. You have to call parsing routines to get further parsing. So maybe what I should do is work on finalizing the patch without addressing the 'give me the escaped bytes' issue, and then prepare a follow-on patch that adds that keyword and adjusts the header parsing helpers accordingly.

> If we go this route (as opposed to only handling headers with 8bit data by sanitizing them), then we need to think about the email5 header parsers as well (decode_header and parseaddr). They are of course going to have the same problems as the rest of the email package with parsing bytes, and you are suggesting that access to those header 8bit bytes is needed.

Yes, that would be preferable to replacing them with ASCII junk. But I don't see any problem with parsing them; they're syntactically insignificant by definition. The problem is purely on output: do I get verbatim escaped bytes, a sanitized str, or an exception?

Right, the needed changes should be sanitizing by default, and providing the keyword to get the escaped bytes. Mostly it'll be writing tests :)
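A minimal sketch of "sanitize by default, escaped bytes on request" follows. It assumes that "escaped bytes" means the surrogateescape error handler (PEP 383), which the thread above does not spell out. The function header_text is hypothetical; only the keyword name escape_bytes comes from this discussion.

```python
def header_text(raw: bytes, escape_bytes: bool = False) -> str:
    """Return a header value as str (hypothetical helper, not email API).

    By default non-ASCII bytes are sanitized to '?'. With
    escape_bytes=True they are smuggled through as lone surrogates
    (assumption: 'escaped' means the surrogateescape handler).
    """
    if escape_bytes:
        return raw.decode("ascii", errors="surrogateescape")
    return raw.decode("ascii", errors="replace").replace("\ufffd", "?")


raw = b"caf\xe9"
sanitized = header_text(raw)                   # lossy, safe to print
escaped = header_text(raw, escape_bytes=True)  # lossless, holds a surrogate
# The escaped form can be turned back into the original bytes.
recovered = escaped.encode("ascii", errors="surrogateescape")
```

The escaped form is the error-prone one: it cannot be encoded to UTF-8 or written to most text streams without the same error handler.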

> Does my proposal make sense? But note, it raises exactly the backward compatibility concerns you mention in your next email (that I will reply to next). It is an open question whether it is worth opening that door in order to be able to do extended handling on non-RFC conforming email (as opposed to just sanitizing it and soldiering on).

Well, maybe not. However, it is not obvious to me that you won't run into these issues again in Email6. Applications that think of email as textual objects are going to want to make their own choices about handling of non-conforming email, and it's likely to be massively inconvenient to say "OK, but you have to use bytes interfaces exclusively, because the str interfaces don't handle that."

The strategy in email6 so far is for the application program to be able to access any piece of the parsed data as either text or bytes, and for the header parsers to record defects when there are non-ASCII bytes where there aren't supposed to be. So the application can check for defects and retrieve, say, the comment field that has the non-ASCII as bytes and decode it. Or, if it doesn't care about parsing them, it just modifies the fields it wants to modify that are valid, and the invalid non-ASCII comment gets carried along and emitted when the message is serialized as bytes.

This is more or less what we are talking about enabling in email5 with the 'escape_bytes=True' keyword; it's just a less structured and more error-prone approach to it than what we have planned for email6.
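The "carried along and emitted when the message is serialized as bytes" behaviour can be illustrated with the email package as it ships in current Python 3 (message_from_bytes, replace_header, as_bytes). This is a sketch against today's standard library, not the email5 patch or the email6 design discussed here, and it leaves out the defect-recording part. The header name X-Note and the addresses are made up for the example.

```python
import email

# One valid header and one header carrying a raw, undeclared 8-bit byte.
raw = b"From: a@example.com\nX-Note: caf\xe9\n\nbody\n"

msg = email.message_from_bytes(raw)

# Modify only the field we care about; the invalid one is left untouched.
msg.replace_header("From", "b@example.com")

# Serializing as bytes emits the untouched header with its original byte.
out = msg.as_bytes()
print(out)
```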

-- R. David Murray www.bitdance.com


