[Python-Dev] Patch making the current email package (mostly) support bytes

R. David Murray rdmurray at bitdance.com
Thu Oct 7 17:15:18 CEST 2010


On Thu, 07 Oct 2010 15:00:04 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:

R. David Murray writes:

> > But that's not interesting; you did that with Python 3. We want to

> Of course I did it with Python3. It's the Python3 email codebase I'm working with (and have to work around).

Sure. My point is that it has nothing to do with the expectations of people trying to upgrade their apps to Python 3, and meeting those expectations is an important requirement of the specification of email5, right?

Well, not necessarily, no. Python3 broke backward compatibility. Some changes are going to have to be made in user code to make it work with email5. Where we can minimize those changes we should, but it isn't a requirement, no. With my patch, the minimization will be message_from_string --> message_from_bytes, message_from_file --> message_from_binary_file, and in some cases Generator --> BytesGenerator, for those programs that need to deal with wire format data that is not 7bit clean. Programs that only generate emails should need few if any changes, but that is already true (that's the half of email that is working :).
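As a rough illustration of that migration, here is a minimal sketch using the names mentioned above (message_from_bytes, message_from_binary_file, BytesGenerator). The sample message and addresses are made up for the example, and the sketch assumes a Python 3 version in which these names are available in the email package.

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# Hypothetical wire-format message whose body is not 7bit clean.
raw = (
    b"From: sender@example.com\n"
    b"To: receiver@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"p\xc3\xb6stal\n"
)

# Was: email.message_from_string(text)
msg = email.message_from_bytes(raw)

# Was: email.message_from_file(text_file)
with BytesIO(raw) as fp:
    msg_from_file = email.message_from_binary_file(fp)

# Was: Generator(text_file).flatten(msg)
out = BytesIO()
BytesGenerator(out).flatten(msg)
wire = out.getvalue()  # bytes, with the 8bit body carried through
```

Programs that only ever see 7bit-clean input could presumably keep using the string-based calls unchanged.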

Actually, in context we were not talking about a random character that came in from outside, we were talking about U+FFFD that we generated, and know that it's the only non-ASCII character in the string because we replaced all the others with it.

Ah, so that was what you were suggesting.

Of course the best we can do with 'From: =?UNKNOWN?Q?p=C3=B6stal' or 'From: p\xc3\xb6stal' on input is to save the encoded or raw bytes representation and spit it back out on output.

Yes. And I haven't actually dealt with what to do with non-ascii characters or RFC2047 unknown-8bit characters when decoding headers in email6. In issue 6302 we are talking about adding a decode_header_to_string method for email5 where the same issue arises, and so we'll need to make a decision soon. Presumably we'll use U+FFFD to replace them (along with registering defects in email6).

The MIME-charset = UNKNOWN dodge might be a better way of handling this. The str is all ASCII, so won't raise exceptions unless the app itself objects to MIME encoded-words for some reason. OTOH, the presence of encoded words will be a red flag to any human viewer, and after processing with .flatten(), the receiver is likely to DTRT (from the receiving human's point of view, per that human's configuration).

That is a very interesting idea. It is the right thing to do, since it would mean that a message parsed as bytes could be generated via Generator and passed to, say, smtplib without losing any information. However, it's not exactly trivial to implement, since issues of runs of characters and line re-wrapping need to be dealt with. Perhaps Header can be made to handle bytes in order to do this; I'll have to look into it.
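To make the idea concrete, here is a minimal sketch of wrapping undecodable bytes in a single RFC 2047 Q-encoded word. The function name encode_unknown_word is hypothetical, and the "unknown-8bit" charset label (from RFC 1428) is an assumption about what the UNKNOWN placeholder would become. It deliberately ignores the hard parts named above: runs of characters, the encoded-word length limit, and line re-wrapping.

```python
def encode_unknown_word(raw: bytes, charset: str = "unknown-8bit") -> str:
    """Return an all-ASCII Q-encoded word that preserves every input byte.

    Hypothetical helper, not part of the email package.
    """
    pieces = []
    for byte in raw:
        ch = chr(byte)
        if byte < 128 and ch.isalnum():
            # ASCII letters and digits may appear literally.
            pieces.append(ch)
        elif ch == " ":
            # Q encoding represents a space as an underscore.
            pieces.append("_")
        else:
            # Everything else, including all 8bit bytes, is hex-escaped.
            pieces.append("=%02X" % byte)
    return "=?%s?q?%s?=" % (charset, "".join(pieces))


header_value = encode_unknown_word(b"p\xc3\xb6stal")
# header_value == "=?unknown-8bit?q?p=C3=B6stal?="
```

Because the result is pure ASCII, it can live in a str header without raising, and the original bytes remain recoverable from the hex escapes.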

> So you are suggesting that I should use U+FFFD encoded as UTF-8 rather than '?' as the substitution character? But earlier you said that people would probably rather not be forced to deal with Unicode just because there are invalid bytes in the message. So that's probably not what you meant.

"Suggest" != "recommend". Talking to a wider base of users and developers, you might or might not find that to be a good idea. I don't think the 800 million or so Chinese coming online in the next decade will much care whether you use U+FFFD or '?'. The Japanese would prefer U+2639 WHITE FROWNING FACE or U+270C VICTORY HAND, no doubt ("crassly cute" is much beloved here). Americans will likely prefer '?', as they probably have correspondents with legacy systems that won't like UTF-8 or perhaps don't have a font to display U+FFFD.

For the moment I think I'll stick with '?', with the idea of "fixing that bug" by using the unknown charset trick at a later stage.
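The two substitution choices under discussion can be compared with a small sketch. The helper name substitute_non_ascii is hypothetical and this is not the patch's actual code; it only shows what each choice does to the earlier 'p\xc3\xb6stal' example.

```python
def substitute_non_ascii(raw: bytes, replacement: str = "?") -> str:
    """Decode as ASCII, replacing each undecodable byte with `replacement`.

    Hypothetical helper for illustration only.
    """
    # errors="replace" yields one U+FFFD per undecodable byte.
    text = raw.decode("ascii", errors="replace")
    return text.replace("\ufffd", replacement)


with_question_mark = substitute_non_ascii(b"p\xc3\xb6stal")        # 'p??stal'
with_fffd = substitute_non_ascii(b"p\xc3\xb6stal", "\ufffd")       # 'p\ufffd\ufffdstal'
```

Either way the original bytes are lost, which is why the unknown charset trick is the more attractive long-term fix.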

> Presumably you are suggesting that email5 be smart enough to turn my example into properly UTF-8/CTE encoded text.

No, in general that's undecidable without asking the originator, although humans can often make a good guess. But not always: Japanese are fond of "four-character compound words", and I once found an 8-byte sequence (four 2-byte characters) that is idiomatic in both Shift JIS and EUC-JP. Even a dictionary lookup can't determine the intended encoding for that sequence.

I was talking about unicode input, though, where you do know (modulo the language differences that unicode hasn't yet sorted out).

I'm only saying that any Unicode email-N generates itself can be properly encoded.

Agreed.

> But that problem is what email6 is trying to address. It just doesn't look practical to address it directly in the email5 code base, because the email4 codebase that email5 inherits does not provide the correct distinction between bytes and text. email5 is parsing the input stream as if it were ASCII-only CTE text.

I don't see how this is different from email6. Just because email6 is trying to DTRT doesn't mean the spammers will, and even Emacs MUA developers occasionally screw this up in new products. So email-N has to handle input streams that are supposed to be entirely ASCII except for message bodies that are properly marked as 8bit or binary CTE, but occasionally will not conform.

Right, but I was talking about my python3 example, where I was using the email5 parser to (unsuccessfully) parse unicode. That's the thing email5 can't really handle, but email6 will be able to.

> Extending it to actually handle unicode input is a whole different kettle of sushi[*].

But this is not your problem in email5 AFAICS.

Right, but I thought you were suggesting it was. My mistake.

> [*] And I've had an argument with someone who thinks email should not be extended to handle unicode messages with non-ASCII content, on the grounds that they aren't really email.

That's total nonsense. Don't argue with people like that, educate them, and if that fails, ignore them. There's good reason for not extending email5, ie, email4 didn't do it. But that has nothing to do with what email "really is".

[ snip good supporting text ]

In practice, email undoubtedly has clients that want to manipulate bytes directly. I can't blame them, but the RFCs have nothing to say about that, really. RFC 822 and its family (including MIME) are about representing human media as octet streams compatible with protocols like RFC 821, and in Python the human medium for representing text is str. The result of bytes manipulations should be "as if" the original stream was decoded, manipulated, and reencoded. So direct bytes manipulation is an optimization. The RFCs don't provide for it at all, AFAICS.

The same thing is true of URIs, except that RFC 3986 makes it fully explicit that URIs are conceptually text, not octets. Again, there are many important use cases for bytes manipulation of URIs, but this is an optimization.
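The "as if" rule can be checked on a toy example. This is only a sketch: the header line is made up, and the use of the ASCII codec with the surrogateescape error handler as the lossless decode step is an assumption for the illustration, not something stated in this message.

```python
# Hypothetical header line containing raw 8bit bytes.
raw = b"Subject: p\xc3\xb6stal news\n"

# Optimization: manipulate the bytes directly.
direct = raw.replace(b"news", b"update")

# Reference semantics: decode, manipulate as text, re-encode.
# surrogateescape maps each undecodable byte to a lone surrogate and back,
# so the non-ASCII bytes survive the round trip unchanged.
text = raw.decode("ascii", errors="surrogateescape")
via_text = text.replace("news", "update").encode("ascii", errors="surrogateescape")

same_result = direct == via_text  # the bytes shortcut matches the text path
```

When the two paths agree like this, the bytes manipulation is a legitimate shortcut; when they would disagree, the text path defines the correct answer.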

Thank you very much for this piece of perspective. I hadn't thought about it that clearly before, but what you say makes perfect sense to me, and is in fact the implicit perspective I've been working from when working on the email6 stuff.

-- R. David Murray www.bitdance.com


