[Python-Dev] Patch making the current email package (mostly) support bytes

R. David Murray rdmurray at bitdance.com
Wed Oct 6 18:18:03 CEST 2010


On Wed, 06 Oct 2010 12:22:18 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:

Nick Coghlan writes:

> - if you pass in bytes data and know what you are doing, then you can
> access that raw bytes data and do your own decoding

At what level, though? To take an interesting example I used to see frequently:

    From: taro at tokyo.jp (Taro Yamada in 8-bit Shift JIS)

So I guess you are suggesting that the email module can RFC 822 parse that, and

1. Refuse to return the unwrapped (ie, single line) form of the whole field, except as bytes.
2. Refuse to return the content of the From field, except as bytes.
3. Return the email address parsed from the From field.
4. Refuse to return the comment, except as bytes.

  1. Return the content, with non-ASCII bytes replaced with ? characters.
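A minimal sketch of that sanitizing behaviour, assuming the raw header bytes have been decoded with Python's `surrogateescape` error handler. The sample value and variable names are illustrative only and are not taken from the patch:

```python
# Illustrative raw header value: an ASCII address plus an 8-bit Shift JIS comment.
raw = b"taro@tokyo.jp (" + "山田太郎".encode("shift_jis") + b")"

# Decode without losing information: every non-ASCII byte becomes a lone surrogate.
escaped = raw.decode("ascii", "surrogateescape")

# Sanitize: lone surrogates cannot be encoded as ASCII, so the "replace"
# handler substitutes one "?" for each of them.
sanitized = escaped.encode("ascii", "replace").decode("ascii")

print(sanitized)
```

Note that Shift JIS trail bytes can fall in the ASCII range, so the sanitized comment may contain stray ASCII letters between the `?` characters.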

In other words, my proposed patch only makes email5 1/8 to 1/4 broken, instead of half broken as it is now. But not un-broken enough for Mailman, it sounds like.

That's fine. But suppose I have a private or newly defined header that is structured? Now I have two choices:

1. Write a version of my private parser for both str (the normal case) and bytes (if accessing the value as str raises)
2. Always get the bytes and convert them to str (probably using the same .decode('ascii','surrogate-escape') call that email uses but won't let me have the value of!), then use a common str parser.

Yes, this is exactly the dilemma faced by the entire email package. The current email6 code attempts to do a variation on (1) by having a common parser that handles both strings and bytes using a dual subclass approach. This patch is trying out (2). If you have a private header parser, you would ideally like to be able to use the same mechanism as the email package to solve the problem. For email6 you'd be able to register your header parser and get handed the input like the built in parser and be able to use the tools provided by the built in parser to do your work.
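Option (2) can be sketched roughly as follows. The header format and `parse_private_header` are hypothetical, invented here for illustration; the only real API used is the `surrogateescape` error handler (spelled without a hyphen in Python):

```python
def parse_private_header(value: str) -> dict:
    """Hypothetical str-only parser for a 'key=value; key=value' header."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.partition("=")
        if sep:
            fields[key.strip()] = val.strip()
    return fields


# Illustrative raw header body containing 8-bit Shift JIS data.
raw = b"id=42; name=" + "山田".encode("shift_jis")

# Always decode to str, smuggling the 8-bit bytes through as lone surrogates.
text = raw.decode("ascii", "surrogateescape")
fields = parse_private_header(text)

# The escaped bytes survive the str parser and can be recovered exactly...
name_bytes = fields["name"].encode("ascii", "surrogateescape")
# ...then decoded once the real charset is known (assumed here to be Shift JIS).
name = name_bytes.decode("shift_jis")
```

The single str parser never needs to know whether its input originally came from bytes; only the final step, which requires knowing the charset, deals with the escaped data.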

In email5 there is no way that I know of for you to register a private parser, so you need access to the raw input for the header in one form or another.

If we go this route (as opposed to only handling headers with 8bit data by sanitizing them), then we need to think about the email5 header parsers as well (decode_header and parseaddr). They are of course going to have the same problems as the rest of the email package with parsing bytes, and you are suggesting that access to those header 8bit bytes is needed.

One option would be to add a keyword to the get and get_all methods that instructs it to return the string with the surrogate-escaped bytes, which can then be passed onward to decode_header, parseaddr, or a custom decoder. Then I need to look at what needs to be added to those methods to handle the escaped bytes, and from what you say they too need a keyword telling them to preserve the escaped bytes on output (a "yes I know what I'm doing" flag...'preserve_escaped_bytes=True'?).
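To make the idea concrete, here is a rough sketch of how such a keyword might behave. `HeaderStore` is a hypothetical stand-in written for this illustration, not the email package's Message class, and the flag name is only the one floated above:

```python
class HeaderStore:
    """Hypothetical stand-in for a message's header list; not the email API."""

    def __init__(self):
        self._headers = []  # (name, surrogate-escaped str) pairs

    def add_raw(self, name, raw):
        # Keep 8-bit bytes as lone surrogates so nothing is lost.
        self._headers.append((name, raw.decode("ascii", "surrogateescape")))

    def get(self, name, default=None, preserve_escaped_bytes=False):
        for key, value in self._headers:
            if key.lower() == name.lower():
                if preserve_escaped_bytes:
                    # "Yes, I know what I'm doing": return surrogates untouched.
                    return value
                # Default: sanitize by replacing each escaped byte with "?".
                return value.encode("ascii", "replace").decode("ascii")
        return default


store = HeaderStore()
store.add_raw("From", b"taro (\x8e\x52)")
safe = store.get("From")
escaped = store.get("From", preserve_escaped_bytes=True)
```

Callers that never pass the flag only ever see sanitized ASCII, while a custom decoder can ask for the escaped form and recover the original bytes with `escaped.encode("ascii", "surrogateescape")`.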

Note that this is more problematic than it looks, since the appropriate base codec may require information from higher-level structures (eg, qp codec tags or a Content-Type header's charset field).

You'll have to give me an example of where this is a problem but is not already a problem in email4.

Why should I reproduce email's logic here? I don't care if the default or concise API raises on surrogates in the str value. But I'm pretty sure that I will want to use str values containing surrogates in these contexts (for the same reasons that email module does, for example), rather than work with bytes sometimes and strs sometimes.

Please provide a way to return strs-with-surrogates if I ask for them.

Does my proposal make sense? But note, it raises exactly the backward compatibility concerns you mention in your next email (that I will reply to next). It is an open question whether it is worth opening that door in order to be able to do extended handling on non-RFC conforming email (as opposed to just sanitizing it and soldiering on).

-- R. David Murray www.bitdance.com


