[Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray rdmurray at bitdance.com
Sun Oct 3 01:00:27 CEST 2010
A while back on some issue or another I remember telling someone that if there was any sort of clever hack that would allow the current email package (email5) to work with bytes we would have implemented it.
Well, I've come up with a clever hack.
The idea came out of a conversation with Antoine. I was saying that it was ironic that Unicode could only be used as a 7bit-clean data transmission channel for email, and he remarked that by using surrogate escape you could use unicode as a transmission channel for 8bit data. At first I dismissed this observation as irrelevant to email, since email has to transform the 8bit data at some point.
But I started thinking. And then I started experimenting. And it turns out that it works.
The clever hack (thanks ultimately to Martin) is to accept 8bit data by decoding it using the ASCII codec and the surrogateescape error handler. Then, inside the email module at any point where bytes might be meaningful or might be about to escape, it can check to see if there are any surrogates and act accordingly.
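As a minimal sketch of the mechanism (not code from the patch), the following shows how the surrogateescape error handler carries 8bit data through a str and how the surrogates can be detected afterwards. The helper name has_surrogates is hypothetical and chosen only for illustration.

```python
# Latin-1 encoded bytes; assume the charset is unknown to us at this point.
raw = b"caf\xe9 au lait"

# ASCII bytes decode to themselves; each non-ASCII byte 0xNN becomes the
# lone surrogate code point U+DCNN instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")


def has_surrogates(s):
    """Hypothetical helper: True if s contains any escaped 8bit byte."""
    return any("\udc80" <= ch <= "\udcff" for ch in s)


# Encoding with the same codec and handler recovers the original bytes.
restored = text.encode("ascii", errors="surrogateescape")
```

Because the escaped bytes always land in the range U+DC80 to U+DCFF, a simple range check is enough to tell whether a string is still carrying undecoded binary data.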
The API additions are few, and in fact for most programs (he says bravely, not really knowing) there are really only two changes you need to make when converting a program that handles bytes data to py3k. The first is the decoding of binary input data with surrogateescape, as mentioned. The second is that when you want to get the bytes back out, you use the new BytesGenerator instead of Generator. BytesGenerator is just like Generator except that it writes bytes to its file argument instead of strings, and it recovers any bytes that were in the original input.
So given this sequence:
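    import email
    import email.generator
    # 8bit bytes survive the ASCII decode as surrogate escapes.
    msg = email.message_from_file(open('myfile',
                                       encoding='ascii',
                                       errors='surrogateescape'))
    email.generator.BytesGenerator(open('myfile2', 'wb')).flatten(msg)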
myfile and myfile2 will theoretically be identical (modulo universal newline and _mangle_from issues).
I've additionally added a 'message_from_bytes' convenience function.
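The same round trip can be sketched in memory with message_from_bytes and BytesGenerator. This assumes the API as it ended up in the Python 3 standard library; the sample message and the variable names are made up for illustration.

```python
import email
import email.generator
import io

# A made-up message with an 8bit Latin-1 body and "\n" line endings.
raw = (
    b"From: a@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="latin-1"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9\n"
)

# Parse directly from bytes; the surrogateescape decoding happens internally.
msg = email.message_from_bytes(raw)

# Serialize back to bytes. mangle_from_=False avoids rewriting body lines
# that start with "From ".
out = io.BytesIO()
email.generator.BytesGenerator(out, mangle_from_=False).flatten(msg)
flattened = out.getvalue()
```

For an unmodified message like this one, flattened should match raw byte for byte.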
One nice feature of this patch is that once you've got the model built from surrogateescaped input, if you do a get_payload() on a message body whose Content-Transfer-Encoding is '8bit' you will get the body decoded to unicode using the charset declared in the Content-Type header (assuming Python supports that charset).
You can always get at the bytes version of the body of a message part by using get_payload(decode=True) [*]. You can't really get at the bytes version of message headers, though...for safety if you access a header whose value contains non-ASCII chars (that aren't RFC2047 encoded to be ASCII) the 8bit characters get replaced with '?'s. (But BytesGenerator will emit the original 8bit characters if the headers haven't been modified.)
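A short sketch of the two ways of reading the body, assuming the behavior of get_payload() in the released Python 3 email package; the sample message is invented for illustration.

```python
import email

# Made-up message: 8bit body declared as Latin-1.
raw = (
    b"From: a@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: text/plain; charset="latin-1"\n'
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"caf\xe9\n"
)
msg = email.message_from_bytes(raw)

# Default call: a str, decoded using the charset from the Content-Type header.
text = msg.get_payload()

# decode=True undoes the Content-Transfer-Encoding and returns bytes; for
# '8bit' there is nothing to undo, so these are the original body bytes.
body_bytes = msg.get_payload(decode=True)
```

Here text is a str and body_bytes is a bytes object, which is the counterintuitive naming that footnote [*] describes.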
I do not propose that this is a good API, since it has the classic problem that if there are coding bugs in the email module strings may "escape" that have surrogates in them and we end up with programs that work most of the time....except when they fail with mysterious errors because of unusual bytes input data. On the other hand you always know when you have bytes data in an unknown encoding (because they are surrogate escaped), so it is ever so much better than the Python2 situation.
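To illustrate the failure mode described above, here is a minimal sketch of what happens when a surrogate escaped string leaks out and later meets a strict codec. The helper name try_utf8 is hypothetical.

```python
# A string that still carries an escaped 8bit byte (0xE9 became U+DCE9).
escaped = b"caf\xe9".decode("ascii", errors="surrogateescape")


def try_utf8(s):
    """Hypothetical helper: encode to UTF-8, or return None on failure."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates are not valid in UTF-8, so strict encoding fails.
        return None


leaked = try_utf8(escaped)   # fails only because of the unusual input byte
clean = try_utf8("cafe")     # pure ASCII input encodes without trouble
```

The same code path works for ASCII-only input and fails only when unusual bytes were present, which is why such bugs can stay hidden for a long time.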
The advantage of this patch is that it means Python3.2 can have an email module that is capable of handling a significant proportion of the applications where the ability to process binary email data is required.
I've uploaded the patch to issue 4661 (http://bugs.python.org/issue4661). I uploaded it to rietveld as well just before Martin's announcement. After the announcement I uploaded the svn patch to the tracker, so hopefully there will be an automated review button as well. Here is your chance to exercise the new review tools :)
This patch does break two of Barry's patch-for-review rules: it is more than 800 lines of diff (but not a lot more, and less than 800 if you count only code diff and not docs), and it did not have a very extensive design discussion beforehand. I did talk with people on IRC, particularly Barry, before finishing the patch, and I did post a summary to the email-sig mailing list (but got no response).
Now it is time to see what the wider community thinks. There is some question of whether this is a bending of the string/bytes separation that doesn't belong as part of the standard library, but after working my way through it I think it is a fairly clean hack[**], and most likely a case where practicality beats purity.
Regardless of whether or not this patch or a descendant thereof is accepted I still intend to continue working on email6. There are many other bugs in the current email package that require a rewrite of parts of its infrastructure, and the email-sig is agreed that the email API needs revision quite apart from the bytes/string issues. However, there is something pleasing about the simplicity of this way of handling bytes that I intend to consider carefully while we work further on email6.
--David
[*] It is counterintuitive that 'decode=True' gives you bytes and 'decode=False' gives you strings, but in this case 'decode' refers to the Content-Transfer-Encoding...and this confusion is one of the reasons I think the email API needs a big overhaul.
[**] There are a couple of places where generator pokes into the internals of Message in a way it hasn't before, but this could be fixed by defining a 'bytes access' API on Message, which would probably be a good idea anyway. There is also the possibility of wrapping up the 'ascii+surrogateescape' stuff inside APIs that accept input data, to hide that 'implementation detail' from the email package user.