[Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray rdmurray at bitdance.com
Wed Oct 6 19:09:25 CEST 2010
On Wed, 06 Oct 2010 22:55:00 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
R. David Murray writes:
> version of headers to the email5 API, but since any such data would
> be non-RFC compliant anyway, [access to non-conforming headers by
> reparsing the bytes] will just have to be good enough for now.
But that's potentially unpleasant for, say, Mailman. AFAICS, what you're saying is that Mailman will have to implement a full header parser and repair module, or shunt (and wait for administrator intervention on) any mail that happens to contain even one byte of non-RFC-conforming content in a header it cares about. (Note that
No, it just means that such bytes would not be preserved for presentation in the web UI. They'd show up as '?'s (or U+FFFDs, perhaps, if I change DecodedGenerator to use U+FFFD instead of ?s for the unknown bytes). As long as BytesGenerator is used on the output side to send the messages, the bytes will be preserved and presented to the moderator in their email.
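As a rough illustration of the byte-preserving path described above, here is a minimal sketch using message_from_bytes and BytesGenerator as they exist in today's standard library. The patch under discussion may have differed in detail, and the sample message is made up for the example.

```python
from email import message_from_bytes
from email.generator import BytesGenerator
from io import BytesIO

# Made-up sample: a header carrying one raw Latin-1 byte (0xF6),
# which makes the message non-RFC-compliant.
raw = b"From: p\xf6stal\n\nbody\n"

# Parsing does not raise; the odd byte is carried along internally.
msg = message_from_bytes(raw)

# Flatten back to bytes through BytesGenerator.
buf = BytesIO()
BytesGenerator(buf).flatten(msg)
roundtrip = buf.getvalue()

# The non-ASCII byte should come out exactly as it went in.
print(roundtrip == raw)
```

Nothing here touches the header value, which is the case where the bytes survive; modifying the header is where the loss discussed in this thread comes in.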
So the only parsing issue is if Mailman cares about the non-ASCII bytes in the headers it cares about. If it has to modify headers that contain non-ASCII bytes (for example, addresses and Subject) and cares about preserving the non-ASCII bytes, then there is indeed an issue; see previous email for a possible way around that.
we're not talking about moderator-level admins here; we're talking about the Big Cheese with access to the command line on the list host.) That's substantially worse than the current system, where (in theory, and in actual practice where it distributes its own version of email) it can trap the Unicode exception on a per-header basis.
I thought Mailman no longer distributed its own version of email? And the email API currently promises not to raise during parsing, which is a contract my patch does not change.
I also worry about the implications for backwards compatibility. Eventually email-N needs to handle non-conforming mail in a sensible way, or anybody who gets spam (ie, everybody) and wants a reliable email system will need to implement their own. If you punt completely on handling non-conforming mail now, when is it going to be done? And
We're (in the current patch) not punting on handling non-conforming email, we're punting on handling non-conforming bytes if the headers that contain them need to be modified. The headers can still be modified, you just (currently) lose the non-ASCII bytes in the process.
when it is done, will the backward-compatible interface be able to access the robust implementation, or will people who want robust APIs have to use rather different ones? The way you're going right now, I have to worry about the answer to the second question, at least.
Well, this is still theory given the current state of the email6 code, but I think that working email5 code, even after this patch, will continue to work using email6's backward compatibility interface. And robustness is not the issue; only extended-beyond-the-RFCs handling of non-conforming bytes would be an issue.
But, as I implied in my previous email, if we allow the surrogates out so that custom header parsers can use them, then making that code continue to work may require an extra layer in the compatibility interface to produce the surrogateescaped strings. Still, at the moment I can't see any theoretical reason why that would not be possible, so it may be worth the risk.
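For readers unfamiliar with the mechanism, a small sketch of what a surrogateescaped string looks like, using only the built-in codec error handlers. The sample bytes are made up, and this shows the general technique rather than the patch's internals.

```python
# Made-up sample: an ASCII string with one stray Latin-1 byte.
raw = b"p\xf6stal"

# 'surrogateescape' smuggles each undecodable byte into the str
# as a lone surrogate code point (0xF6 becomes U+DCF6).
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = text.encode("ascii", errors="surrogateescape")

# Encoding with 'replace' instead loses the byte and leaves a '?'.
lossy = text.encode("ascii", errors="replace")

print(repr(text), restored, lossy)
```

A custom header parser handed such a string can recover the original bytes with the same error handler, which is why letting the surrogates out would be useful to it.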
> [*] Why '?' and not the unicode invalid character character? Well, the
> email5 Generator.flatten can be used to generate data for transmission over
> the wire if the source is RFC compliant and 7bit-only, and this would
> be a normal email5 usage pattern (that is, smtplib.SMTP.sendmail expects
> ASCII-only strings as input!). So the data generated by Generator.flatten
> should not include unicode...
I don't understand this at all. Of course the byte stream generated by Generator.flatten won't contain Unicode (in the headers, anyway); it will contain only ASCII (that happens to conform to QP or Base64 encoding of Unicode in some appropriate UTF in many cases). Why is U+FFFD REPLACEMENT CHARACTER any different from any other non-ASCII character in this respect? (Surely you are not saying that Generator.flatten can't DTRT with non-ASCII content at all?)
Yes, that is exactly what I am saying:
>>> m = email.message_from_string("""\
... From: pöstal
...
... """)
>>> str(m)
Traceback (most recent call last):
  ....
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 1: ordinal not in range(128)
Remember, email5 is a direct translation of email4, and email4 only handled ASCII and oh-by-the-way-if-there-are-bytes-along-for-the-ride-fine-we'll-pass-them-along. So if you want to put non-ASCII data into a message you have to encode it properly to ASCII in exactly the same way that you did in email4:
>>> m = email.message.Message()
>>> m['From'] = email.header.Header("pöstal", charset='utf-8')
>>> str(m)
'From: =?utf-8?q?p=C3=B6stal?=\n\n'
The only thing I can think of is that you might not want to introduce non-ASCII characters into a string that looks like it might simply be corrupted in transmission (eg, it contains only one non-ASCII byte). That's reasonable; there are a lot of people who don't have to deal with anything but ASCII and occasionally Latin-1, and they don't like having Unicode crammed down their throats.
> which raises a problem for CTE 8bit sections
> that the patch doesn't currently address.
AFAIK, there's no requirement, implied or otherwise, that a conforming implementation produce CTE 8bit. So just don't do that; that will keep smtplib happy, no?
Yes, exactly. I need to fix the patch to recode using, say, quoted-printable in that case. DecodedGenerator could still produce the unicode, though, which is what I believe we want. (Although that raises the question of whether DecodedGenerator should also decode the RFC2047 encoded headers....but that raises a backward compatibility issue).
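To make the recoding idea concrete, here is a minimal sketch of turning 8bit content into quoted-printable with the standard quopri module. It only illustrates the encoding itself; how the patch would hook this into Generator is not shown, and the sample text is made up.

```python
import quopri

# Made-up sample: UTF-8 text containing non-ASCII (8bit) bytes.
body_8bit = "pöstal".encode("utf-8")

# Quoted-printable turns each non-ASCII byte into an =XX escape,
# so the result is pure ASCII and safe for a 7bit transport.
encoded = quopri.encodestring(body_8bit)

# The encoding is reversible, so no content is lost.
decoded = quopri.decodestring(encoded)

print(encoded, decoded == body_8bit)
```

A section recoded this way would also need its Content-Transfer-Encoding header changed from 8bit to quoted-printable.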
--David