Issue 31820: Calling email.message.set_payload twice produces an invalid eml (original) (raw)

Example:

In [52]: import email.message

In [53]: m = email.message.Message()

In [54]: m.set_payload('abc', 'utf8')

In [55]: m.get_payload() # correctly encoded Out[55]: 'YWJj\n'

In [56]: m.set_payload('abc', 'utf8')

In [57]: m.get_payload() # no more encoding? Out[57]: 'abc'

In [58]: m.get_payload(decode=True) # wut? Out[58]: b'i\xb7'

In [59]: print(str(m)) MIME-Version: 1.0 Content-Type: text/plain; charset="utf8" Content-Transfer-Encoding: base64

abc

While the first set_payload correctly encodes and sets the message's Content-Transfer-Encoding, the second call doesn't properly encode the payload according to its existing Content-Transfer-Encoding. Tested on 3.6, 3.5 and 2.7.

email.message.set_payload does not directly encode the payload, instead email.message.set_charset does, around line 353: https://github.com/python/cpython/blob/b067c8fdd1e205bd0411417b6d5e4b832c3773fc/Lib/email/message.py#L353-L368

In both invocations of set_payload, the payload is not encoded according to the encoding. On the first invocation, the CTE header is correctly set according to charset.get_body_encoding (#354 and #368) and the payload is encoded (#356 or #367, the latter in this case).

On the second invocation, the CTE header is already set, so the payload is never encoded.

This is especially dangerous when passing decode=True to get_payload after the 2nd set_payload, as that may throw an error in some cases (trying to base64 decode a string which makes no sense to it. that's how I arrived on this bug, but I can't for the life of me replicate an exception).

This is a bit finicky to fix. If we change set_charset to always encode the current payload, we risk double-encoding when set_charset is not called through set_payload. However if set_charset tries to decode the payload, it will produce incorrect results when it is called through set_payload. urgh.

We can move the actual encoding code away from set_charset, either into set_payload or a third function, but that'll break any code calling set_payload without a charset and then calling set_charset. urgh.

One possible solution is for both set_charset and set_payload to call a third function, e.g. _encode_payload. Perhaps something like (pseudocode):

def set_payload(self, payload, charset): # ... if 'Content-Transfer-Encoding' in self: self._payload = self._encode_payload(payload) self.set_charset(charset) # ...

def set_charset(self, charset): # ... if 'Content-Transfer-Encoding' not in self: self._payload = self._encode_payload() self.add_header(...)

def _encode_payload(self, payload): # code in lines 353-366

This way, set_charset handles the cases where CTE was never defined, and set_payload makes sure to encode the payload when a CTE is present. It may work, but something about this gives me unrest. For example, if you set_charset once and it encodes your payload as base64, and you set_charset again with a charset whose get_body_encoding returns qp, the payload would still be base64 even though it should be qp. urgh.

Is this a big enough concern? Is there a superior approach I'm missing?

Thanks in advance!

On irc, bitmancer suggested that this problem is already solved by the email.message.EmailMessage class, as it is:

In [119]: m = email.message.EmailMessage()

In [120]: m.set_content('abc', 'utf8', cte='base64')

In [121]: m.get_payload() Out[121]: 'YWJjCg==\n'

In [122]: m.set_content('abc', 'utf8', cte='base64')

In [123]: m.get_payload() Out[123]: 'YWJjCg==\n'

In [124]: m.get_payload(decode=True) Out[124]: b'abc\n'

In [125]: print(m) MIME-Version: 1.0 Content-Type: text/utf8; charset="utf-8" Content-Transfer-Encoding: base64

YWJjCg==

Because this isn't a critical bug and email.message.Message is quite deprecated, and this is solved by a newer API, this bug may not need addressing.