[Python-Dev] Patch making the current email package (mostly) support bytes

Nick Coghlan ncoghlan at gmail.com
Tue Oct 5 14:05:33 CEST 2010


On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> R. David Murray writes:
>  > Only if the email package contains a coding error would the
>  > surrogates escape and cause problems for user code.
>
> I don't think it is reasonable to internalize surrogates that way; some applications will want to look at them and do something useful with them (delete them or replace them with U+FFFD or ...).  However, I argue below that the presence of surrogates already means the user code is under fire, and this puts the problem in a canonical form so the user code can prepare for it (if that is desirable).

Hang on here, this objection doesn't seem to quite mesh with what RDM is proposing (and the similar trick I am considering for urllib.parse).

The basic issue is having an algorithm that is designed to operate on character data and depends on multiple ASCII constants stored as str objects.

In Python 2.x, those algorithms could innately operate on str objects in any ASCII compatible encoding, as well as on unicode objects (due to the implicit promotion of the ASCII constants to unicode when unicode input was encountered).

In Py3k, that trick broke. Now those algorithms only operate on str objects, and bytes input fails, even when it uses an ASCII compatible encoding.
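A minimal sketch of that failure, assuming a made-up helper called split_scheme (hypothetical, not taken from the email package or urllib.parse) that depends on a single ASCII constant stored as a str object:

```python
# Hypothetical helper, for illustration only. It depends on an ASCII
# constant stored as a str object, like the algorithms discussed above.
SEPARATOR = ":"

def split_scheme(url):
    """Split 'scheme:rest' at the first separator."""
    scheme, _, rest = url.partition(SEPARATOR)
    return scheme, rest

print(split_scheme("http://example.com"))   # str input works

try:
    split_scheme(b"http://example.com")     # bytes input, pure ASCII
except TypeError as exc:
    # bytes.partition() refuses a str separator in Python 3
    print("bytes input fails:", exc)
```

The bytes call fails even though every byte in the input is plain ASCII, because Python 3 never implicitly mixes str and bytes.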

For urllib.parse, the external API will be "str in -> str out, bytes in -> bytes out". Whether that is internally implemented by duplicating all the ASCII constants with both bytes and str flavours (as my current patch does), or implicitly (and temporarily) "decoding" the bytes values using ascii+surrogateescape or latin-1 (a pair of alternative approaches I plan to explore soon) should be completely transparent to the user of the API. If a user can easily tell which of these I am doing just through the external behaviour of the documented API, then I'll have made a mistake somewhere.
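As a rough sketch of those alternatives (not the actual urllib.parse patch; split_scheme_dup, split_scheme_decode and _split_scheme_str are hypothetical names invented for this example), the same toy algorithm can be written with duplicated constants or with a temporary decode step:

```python
# Sketch only: every name here is hypothetical, not a urllib.parse internal.

def split_scheme_dup(url):
    """Approach 1: keep both str and bytes flavours of the constant."""
    sep = ":" if isinstance(url, str) else b":"
    scheme, _, rest = url.partition(sep)
    return scheme, rest

def _split_scheme_str(url):
    """The str-only algorithm, written once."""
    scheme, _, rest = url.partition(":")
    return scheme, rest

def split_scheme_decode(url, codec="ascii", errors="surrogateescape"):
    """Approach 2: temporarily 'decode' bytes, run the str algorithm,
    then encode the results back with the same codec and error handler."""
    if isinstance(url, str):
        return _split_scheme_str(url)
    text = url.decode(codec, errors)
    return tuple(part.encode(codec, errors)
                 for part in _split_scheme_str(text))

# A non-ASCII byte in an otherwise ASCII-compatible input.
raw = b"http://example.com/caf\xe9"

# All three variants should be indistinguishable from the outside.
assert split_scheme_dup(raw) == split_scheme_decode(raw)
assert split_scheme_dup(raw) == split_scheme_decode(raw, "latin-1", "strict")
assert split_scheme_decode("http://example.com") == ("http", "//example.com")
```

Both ascii+surrogateescape and latin-1 round-trip every byte value unchanged, so in this sketch a caller only ever sees "str in -> str out, bytes in -> bytes out", and no surrogates leave the function.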

My understanding is that email6 in 3.3 will essentially follow that same model. What I believe RDM is suggesting is an in-between approach for the 3.2 email module:

I've probably grossly oversimplified what RDM is suggesting, but it sounds plausible as a useful interim stepping stone to the more comprehensive type separation in email6.

Cheers, Nick.

-- Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


