[Python-Dev] Patch making the current email package (mostly) support bytes

Nick Coghlan ncoghlan at gmail.com
Tue Oct 5 14:05:33 CEST 2010

On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> R. David Murray writes:
>  > Only if the email package contains a coding error would the
>  > surrogates escape and cause problems for user code.
>
> I don't think it is reasonable to internalize surrogates that way; some applications will want to look at them and do something useful with them (delete them or replace them with U+FFFD or ...). However, I argue below that the presence of surrogates already means the user code is under fire, and this puts the problem in a canonical form so the user code can prepare for it (if that is desirable).

Hang on here, this objection doesn't seem to quite mesh with what RDM is proposing (and the similar trick I am considering for urllib.parse).

The basic issue is having an algorithm that is designed to operate on character data and depends on multiple ASCII constants stored as str objects.

In Python 2.x, those algorithms could innately operate on str objects in any ASCII compatible encoding, as well as on unicode objects (due to the implicit promotion of the ASCII constants to unicode when unicode input was encountered).

In Py3k, that trick broke. Now those algorithms only operate on str objects, and bytes input fails, even when it uses an ASCII compatible encoding.
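To make the breakage concrete, here is a minimal sketch. The helper `split_scheme` is hypothetical and only stands in for any algorithm that relies on an ASCII constant stored as a str; it is not code from urllib.parse or email.

```python
def split_scheme(url):
    """Hypothetical helper: split 'scheme:rest' using a str constant."""
    # The ":" separator is a str, so only str input is accepted.
    scheme, _, rest = url.partition(":")
    return scheme, rest


# str input works as expected.
print(split_scheme("http://example.com/"))

# bytes input fails even though the data is pure ASCII, because
# Python 3 never implicitly mixes bytes with the str constant.
try:
    split_scheme(b"http://example.com/")
except TypeError as exc:
    print("bytes input rejected:", exc)
```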

For urllib.parse, the external API will be "str in -> str out, bytes in -> bytes out". Whether that is internally implemented by duplicating all the ASCII constants with both bytes and str flavours (as my current patch does), or implicitly (and temporarily) "decoding" the bytes values using ascii+surrogateescape or latin-1 (a pair of alternative approaches I plan to explore soon) should be completely transparent to the user of the API. If a user can easily tell which of these I am doing just through the external behaviour of the documented API, then I'll have made a mistake somewhere.
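As a rough illustration of the implicit "decode" approach, the sketch below wraps a str-only algorithm so that bytes input is temporarily decoded with ascii+surrogateescape and the results are encoded back the same way. The names `_split_scheme_str` and `split_scheme` are assumptions made up for this example; this is not the actual urllib.parse patch.

```python
def _split_scheme_str(url):
    """Hypothetical str-only algorithm using a str constant."""
    scheme, _, rest = url.partition(":")
    return scheme, rest


def split_scheme(url):
    """str in -> str out, bytes in -> bytes out (illustrative only)."""
    if isinstance(url, str):
        return _split_scheme_str(url)
    # Temporarily "decode" the bytes. surrogateescape maps each
    # non-ASCII byte to a lone surrogate code point, so encoding with
    # the same handler restores the original bytes exactly.
    text = url.decode("ascii", "surrogateescape")
    return tuple(
        part.encode("ascii", "surrogateescape")
        for part in _split_scheme_str(text)
    )


print(split_scheme("http://example.com/"))
print(split_scheme(b"http://ex\xe9mple/"))  # non-ASCII byte survives the round trip
```

Swapping both codec calls for `"latin-1"` (with no error handler) would round-trip arbitrary bytes as well; in either case the caller only ever sees the type that was passed in.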

My understanding is that email6 in 3.3 will essentially follow that same model. What I believe RDM is suggesting is an in-between approach for the 3.2 email module:

- if you pass in bytes data that isn't 7-bit clean and naively use the str APIs to access the headers, then it will complain loudly if it is about to return escaped data (but will decode the body in accordance with the Content Transfer Encoding)
- if you pass in bytes data and know what you are doing, then you can access that raw bytes data and do your own decoding
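The two cases above can be sketched with plain codec calls. The helpers `contains_escaped_bytes`, `checked_str` and `raw_bytes` are hypothetical names invented for this illustration and are not part of the email package API; the sketch only assumes that the bytes were internalized with ascii+surrogateescape.

```python
def contains_escaped_bytes(text):
    """True if text holds bytes smuggled in via surrogateescape."""
    # surrogateescape maps bytes 0x80-0xFF to U+DC80-U+DCFF.
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


def checked_str(text):
    """Hypothetical 'str API': complain loudly instead of leaking escapes."""
    if contains_escaped_bytes(text):
        raise ValueError("header contains undecoded non-ASCII bytes")
    return text


def raw_bytes(text):
    """Hypothetical 'bytes API': recover the original bytes unchanged."""
    return text.encode("ascii", "surrogateescape")


raw = b"Subject: caf\xe9"  # not 7-bit clean
internal = raw.decode("ascii", "surrogateescape")

try:
    checked_str(internal)
except ValueError as exc:
    print("str access refused:", exc)

# A caller who knows the real encoding can do their own decoding.
print(raw_bytes(internal).decode("latin-1"))
```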

I've probably grossly oversimplified what RDM is suggesting, but it sounds plausible as a useful interim stepping stone to the more comprehensive type separation in email6.

Cheers, Nick.

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
