[Python-Dev] email package status in 3.X

P.J. Eby pje at telecommunity.com
Mon Jun 21 22:09:52 CEST 2010


At 03:29 PM 6/21/2010 -0400, Toshio Kuratomi wrote:

On Mon, Jun 21, 2010 at 01:24:10PM -0400, P.J. Eby wrote:
> At 12:34 PM 6/21/2010 -0400, Toshio Kuratomi wrote:
> >What do you think of making the encoding attribute a mandatory part of
> >creating an ebyte object?  (ex: eb = ebytes(b, 'euc-jp')).
>
> As long as the coercion rules force str+ebytes (or str % ebytes,
> ebytes % str, etc.) to result in another ebytes (and fail if the str
> can't be encoded in the ebytes' encoding), I'm personally fine with
> it, although I really like the idea of tacking the encoding to bytes
> objects in the first place.

I wouldn't like this.  It brings us back to the python2 problem where
sometimes you pass an ebyte into a function and it works and other times
you pass an ebyte into the function and it issues a traceback.

For stdlib functions, this isn't going to happen unless your ebytes' encoding is not compatible with the ascii subset of unicode, or the stdlib function is working with dynamic data... in which case you really do want to fail early!

I don't see this as a repeat of the 2.x situation; rather, it allows you to cause errors to happen much earlier than they would otherwise show up if you were using unicode for your encoded-bytes data.

For example, if your program's intent is to end up with latin-1 output, then it would be better for an error to show up at the very first point where non-latin1 characters are mixed with your data, rather than only showing up at the output boundary!
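To make the "fail at the mixing point" behaviour concrete, here is a minimal sketch of the coercion rules being discussed. The `ebytes` class below is hypothetical (the type was never implemented); the constructor signature comes from the quoted proposal, and the operator behaviour follows the rules described above, so treat it as an illustration rather than a real API:

```python
class ebytes:
    """Hypothetical bytes-with-encoding type (a sketch of the proposal only)."""

    def __init__(self, data, encoding):
        data.decode(encoding)  # validate up front: illegal bytes fail at construction
        self.data = data
        self.encoding = encoding

    def __add__(self, other):
        # ebytes + str yields another ebytes, failing *now* if the str
        # can't be encoded in this ebytes' encoding.
        if isinstance(other, str):
            return ebytes(self.data + other.encode(self.encoding),
                          self.encoding)
        return NotImplemented

    def __radd__(self, other):
        # str + ebytes likewise yields an ebytes, not a str.
        if isinstance(other, str):
            return ebytes(other.encode(self.encoding) + self.data,
                          self.encoding)
        return NotImplemented


eb = ebytes(b"caf\xe9 cr\xe8me", "latin-1")
eb = eb + " to go"        # ASCII-compatible text mixes fine
try:
    eb = eb + " \u20ac3"  # U+20AC EURO SIGN has no latin-1 byte
except UnicodeEncodeError:
    print("error raised at the mixing point, not at the output boundary")
```

With these rules, the very first statement that mixes non-latin-1 text into latin-1-tagged data is the one that raises.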

However, if you promoted mixed-type operation results to unicode instead of ebytes, then you:

  1. can't preserve data that doesn't have a 1:1 mapping to unicode, and

  2. can't detect an error until your data reaches the output point in your application -- forcing you to defensively insert ebytes calls everywhere (vs. simply wrapping them around a handful of designated inputs), or else have to go right back to tracing down where the unusable data showed up in the first place.
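The deferred failure in point 2 is easy to demonstrate with plain str and the standard codecs, no ebytes involved: the bad mix succeeds silently, and the error only surfaces at the output boundary, far from the code that introduced it.

```python
# Promotion to unicode: mixing encoded-bytes data with arbitrary text
# succeeds silently...
data = b"caf\xe9".decode("latin-1")   # bytes in, promoted to str
data = data + " \u20ac2.50"           # non-latin-1 text mixed in: no error here

# ...and the failure only appears much later, when the data finally has
# to leave the program as latin-1 bytes.
try:
    data.encode("latin-1")
except UnicodeEncodeError as exc:
    print("failure deferred to the output boundary:", exc)
```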

One thing that seems like a bit of a blind spot for some folks is that having unicode is not everybody's goal. Not because we don't believe unicode is generally a good thing or anything like that, but because we have to work with systems that flat out don't do unicode, thereby making the presence of (fully-general) unicode an error condition that has to be stamped out!

IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug.

And as it really is an error in that case, it should not pass silently, unless explicitly silenced.

So, what's the advantage of using ebytes instead of bytes?

* It keeps together the text and encoding information when you're taking
  bytes in and want to give bytes back under the same encoding.

* It takes some of the boilerplate that people are supposed to do (checking
  that bytes are legal in a specific encoding) and writes it into the
  initialization of the object.  That forces you to think about the issue at
  two points in the code: when converting into ebytes and when converting
  out to bytes.  For data that's going to be used with both str and bytes,
  this is the accepted best practice.  (For exceptions, the bytes type
  remains, which you can do conversion on when you want to.)
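The boilerplate referred to in the second point is roughly the round-trip check people write by hand at the input boundary today; the helper name below is made up purely for illustration:

```python
def checked_bytes(data, encoding):
    """Hand-written version of the check an ebytes constructor would absorb.

    (The function name is hypothetical; this is just the validation step.)
    """
    data.decode(encoding)  # raises UnicodeDecodeError on illegal bytes
    return data


checked_bytes(b"caf\xe9", "latin-1")      # every byte is legal latin-1: passes
try:
    checked_bytes(b"\xff\xff", "euc-jp")  # 0xFF is not a valid euc-jp lead byte
except UnicodeDecodeError:
    print("rejected at the input boundary")
```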

Hm. For the output case, I suppose that means you might also want the text I/O wrappers to be able to be strict about ebytes' encoding.
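For plain str output, today's stdlib text wrappers can already be this strict; presumably an ebytes-aware wrapper would extend the same behaviour. A sketch of the existing str case, using only the standard io module:

```python
import io

raw = io.BytesIO()
# errors='strict' is the default; spelled out here to make the contract visible.
out = io.TextIOWrapper(raw, encoding="latin-1", errors="strict")

out.write("caf\xe9")   # representable in latin-1: fine
out.flush()
assert raw.getvalue() == b"caf\xe9"

try:
    out.write("\u20ac")  # no latin-1 byte for U+20AC
    out.flush()
except UnicodeEncodeError:
    print("strict wrapper refused non-latin-1 output")
```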


