[Python-Dev] Polymorphic best practices [was: (Not) delaying the 3.2 release]

Michael Foord fuzzyman at voidspace.org.uk
Fri Sep 17 21:25:39 CEST 2010


On 16/09/2010 23:05, Antoine Pitrou wrote:

> On Thu, 16 Sep 2010 16:51:58 -0400
> "R. David Murray"<rdmurray at bitdance.com> wrote:
>
>> What do we store in the model? We could say that the model is always
>> text. But then we lose information about the original bytes message,
>> and we can't reproduce it. For various reasons (mailman being a big
>> one), this is not acceptable. So we could say that the model is
>> always bytes. But we want access to (for example) the header values
>> as text, so header lookup should take string keys and return string
>> values[2].
>
> Why can't you have both in a single class? If you create the class
> using a bytes source (a raw message sent by SMTP, for example), the
> class automatically parses and decodes it to unicode strings; if you
> create the class using a unicode source (the text body of the e-mail
> message and the list of recipients, for example), the class
> automatically creates the bytes representation.

I think something like this would be great for WSGI. Rather than focus
on whether bytes or text should be used, use a higher level object that
provides a bytes view, and (where possible/appropriate) a unicode view
too.

Michael

> (of course all processing can be done lazily for performance reasons)
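A minimal sketch of the "higher level object with a bytes view and a
unicode view" idea, with both views computed lazily. The class name
DualViewMessage and its attributes are hypothetical and are not part of
the email package or of WSGI. It also deliberately treats both views as
the same data under one codec, which sidesteps the abstraction-level
objection raised further down in the thread.

```python
class DualViewMessage:
    """Hypothetical wrapper: built from bytes or str, exposes both views.

    Assumption: both views describe the same data under a single
    (encoding, errors) convention. A real message model would need to
    parse headers and MIME parts instead of decoding everything at once.
    """

    def __init__(self, source, encoding="utf-8", errors="surrogateescape"):
        if not isinstance(source, (bytes, str)):
            raise TypeError("source must be bytes or str")
        self._encoding = encoding
        self._errors = errors
        # Only the view that was supplied is stored up front.
        self._bytes = source if isinstance(source, bytes) else None
        self._text = source if isinstance(source, str) else None

    @property
    def as_bytes(self):
        # Encode lazily, the first time the bytes view is requested.
        if self._bytes is None:
            self._bytes = self._text.encode(self._encoding, self._errors)
        return self._bytes

    @property
    def as_text(self):
        # Decode lazily, the first time the text view is requested.
        if self._text is None:
            self._text = self._bytes.decode(self._encoding, self._errors)
        return self._text
```

With the "surrogateescape" error handler, a message built from bytes
that are not valid utf-8 still yields a text view, and encoding that
text view again gives back the original bytes.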

>> What about email files on disk? They could be bytes, or they could
>> be, effectively, text (for example, utf-8 encoded).
>
> Such a file can be two things:
> - the raw encoding of a whole message (including headers, etc.), then
>   it should be fed as a bytes object
> - the single text body of a hypothetical message, then it should be
>   fed as a unicode object
>
> I don't see any possible middle-ground.
>
>> On disk, using utf-8, one might store the text representation of the
>> message, rather than the wire-format (ASCII encoded) version. We
>> might want to write such messages from scratch.
>
> But then the user knows the encoding (by "user" I mean what/whoever
> calls the email API) and mentions it to the email package.
>
> What I'm having an issue with is that you are talking about a bytes
> representation and a unicode representation of a message. But they
> aren't representations of the same things:
> - if it's a bytes representation, it will be the whole, raw message
>   including envelope / headers (also, MIME sections etc.)
> - if it's a unicode representation, it will only be a section of the
>   message decodable as such (a text/plain MIME section, for example;
>   or a decoded header value; or even a single e-mail address part of
>   a decoded header)
>
> So, there doesn't seem to be any reason for having both a
> BytesMessage and a UnicodeMessage at the same abstraction level. They
> are both representing different things at different abstraction
> levels. I don't see any potential for confusion: raw assembled e-mail
> message = bytes; decoded text section of a message = unicode.
>
> As for the problem of potential "bogus" raw e-mail data (e.g.,
> undecodable headers), well, I guess the library has to make a choice
> between purity and practicality, or perhaps let the user choose
> themselves. For example, through a `strict` flag. If `strict` is
> true, raise an error as soon as a non-decodable byte appears in a
> header; if `strict` is false, decode it through a default (encoding,
> errors) convention which can be overridden by the user (a sensible
> possibility being "utf-8, surrogateescape" to allow for lossless
> round-tripping).
>
>> As I said above, we could insist that files on disk be in
>> wire-format, and for many applications that would work fine, but I
>> think people would get mad at us if we didn't support text files[3].
>
> Again, this simply seems to be two different abstraction levels:
> pre-generated raw email messages including headers, or a single text
> waiting to be embedded in an actual e-mail.
>
>> Anyway, what polymorphism means in email is that if you put in
>> bytes, you get a BytesMessage, if you put in strings you get a
>> StringMessage, and if you want the other one you convert.
>
> And then you have two separate worlds while ultimately the same
> concepts are underlying. A library accepting BytesMessage will crash
> when a program wants to give a StringMessage and vice-versa. That
> doesn't sound very practical.
>
>> [1] Now that surrogateescape exists, one might suppose that strings
>> could be used as an 8bit channel, but that only works if you don't
>> need to parse the non-ASCII data, just transmit it.
>
> Well, you can parse it, precisely. Not only that, but it round-trips
> if you unparse it again:

> >>> headerbytes = b"From: bogus\xFFname<someone at python.com>"
> >>> name, value = headerbytes.decode("utf-8", "surrogateescape").split(":")
> >>> name
> 'From'
> >>> value
> ' bogus\udcffname<someone at python.com>'
> >>> "{0}:{1}".format(name, value).encode("utf-8", "surrogateescape")
> b'From: bogus\xffname<someone at python.com>'
>
> In the end, what I would call a polymorphic best practice is "try to
> avoid bytes/str polymorphism if your domain is well-defined enough"
> (which I admit URLs aren't necessarily; but there's no question a
> single text/XXX e-mail section is text, and a whole assembled e-mail
> message is bytes).
>
> Regards
>
> Antoine.
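The strict-flag behaviour proposed in the quoted message could look
roughly like the sketch below. The function name decode_header_bytes and
its parameters are hypothetical and are not an API of the email package.
The only library behaviour relied on is bytes.decode() with the "strict"
and "surrogateescape" error handlers.

```python
def decode_header_bytes(raw, strict=False,
                        encoding="utf-8", errors="surrogateescape"):
    """Hypothetical helper: decode one raw header line to str.

    strict=True  -> raise UnicodeDecodeError on the first byte that
                    cannot be decoded with `encoding`.
    strict=False -> decode with the (encoding, errors) convention, which
                    the caller may override. The default keeps
                    undecodable bytes as lone surrogates so the original
                    bytes can be recovered later.
    """
    if strict:
        return raw.decode(encoding, "strict")
    return raw.decode(encoding, errors)


def encode_header_text(text, encoding="utf-8", errors="surrogateescape"):
    """Hypothetical inverse of the non-strict path above."""
    return text.encode(encoding, errors)
```

In the non-strict default, encode_header_text(decode_header_bytes(raw))
returns the original bytes, which is the lossless round-trip the message
refers to. If the caller overrides errors with a lossy handler such as
"replace", that guarantee no longer holds.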


-- http://www.ironpythoninaction.com/
