[Python-Dev] Polymorphic best practices [was: (Not) delaying the 3.2 release]
R. David Murray rdmurray at bitdance.com
Thu Sep 16 22:51:58 CEST 2010
On Thu, 16 Sep 2010 17:40:53 +0200, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thu, 16 Sep 2010 11:30:12 -0400
> "R. David Murray" <rdmurray at bitdance.com> wrote:
> >
> > And then BaseHeader uses self.lit.colon, etc, when manipulating strings.
> > It also has to use slice notation rather than indexing when looking at
> > individual characters, which is a PITA but not terrible.
> >
> > I'm not saying this is the best approach, since this is all experimental
> > code at the moment, but it is an approach....
>
> Out of curiosity, can you explain why polymorphism is needed for e-mail?
> I would assume that headers are bytes until they are parsed, at which
> point they become a pair of unicode strings (one for the header name and
> one for its value).
Currently email accepts strings as input, and produces strings as output.
It also needs to accept bytes as input, and emit bytes as output, because unicode can only be used as a 7-bit clean data transmission channel, and that's too restrictive for many email applications, many of which need to deal with "dirty" (non-RFC conformant) 8bit data. [1]
Backward compatibility says "case closed".
If we were designing from scratch, we could insist that input to the parser is always bytes, and when the model is serialized it always produces bytes. It is possible that one could live with that, but I don't think it is optimal.
Given a message, there are many times you want to serialize it as text (for example, for presentation in a UI). You could provide alternate serialization methods to get text out on demand....but then what if someone wants to push that text representation back in to email to rebuild a model of the message? So now we have both a bytes parser and a string parser.
What do we store in the model? We could say that the model is always text. But then we lose information about the original bytes message, and we can't reproduce it. For various reasons (mailman being a big one), this is not acceptable. So we could say that the model is always bytes. But we want access to (for example) the header values as text, so header lookup should take string keys and return string values[2]. But for certain types of processing, particularly examination of "dirty", non-RFC conforming input data, you need to be able to access the raw bytes data.
What about email files on disk? They could be bytes, or they could be, effectively, text (for example, utf-8 encoded). On disk, using utf-8, one might store the text representation of the message, rather than the wire-format (ASCII encoded) version. We might want to write such messages from scratch. As I said above, we could insist that files on disk be in wire-format, and for many applications that would work fine, but I think people would get mad at us if we didn't support text files[3].
So, after much discussion, what we arrived at (so far!) is a model that mimics the Python3 split between bytes and strings. If you start with bytes input, you end up with a BytesMessage object. If you start with string input to the parser, you end up with a StringMessage. If you have a BytesMessage and you want to do something with the text version of the message, you decode it:
    print(mymsg.decode())
If the message is RFC conformant, the message contains all the information needed to decode it correctly. If it's not conformant, email does the best it can and registers defects for the non-conformant bits (or, optionally, email6 will raise errors when the policy is set to strict).
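The lenient-versus-strict behaviour described above can be tried with the email package as it ships in current Python 3. This is only a sketch of the idea, not the experimental email6 API discussed in this message: `email.message_from_bytes`, `policy.default`, `policy.strict`, and the `defects` attribute are today's standard-library names, and the malformed sample message is made up for illustration.

```python
import email
from email import errors, policy

# A multipart message that never declares its boundary: not RFC conformant.
raw = b"Content-Type: multipart/mixed\r\n\r\nnot really multipart\r\n"

# Lenient parsing: the parser does the best it can and records each
# problem it found on the resulting message object.
msg = email.message_from_bytes(raw, policy=policy.default)
defect_names = [type(d).__name__ for d in msg.defects]

# Strict parsing: the same kind of problem is raised as an exception.
try:
    email.message_from_bytes(raw, policy=policy.strict)
    strict_error = None
except errors.MessageDefect as exc:
    strict_error = exc

print(defect_names)
print(repr(strict_error))
```

The same input is accepted in one mode and rejected in the other; only the policy changes, not the parser call.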
If you have a StringMessage and you want to use it where wire-format is needed, you encode it:
    outmsg = mymsg.encode()
    smtpserver.sendmail(
        bytes(outmsg['from']),
        [bytes(x) for x in itertools.chain(
            outmsg['to'], outmsg['cc'], outmsg['bcc'])],
        outmsg.serialize(policy=email.policy.SMTP))
Encoding uses the utf-8 character set by default, but this can be modified by changing the policy. The trick for gathering the list of addresses is how I think that part of the API is going to work: iterating the object that models an address header gives you a list of address objects, and converting one of those to a bytes string gives you the wire-format byte string representing a single address. Also note that this is the new API; in compatibility mode (which is controlled by the policy) you'd get the old behavior of just getting the string representation of the whole header back (but then you'd have to parse it to turn it into a list of addresses).
The point here is that because we've encoded the message to a BytesMessage, what we get when we turn the pieces into a bytes string are the wire-format byte strings that are required for transmission; for example, non-ASCII characters will be encoded according to the policy and then RFC2047 transfer encoded as needed.
At this point you may notice there's a problem with the example above. We actually need to decode each of those byte strings using the ASCII codec before passing them as arguments to smtplib, since smtplib in Python3 expects string arguments. If smtplib were polymorphic we could pass in the bytes strings directly. In that case if a string were passed in instead, smtplib could call some utility routines from email to encode the text into bytes using the RFC2047 conventions. As it stands now, there's no easy way for a user program to construct a list of addresses that require RFC2047 encoding and pass it to smtplib. (This last item is just as much a problem in Python2, by the way.)
This is probably not the right thing to do, though, because that isn't the kind of polymorphism we're talking about. When accepting input to sendmail, smtplib is always bytes out, so having it accept both bytes and strings as input is probably wrong[4]. Especially since the message body needs to be passed in in wire-format, because smtplib should not have to know how to convert text into wire-format...that's the email module's job.
Instead smtplib could take a Message object as input, and do that serialize call itself. In which case it could also figure out the addresses by itself, and/or accept email address objects for the from and to parameters.
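The idea of deriving the envelope from the message itself can be sketched with today's standard library. The helper name `envelope_for` is hypothetical, and the code uses `EmailMessage`, `email.utils.getaddresses`, and `as_bytes(policy=SMTP)` from current Python 3 rather than the email6 objects described in this message. It does no network I/O; it only shows what a message-aware `sendmail` would have to compute.

```python
from email.message import EmailMessage
from email.policy import SMTP
from email.utils import getaddresses


def envelope_for(msg):
    """Hypothetical helper: return (sender, recipients, wire_bytes)."""
    # The envelope sender is the bare address from the From header.
    sender = getaddresses([msg["From"]])[0][1]

    # Recipients come from every To, Cc and Bcc header that is present.
    recipient_headers = []
    for name in ("To", "Cc", "Bcc"):
        recipient_headers.extend(msg.get_all(name, []))
    recipients = [addr for _, addr in getaddresses(recipient_headers)]

    # Serialization to wire format stays the email package's job.
    wire_bytes = msg.as_bytes(policy=SMTP)
    return sender, recipients, wire_bytes


msg = EmailMessage()
msg["From"] = "Alice <alice@example.com>"
msg["To"] = "bob@example.com, Carol <carol@example.com>"
msg["Subject"] = "hello"
msg.set_content("body text\n")

sender, recipients, wire_bytes = envelope_for(msg)
print(sender, recipients)
```

A real implementation would also have to decide what to do with Bcc headers before transmission; this sketch leaves them in place.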
You can see what a can of worms this stuff is :) This is what I meant about carefully examining the API contract before blindly providing polymorphism. For email, a wire-format bytes string contains encoding information, and you have to stay aware of that as you redesign the bytes/string interface.
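To make concrete the point that wire format carries its own encoding information, here is a small sketch using `email.utils.formataddr` and `email.header.decode_header` from the current standard library (not the email6 API under discussion). The name and address are invented for illustration.

```python
from email.header import decode_header
from email.utils import formataddr

# One address whose display name contains non-ASCII characters.
wire_text = formataddr(("Jörg Müller", "jorg@example.com"), charset="utf-8")

# The RFC 2047 form is pure ASCII, so it converts to wire-format bytes.
wire_bytes = wire_text.encode("ascii")

# The encoded word names its own charset, so the display name can be
# recovered without any outside knowledge of how it was encoded.
chunk, charset = decode_header(wire_text)[0]
display_name = chunk.decode(charset)

print(wire_text)
print(charset, display_name)
```

Decoding the bytes with a plain codec would give back the ASCII-armoured form, not the original name, which is why bytes-to-text conversion in email cannot be treated as a simple `decode()` call.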
Anyway, what polymorphism means in email is that if you put in bytes, you get a BytesMessage, if you put in strings you get a StringMessage, and if you want the other one you convert.
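As a toy illustration of that rule, the sketch below dispatches on input type and converts between the two models on request. The class bodies and the `parse` function are hypothetical stand-ins, not the email6 implementation: the real conversion would honour the charset information in the message and apply RFC 2047 encoding, whereas this toy applies a single codec.

```python
class BytesMessage:
    """Toy stand-in for the wire-format model (not the real email6 class)."""

    def __init__(self, data):
        self.data = data

    def decode(self, charset="utf-8"):
        # Assumption: one codec for the whole message, for illustration only.
        return StringMessage(self.data.decode(charset))


class StringMessage:
    """Toy stand-in for the fully decoded text model."""

    def __init__(self, text):
        self.text = text

    def encode(self, charset="utf-8"):
        # Assumption: a real encoder would produce RFC-conformant wire format.
        return BytesMessage(self.text.encode(charset))


def parse(source):
    """Return the model that matches the type of the input."""
    if isinstance(source, bytes):
        return BytesMessage(source)
    if isinstance(source, str):
        return StringMessage(source)
    raise TypeError("expected bytes or str, got %s" % type(source).__name__)


from_wire = parse(b"Subject: hello\r\n\r\nbody\r\n")
from_text = parse("Subject: hello\n\nbody\n")
print(type(from_wire).__name__, type(from_text).__name__)
print(type(from_wire.decode()).__name__, type(from_text.encode()).__name__)
```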
I'm giving consideration to additional polymorphism, such as having the use of a key of a particular type return a value of that type. That is, looking up the subject by the key 'subject' would get you a StringHeader regardless of whether you were looking it up in a BytesMessage or a StringMessage. But I'm still thinking about whether or not that is a good idea; I need to write up some more example code to convince myself one way or another. The sendmail example above is an example on the "no" side: you'll note that in that example the natural thing to do was to use string keys, but get bytes out.
Well, that was probably more than you wanted to know or read, but hopefully it will give some perspective on what's involved here.
Feedback on any of this is welcome. I've got a hole in my schedule next week that I'm planning on filling with email6 work, so any feedback will be grist for the mill. Anyone interested should also sign up for the email-sig mailing list and provide feedback when I start posting there again (which, as I said, should be next week).
-- R. David Murray www.bitdance.com
[1] Now that surrogateescape exists, one might suppose that strings could be used as an 8bit channel, but that only works if you don't need to parse the non-ASCII data, just transmit it. email needs to parse it. In theory email6 could encode back to bytes using surrogateescape and then process, but the infrastructure to handle that still looks like what is described above, so it makes more sense to accept bytes directly.
[2] actually they return StringHeader objects, but the principle is the same.
[3] note that you can also have 7bit clean wire format messages stored on disk as text. These can be read as text, but my current thought is that you must give them to email as bytes (that is, encode them using the ASCII codec). For email6 in the current design, bytes means wire format, and text/string means fully decoded.
[4] unless the strings consist only of 7bit clean wire-format ASCII characters, as is required now. So what we will probably end up with is smtplib.sendmail accepting both bytes and strings for backward compatibility, but string input must continue to be (the equivalent of) the ASCII decode of 7bit clean wire-format data.
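On footnote [1]: a short sketch of why surrogateescape gives a transmission channel but not something you can parse. It uses only the built-in `bytes.decode` and `str.encode` error handler; the header line is an invented example of "dirty" 8bit data.

```python
# A header line containing a raw Latin-1 byte, i.e. not 7-bit clean.
raw = b"Subject: caf\xe9 menu\r\n"

# Decoding as ASCII with surrogateescape smuggles the byte through as a
# lone surrogate code point instead of raising an error.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", errors="surrogateescape")

print(round_tripped == raw)
print("\udce9" in text, "\u00e9" in text)
```

The string holds the placeholder `\udce9`, not the character the sender meant, so any code that inspects the non-ASCII content still has to go back to the bytes.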