Issue 20083: smtplib: support for IDN (international domain names)

Created on 2013-12-28 01:49 by macfreek, last changed 2022-04-11 14:57 by admin.

Messages (7)
msg207017 - Author: Freek Dijkstra (macfreek) Date: 2013-12-28 01:49
smtplib has limited support for non-ASCII domain names in the From or To mail address. It only works for punycode-encoded domain names, submitted as a unicode string (e.g. server.rcpt(u"user@xn--e1afmkfd.ru")).

The following two calls fail:

server.rcpt(u"user@пример.ru"):
  File smtplib.py, line 332, in send
    s = s.encode("ascii")
  UnicodeEncodeError: 'ascii' codec can't encode character '\u03c0' in position 19: ordinal not in range(128)
  http://hg.python.org/cpython/file/3.3/Lib/smtplib.py#l332

server.rcpt(b"user@xn--e1afmkfd.ru"):
  File email/_parseaddr.py, line 236, in gotonext
    if self.field[self.pos] in self.LWS + '\n\r':
  TypeError: 'in <string>' requires string as left operand, not int
  http://hg.python.org/cpython/file/3.3/Lib/email/_parseaddr.py#l236

There are three ways to solve this (from trivial to complex):
* Make it clear in the documentation what type of input is expected.
* Accept punycode-encoded domain names in email addresses, either in string or binary format.
* Accept Unicode-encoded domain names, and do the punycode encoding in smtplib if required.

References:
https://tools.ietf.org/html/rfc5891: Internationalized Domain Names in Applications (IDNA): Protocol
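The first failure can be reproduced without an SMTP server, since it comes down to the ASCII encoding step quoted in the traceback. The following is only a minimal sketch of that step, not the smtplib code itself; the variable names are made up for illustration.

```python
# Sketch of the encoding step behind the first failure; no server needed.
# Variable names are illustrative and do not come from smtplib.
unicode_address = "user@пример.ru"
punycode_address = "user@xn--e1afmkfd.ru"

try:
    unicode_address.encode("ascii")
    unicode_failed = False
except UnicodeEncodeError:
    # Same exception type as in the reported traceback.
    unicode_failed = True

# The punycode form is plain ASCII, so the same step succeeds.
wire_bytes = punycode_address.encode("ascii")
```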
msg207019 - Author: Freek Dijkstra (macfreek) Date: 2013-12-28 01:53
This issue deals with international domain names in email addresses (the part behind the "@"). See issue 20084 for the issue that deals with the part before the "@".
msg207041 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-28 17:46
Thanks for the suggestion. Once the issue 11783 patch is committed, smtplib can be changed to use formataddr in quoteaddr, which will result in the domain being punycoded automatically. (It's too bad I forgot about that issue, since the 3.4 beta deadline has already passed :(

The input to the commands is string, not bytes, so you can already pre-encode yourself, as you noted. The commands don't accept bytes, and should not, since the data they cause to be sent on the wire may not contain non-ASCII characters; there is thus no need to generate binary.

SMTPUTF8 will of course require generating binary data in these contexts, but in that case the correct way to generate the binary is by utf-8 encoding the unicode input, so there will again be no reason for the commands to accept binary input, and it will be better if they don't. (If you need to generate invalid data, say for a test scenario, you can drop down to executing 'send' calls manually.)

(Note: using the 'u' prefix in python3, while supported for backward compatibility, is only confusing when used outside of that context...I thought you were talking about 2.7 until I read carefully.)
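As a rough illustration of the "pre-encode yourself" workaround mentioned above, the domain part can be converted with Python's built-in "idna" codec before the address is passed to smtplib. This is a sketch under assumptions: the helper name punycode_domain is hypothetical and not part of smtplib, the codec implements the older IDNA 2003 rules rather than RFC 5891, and the local part is left untouched (that is the subject of issue 20084).

```python
def punycode_domain(address: str) -> str:
    """Return the address with its domain part punycode-encoded.

    Hypothetical helper for illustration; not part of smtplib.
    """
    # Split on the last "@" so that the part after it is the domain.
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" found: nothing to encode, return the input unchanged.
        return address
    # The built-in "idna" codec converts each label to its ASCII form.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"
```

The result is a plain ASCII string, so it could then be passed to a call such as server.rcpt(punycode_domain("user@пример.ru")).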
msg207044 - Author: Freek Dijkstra (macfreek) Date: 2013-12-28 19:16
Great to hear that a patch already exists (sorry I couldn't find it in the tracker). Feel free to close this issue as duplicate of issue 11783. (As for the u"string", I wanted to distinguish it from b'string'. I don't use it in code (since the backward compatibility is only present in 3.3+, not in 3.2). Sorry for the confusion.)
msg207045 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-28 19:18
No, that issue is about the email library. So we need this one too for the equivalent enhancement to smtplib.
msg207053 - Author: Freek Dijkstra (macfreek) Date: 2013-12-28 20:44
Since smtplib.quoteaddr() uses email.utils.parseaddr(), and the patch for issue 11783 fixes email.utils.parseaddr(), that patch will hopefully solve this issue as well (though a test case wouldn't hurt for sure). What I had not realised is that hostnames are also used elsewhere, in particular in the ehlo() and helo() but also in connect(). Do you consider that a separate issue or part of this issue? Are there other places where you think a fix is needed? I may be able to create a patch, though bear with me: I never checked out the source for Python or the standard library (other than installing point releases through my package manager).
msg207064 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-12-29 02:12
A call to formataddr will need to be added to quoteaddr. And yes, test cases are needed. I don't believe that the format of the HELO/EHLO message is defined by the RFC, so I don't think we can automatically parse it. I think we just have to leave the domain name encoded as punycode there. Regardless, though, yes I would consider that a separate issue. If you want to work on a patch, that would be great. For guidance on doing so, you can take a look at http://docs.python.org/devguide. You can also help me to remember to commit 11783 after the final release of 3.4.0.
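For the HELO/EHLO case, where the name is to be left punycode-encoded, one possible approach is to encode the local hostname before handing it to smtplib through the existing local_hostname parameter of smtplib.SMTP. The sketch below assumes the built-in "idna" codec (IDNA 2003) is acceptable; the function name make_client is hypothetical. No host is passed to the constructor, so no connection is attempted.

```python
import smtplib


def make_client(local_hostname: str) -> smtplib.SMTP:
    """Create an unconnected SMTP client with a punycoded HELO/EHLO name.

    Hypothetical helper for illustration; not part of smtplib.
    """
    # Convert the hostname to its ASCII (punycode) form first.
    ascii_name = local_hostname.encode("idna").decode("ascii")
    # Without a host argument the constructor does not connect.
    return smtplib.SMTP(local_hostname=ascii_name)


client = make_client("пример.ru")
```

A later client.connect(...) followed by client.ehlo() would then use the encoded name stored in client.local_hostname.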
History
Date User Action Args
2022-04-11 14:57:56 admin set github: 64282
2014-06-12 20:23:23 zvyn set nosy: + jesstess, zvyn
2013-12-29 02:12:48 r.david.murray set messages: +
2013-12-28 20:44:17 macfreek set messages: +
2013-12-28 19:18:45 r.david.murray set resolution: duplicate ->; messages: +
2013-12-28 19:16:39 macfreek set resolution: duplicate; messages: +
2013-12-28 17:46:01 r.david.murray set versions: + Python 3.5; nosy: + barry, r.david.murray; messages: +; dependencies: + email parseaddr and formataddr should be IDNA aware; components: + email
2013-12-28 01:53:10 macfreek set messages: +
2013-12-28 01:49:23 macfreek create