Issue 31089: email.utils.parseaddr fails on odd double quotes in multiline header

Created on 2017-07-31 14:43 by robertus, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (5)
msg299558 - Author: Robert (robertus) Date: 2017-07-31 14:43
email.utils.parseaddr() fails to parse a field value into a (comment, address) pair if the From header spans two or more lines and each line contains an odd number of double quotes. The address in the returned tuple is then not the e-mail address but part of the comment. For example:

"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?= =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>

is parsed into:

('', '=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_ _PATO_Nieruch?=')

Full example on Python 2.7.12, email 4.0.2:

Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.utils import parseaddr
>>> parseaddr('"=?UTF8?Q?Anita_=W4=86ieckli=C5=84ska_ _PATO_Nieruch?=\r\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>')
('', '=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_
msg299568 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-07-31 15:32
parseaddr does what you expect if the message has been read using universal newline mode (i.e. the linesep is \n):

>>> parseaddr('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>"')
('=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_ _PATO_Nieruch?=\n =?UTF-8?Q?omo=C5=9Bci?=', 'anita.wiecklinska@pato.com.pl')

I suppose this wouldn't be *that* hard to fix. If it isn't too complex and you want to propose a patch I'll take a look. In any case it works fine in python3 using the new policies:

>>> from email import message_from_string as mfs
>>> from email.policy import default
>>> m = mfs('From: "=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_ _PATO_Nieruch?=\r\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>"\r\n\r\ntest', policy=default)
>>> m['from'].addresses
(Address(display_name='Anita =W4\udc86ieckliƄska
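For readers who want to try the Python 3 route, here is a minimal self-contained sketch of it. The display name and the address anita@example.com are made-up placeholders rather than the reporter's data, and first_from_address is a hypothetical helper name used only for illustration.

```python
from email import message_from_string
from email.policy import default

# Hypothetical message: the quoted display name is folded with CRLF + space,
# which is the shape of header that trips up parseaddr in this report.
RAW_MESSAGE = (
    'From: "=?UTF-8?Q?Anita_?=\r\n'
    ' =?UTF-8?Q?Nowak?=" <anita@example.com>\r\n'
    '\r\n'
    'test'
)


def first_from_address(raw_message):
    """Parse a message with the default policy and return the first From address."""
    msg = message_from_string(raw_message, policy=default)
    return msg['from'].addresses[0]


address = first_from_address(RAW_MESSAGE)
print(address.addr_spec)           # the bare e-mail address
print(repr(address.display_name))  # the display name as decoded by the policy
```

The returned Address object also exposes username and domain, so no further string splitting should be needed.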
msg299569 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-07-31 15:41
Ah, I take it back. With \n it retains the \n in the decoded name field. There is a bug of some sort here (\r\n should be treated the same as \n, I think, whatever way it is treated). I don't think this is worth addressing, given that the new policies provide a much better API for interacting with Messages, and you can in fact easily unfold the line before parsing it if you need to do it in 2.7:

>>> parseaddr(''.join(m['from'].splitlines()))
('=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?= =?UTF-8?Q?omo=C5=9Bci?=', 'anita.wiecklinska@pato.com.pl')
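The unfolding workaround can be wrapped in a small helper. This is only a sketch: it swaps the splitlines() call above for a regular expression that removes a line break only when a space or tab follows it, so other newlines in the value are left alone. The sample header and the name parse_folded_address are assumptions made for illustration.

```python
import re
from email.utils import parseaddr

# A line break (CRLF or bare LF) followed by a space or tab is header folding.
# The lookahead keeps the whitespace itself, so words stay separated.
_FOLD = re.compile(r'\r?\n(?=[ \t])')


def parse_folded_address(header_value):
    """Unfold a raw header value, then split it into (name, address)."""
    return parseaddr(_FOLD.sub('', header_value))


# Hypothetical folded header value with a quoted, encoded display name.
folded = '"=?UTF-8?Q?Anita_?=\r\n =?UTF-8?Q?Nowak?=" <anita@example.com>'
name, addr = parse_folded_address(folded)
print((name, addr))
```

The name part still contains the raw encoded words; decoding them is a separate step.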
msg299616 - Author: Robert (robertus) Date: 2017-08-01 09:29
The RFC on this topic looks quite complicated to me, but I know that \r\n is used for line breaking in e-mail headers and \n is not. So in my opinion it shouldn't be treated the same as \n. The \r\n should be removed from the parsed text, but \n should be preserved like any other character. So I don't think "universal newline mode" is the correct approach for reading raw e-mails. I have tested the policies in python3 - you are right - it works. But I cannot use it because of the application's incompatibility with python3. I was hoping it would be easy to fix for someone more experienced than me... If not - you can close the issue and I will stay with the present solution (removing \r\n). Thanks for all your help!
msg299621 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-08-01 15:34
Yes, that is most likely why parseaddr operates the way it does. The old email package does not do very much hand-holding; it expects you to understand the RFCs, which as you note is a rather daunting task. The new email package (the new policies) in python3 aims to incorporate as much understanding of the RFCs into the library as possible and "do the right thing" automatically so you don't have to worry about it (it can't hide 100%, though...). As for universal newline mode, you are correct that technically \n by itself is data per the RFC (and illegal in the middle of a quoted string like that), but the way Python handles "text" is to convert \r\n into \n internally. So while parseaddr is doing the "right thing" per the RFC, the input parsing parts of the email package in fact accept \n or even mixed line endings to accommodate the difference between unix/python line endings and RFC line endings.
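The tolerance for different line endings can be checked with a short sketch. It feeds the same made-up message to the Python 3 parser three times: once with CRLF, once with bare LF, and once with the two mixed. The message text and the helper name from_addr_spec are assumptions for illustration only.

```python
from email import message_from_string
from email.policy import default

# Hypothetical messages that differ only in their line endings.
VARIANTS = {
    'crlf': 'From: "Anita\r\n Nowak" <anita@example.com>\r\n\r\ntest',
    'lf': 'From: "Anita\n Nowak" <anita@example.com>\n\ntest',
    'mixed': 'From: "Anita\n Nowak" <anita@example.com>\r\n\r\ntest',
}


def from_addr_spec(raw_message):
    """Return the addr_spec of the first From address in a raw message."""
    msg = message_from_string(raw_message, policy=default)
    return msg['from'].addresses[0].addr_spec


results = {label: from_addr_spec(raw) for label, raw in VARIANTS.items()}
print(results)
```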
History
Date User Action Args
2022-04-11 14:58:49 admin set github: 75272
2017-08-01 15:34:23 r.david.murray set messages: +
2017-08-01 09:29:36 robertus set status: open -> closed; resolution: wont fix; messages: +; stage: resolved
2017-07-31 15:41:58 r.david.murray set messages: +
2017-07-31 15:32:47 r.david.murray set messages: +
2017-07-31 14:43:59 robertus set nosy: + barry, r.david.murray; type: behavior; components: + email
2017-07-31 14:43:18 robertus create