Issue 31089: email.utils.parseaddr fails on odd double quotes in multiline header

Created on 2017-07-31 14:43 by robertus, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (5)
msg299558	Author: Robert (robertus)	Date: 2017-07-31 14:43
email.utils.parseaddr() does not successfully parse a field value into a (comment, address) pair if the From header spans two or more lines and each of them contains an odd number of double quotes. The address in the resulting tuple is not an e-mail address but a part of the comment.
For example:
"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?= =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>
is parsed into:
('', '=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=')
Full example on Python 2.7.12, email 4.0.2:
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.utils import parseaddr
>>> parseaddr('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\r\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>')
('', '=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_
msg299568	Author: R. David Murray (r.david.murray) *	Date: 2017-07-31 15:32
parseaddr does what you expect if the message has been read using universal newline mode (i.e. the linesep is \n):
>>> parseaddr('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>"')
('=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\n =?UTF-8?Q?omo=C5=9Bci?=', 'anita.wiecklinska@pato.com.pl')
I suppose this wouldn't be that hard to fix. If it isn't too complex and you want to propose a patch I'll take a look.
In any case it works fine in python3 using the new policies:
>>> from email import message_from_string as mfs
>>> from email.policy import default
>>> m = mfs('From: "=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\r\n =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>"\r\n\r\ntest', policy=default)
>>> m['from'].addresses
(Address(display_name='Anita =W4\udc86iecklińska
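The python3 route described above can be sketched as a standalone script. This assumes Python 3.6 or later; the names `raw`, `msg`, and `first` are illustrative, and the stray trailing quote after the address in the session above is left out. Only the address part is printed, because the decoded display name contains a surrogate character that may not print on every terminal.

```python
from email import message_from_string
from email.policy import default

# A folded From header (CRLF followed by a space) plus a minimal body.
raw = (
    'From: "=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\r\n'
    ' =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>\r\n'
    '\r\n'
    'test'
)

# With policy=default the header is unfolded and parsed into Address objects.
msg = message_from_string(raw, policy=default)
first = msg['from'].addresses[0]
print(first.addr_spec)  # anita.wiecklinska@pato.com.pl
```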
msg299569	Author: R. David Murray (r.david.murray) *	Date: 2017-07-31 15:41
Ah, I take it back. With \n it retains the \n in the decoded name field. There is a bug of some sort here (\r\n should be treated the same as \n, I think, whatever way it is treated). I don't think this is worth addressing, given that the new policies provide a much better API for interacting with Messages, and you can in fact easily unfold the line before parsing it if you need to do it in 2.7:
>>> parseaddr(''.join(m['from'].splitlines()))
('=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?= =?UTF-8?Q?omo=C5=9Bci?=', 'anita.wiecklinska@pato.com.pl')
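The unfolding workaround above, applied to a raw header string instead of a parsed message, might look like the sketch below. The helper name `parse_folded_address` is hypothetical; it joins the physical lines the same way as the one-liner above and then calls parseaddr.

```python
from email.utils import parseaddr

def parse_folded_address(value):
    """Hypothetical helper: drop line breaks, then parse the address."""
    # str.splitlines() strips \r\n and \n alike, so the folded header
    # collapses to one line. The leading space of the continuation line
    # is kept and separates the two encoded words.
    unfolded = ''.join(value.splitlines())
    return parseaddr(unfolded)

folded = ('"=?UTF-8?Q?Anita_=W4=86ieckli=C5=84ska_|_PATO_Nieruch?=\r\n'
          ' =?UTF-8?Q?omo=C5=9Bci?=" <anita.wiecklinska@pato.com.pl>')
name, address = parse_folded_address(folded)
print(address)
```

Note that this removes bare \n characters as well, so it does not distinguish between RFC folding and a newline that is part of the data.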
msg299616	Author: Robert (robertus)	Date: 2017-08-01 09:29
The RFC on this topic looks quite complicated to me, but I know that \r\n is used for line breaking in e-mail headers and \n is not. So in my opinion it shouldn't be treated the same as \n. The \r\n should be removed from the parsed text, but \n should be preserved like any other character. So I don't think "universal newline mode" is the correct approach for reading raw e-mails. I have tested the policies in python3 - you are right - it works. But I cannot use them because my application is incompatible with python3. I was hoping it would be easy to fix for someone more experienced than me... If not - you can close the issue and I will stay with my present solution (removing \r\n). Thanks for all your help!
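The distinction drawn here, where \r\n is folding and a bare \n is data, can be sketched as a stricter unfolding step. This assumes the RFC 5322 rule that unfolding removes a CRLF that is immediately followed by a space or tab; the name `unfold_crlf` is illustrative and not part of the email package.

```python
import re

# A CRLF immediately followed by a space or tab is a fold. The lookahead
# keeps the whitespace itself; only the CRLF is removed.
_FOLD = re.compile(r'\r\n(?=[ \t])')

def unfold_crlf(value):
    """Illustrative helper: remove RFC-style folds, leave bare \\n alone."""
    return _FOLD.sub('', value)

crlf_folded = 'first\r\n second'
bare_newline = 'first\n second'

print(repr(unfold_crlf(crlf_folded)))   # 'first second'
print(repr(unfold_crlf(bare_newline)))  # 'first\n second'
```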
msg299621	Author: R. David Murray (r.david.murray) *	Date: 2017-08-01 15:34
Yes, that is most likely why parseaddr operates the way it does. The old email package does not do very much hand-holding; it expects you to understand the RFCs, which as you note is a rather daunting task. The new email package (the new policies) in python3 aims to incorporate as much understanding of the RFCs into the library as possible and "do the right thing" automatically so you don't have to worry about it (it can't hide 100%, though...). As for universal newline mode, you are correct that technically \n by itself is data per the RFC (and illegal in the middle of a quoted string like that), but the way Python handles "text" is to convert \r\n into \n internally. So while parseaddr is doing the "right thing" per the RFC, the input parsing parts of the email package in fact accept \n or even mixed line endings to accommodate the difference between unix/python line endings and RFC line endings.
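The tolerance for different line endings described above can be checked with a small sketch under the python3 `default` policy. The helper `from_addr_spec` and the shortened header with an example.com address are made up for illustration.

```python
from email import message_from_string
from email.policy import default

def from_addr_spec(raw):
    """Return the addr_spec of the first From address (illustrative helper)."""
    msg = message_from_string(raw, policy=default)
    return msg['from'].addresses[0].addr_spec

# The same folded header, once with RFC line endings and once with \n.
template = 'From: "Anita{eol} Example" <anita@example.com>{eol}{eol}test'

with_crlf = from_addr_spec(template.format(eol='\r\n'))
with_lf = from_addr_spec(template.format(eol='\n'))
print(with_crlf == with_lf)
```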

History
Date	User	Action	Args
2022-04-11 14:58:49	admin	set	github: 75272
2017-08-01 15:34:23	r.david.murray	set	messages: +
2017-08-01 09:29:36	robertus	set	status: open -> closed; resolution: wont fix; messages: +; stage: resolved
2017-07-31 15:41:58	r.david.murray	set	messages: +
2017-07-31 15:32:47	r.david.murray	set	messages: +
2017-07-31 14:43:59	robertus	set	nosy: + barry, r.david.murray; type: behavior; components: + email
2017-07-31 14:43:18	robertus	create