Issue 4958: email/header.py ecre regular expression issue
Hello.
I have a dedicated mail server at home that holds about 1 GB of mail. Most of the mail is in non-UTF-8 code pages, so today I wrote a little script that should recode all messages to UTF-8. But I found that email.header.decode_header parses some headers incorrectly.
For example, the header
Content-Type: application/x-msword; name="2008 =?windows-1251?B?wu7v8O7x+w==?= 2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="
is parsed as
[('application/x-msword; name="2008', None), ('\xc2\xee\xef\xf0\xee\xf1\xfb', 'windows-1251'), ('2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="', None)]
which is obviously wrong: the second encoded word is left undecoded.
Now I'm playing with the email/header.py file from the Python 2.5 Debian package (it is the same in version 2.6.1, except that every <> was changed to !=).
If it's patched with
==================BEGIN CUT==================
--- oldheader.py 2009-01-16 01:47:32.553130030 +0300
+++ header.py 2009-01-16 01:47:16.783119846 +0300
@@ -39,7 +39,6 @@
   \?                    # literal ?
   (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
   \?=                   # literal ?=
-  (?=[ \t]|$)           # whitespace or the end of the string
   ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
 
 # Field name regexp, including trailing colon, but not separating whitespace,
==================END CUT==================
it works fine.
So I wonder whether the (?=[ \t]|$) lookahead (whitespace or the end of the string) is really needed. After all, if there is only whitespace after an encoded word, it is just appended to the list by
parts = ecre.split(line)
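To illustrate the question, here is a minimal sketch that compiles a pattern modelled on ecre with and without the trailing lookahead and lists the encoded words each variant finds in the example header. The names BASE, LOOKAHEAD and encoded_words are made up for this illustration, and the sketch uses re directly rather than email.header, so it shows only the matching step, not decode_header itself.

```python
import re

# Body shared by both variants, modelled on ecre in email/header.py.
BASE = r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
'''
# The lookahead that the patch removes.
LOOKAHEAD = r'(?=[ \t]|$)  # whitespace or the end of the string'
FLAGS = re.VERBOSE | re.IGNORECASE | re.MULTILINE

with_lookahead = re.compile(BASE + LOOKAHEAD, FLAGS)
without_lookahead = re.compile(BASE, FLAGS)

header = ('application/x-msword; name="2008 =?windows-1251?B?wu7v8O7x+w==?= '
          '2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')


def encoded_words(pattern, text):
    """Return the payload of every encoded word that the pattern finds."""
    return [m.group('encoded') for m in pattern.finditer(text)]


strict_words = encoded_words(with_lookahead, header)
relaxed_words = encoded_words(without_lookahead, header)
print(strict_words)   # only the word that is followed by a space
print(relaxed_words)  # both words, including the one followed by a quote
```

The second encoded word is followed by a double quote rather than whitespace, which is why the variant with the lookahead skips it.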
-- Also, there is a related mailing list thread: http://mail.python.org/pipermail/python-dev/2009-January/085088.html
Your example header is invalid. Excerpt from RFC 2047 <http://www.ietf.org/rfc/rfc2047.txt>, section 5:
- An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'.
Even in the places where an "encoded word" (the sequence =?...?=) is allowed, it must always be surrounded by whitespace -- this is by design in the RFC.
If you have many of those invalid headers, you'll have to "cook" the output of decode_header, possibly detecting malformed sequences and calling decode_header again with just the offending substring.
I don't think that Python should accept malformed headers, but if you come up with a good solution you may publish the recipe in the Python cookbook <http://www.activestate.com/ASPN/Python/Cookbook/>.
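As a starting point for such a recipe, here is a rough sketch for Python 3. It takes a different route from the one described above: instead of calling decode_header a second time, it finds encoded words with a relaxed pattern and decodes them directly with base64 and quopri, so the result does not depend on how a particular version of email.header splits the header. The names RELAXED and decode_lenient are hypothetical, and the code deliberately accepts input that RFC 2047 forbids.

```python
import base64
import quopri
import re

# Hypothetical relaxed pattern: an encoded word with no requirement
# that it be followed by whitespace or the end of the string.
RELAXED = re.compile(r'=\?([^?]*?)\?([qb])\?(.*?)\?=', re.IGNORECASE)


def decode_lenient(value):
    """Split value into (text, charset) pairs; charset is None for plain text."""
    pieces = []
    pos = 0
    for m in RELAXED.finditer(value):
        if m.start() > pos:
            # Plain text between encoded words is kept verbatim.
            pieces.append((value[pos:m.start()], None))
        charset = m.group(1).lower()
        encoding = m.group(2).lower()
        payload = m.group(3)
        if encoding == 'b':
            raw = base64.b64decode(payload)
        else:
            # header=True makes "_" decode to a space, as in Q encoding.
            raw = quopri.decodestring(payload.encode('ascii'), header=True)
        pieces.append((raw.decode(charset, errors='replace'), charset))
        pos = m.end()
    if pos < len(value):
        pieces.append((value[pos:], None))
    return pieces


pieces = decode_lenient(
    'name="2008 =?windows-1251?B?wu7v8O7x+w==?= '
    '2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')
for text, charset in pieces:
    print(repr(text), charset)
```

The sketch does no error handling: an unknown charset or badly padded base64 raises an exception, and unlike decode_header it keeps the whitespace between adjacent encoded words.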
I'd close this report as invalid.