msg132657 - (view) |
Author: valera (wally1980) |
Date: 2011-03-31 12:04 |
The mailbox.mbox parser splits mbox files on the "^From " pattern, which is wrong; in fact it should split the mbox on "\nFrom ". Illustration: ------ From bla-blah@localhost Header1 Header2 body1 body2 From blah-blah2@localhost Header1 body1 From your dear friend body3 ------ This mbox would be split into 3 messages instead of 2 |
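The difference between the two splitting rules can be sketched in a few lines. This is only an illustration, not the code in the mailbox module: the helper names are made up, and the placement of blank lines in SAMPLE is an assumption, since the illustration above lost its line breaks.

```python
import re

# Assumed layout of the illustration: a blank line ends each header block
# and each message, but "From your dear friend" is a plain body line.
SAMPLE = (
    "From bla-blah@localhost\n"
    "Header1\n"
    "Header2\n"
    "\n"
    "body1\n"
    "body2\n"
    "\n"
    "From blah-blah2@localhost\n"
    "Header1\n"
    "\n"
    "body1\n"
    "From your dear friend\n"
    "body3\n"
)


def _cut(text, starts):
    """Slice text into chunks beginning at each offset in starts."""
    return [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]


def split_on_line_start(text):
    """Start a new message at every line that begins with 'From '."""
    starts = [m.start() for m in re.finditer(r"^From ", text, re.MULTILINE)]
    return _cut(text, starts)


def split_after_blank_line(text):
    """Start a new message only at the start of the file or after a blank line."""
    starts = [m.start(1) for m in re.finditer(r"(?:\A|\n\n)(From )", text)]
    return _cut(text, starts)


print(len(split_on_line_start(SAMPLE)))     # every "From " line counts
print(len(split_after_blank_line(SAMPLE)))  # unquoted body line is ignored
```

With this sample the first rule yields three chunks and the second yields two, which is the discrepancy reported here.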
|
|
msg132671 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2011-03-31 14:13 |
All the references I could find talk about triggering the match without the preceding newline. That is, it is not certain that a blank line will precede the 'From ' header, and the typical quoting rules for mbox format call for any 'From ' at the start of a line (whether preceded by a blank line or not) to be quoted. This might have something to do with the fact that otherwise you have to special case the first line of the mbox, but I don't really know. What tool are you using that is producing the unquoted 'From ' lines in your mbox? I know there are variants on the mbox format, so if one of them has the format you propose, this would become a feature request to support that variant mbox format. |
|
|
msg132687 - (view) |
Author: valera (wally1980) |
Date: 2011-03-31 16:48 |
Hello, David! This is an email from the netcraft mailing list. The host which accepted it is running sendmail with some antivirus software on top: mimedefang + spamassassin, from what I know. It could be that something is broken in that chain. I spotted the error when I was writing a script for mailbox --> maildir conversion while migrating this server, so I had to subclass mailbox.mbox and fix it as I needed. I'll investigate further what led to such behaviour. Nevertheless, here is a snippet from rfc4155: "In order to improve interoperability among messaging systems, this memo defines a "default" mbox database format, which MUST be supported by all implementations that claim to be compliant with this specification. The "default" mbox database format uses a linear sequence of Internet messages, with each message being immediately prefaced by a separator line, and being terminated by an empty line." So I think assuming that there should be an empty line before the "From " separator line is fine (for the second email and further) and would help to deal with all kinds of mbox mailboxes; the fix is rather trivial. Best regards, Valery Masiutsin |
|
|
msg138245 - (view) |
Author: Steffen Daode Nurpmeso (sdaoden) |
Date: 2011-06-13 13:56 |
Hello Valery Masiutsin, I recently stumbled over this while searching for the link to the standard I've stored in another issue. (Without being logged in, say.) The de-facto standard (http://qmail.org/man/man5/mbox.html) says: HOW A MESSAGE IS READ: "A reader scans through an mbox file looking for From_ lines. Any From_ line marks the beginning of a message. The reader should not attempt to take advantage of the fact that every From_ line (past the beginning of the file) is preceded by a blank line." This is however the recent version. The "mbox" manpage of my up-to-date Mac OS X 10.6.7 does not state this, for example. It's from 2002. However, all known MBOX standards, i.e. MBOXO, MBOXRD, MBOXCL, require proper quoting of non-From_ "From " lines (by preceding them with '>'). So your example should not fail in Python. (But hey - are you sure *that* has been produced by Perl?) You're right however that Python seems to only support the old MBOXO way of un-escaping only plain "From " to/from ">From ", which is not even mentioned anymore in the current standard - that only describes MBOXRD ("(>*From )" -> ">"+match.group(1)). (Lucky me: I own Mac OS X, otherwise I wouldn't even know.) Thus you're in trouble if the unescaping is performed before the split.. This is another issue, though: "MBOX parser uses MBOXO algorithm". ;> - Ciao, Steffen |
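To make the MBOXO/MBOXRD difference concrete, here is a rough sketch of the two quoting rules as described above. The function names are hypothetical and this is not what the mailbox module does internally; it only applies the "(>*From )" rule to a message body held as a string.

```python
import re


def mboxo_quote(body):
    """MBOXO: prefix only plain 'From ' lines with '>'."""
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)


def mboxrd_quote(body):
    """MBOXRD: add one '>' to every line matching '>*From '."""
    return re.sub(r"^(>*From )", r">\1", body, flags=re.MULTILINE)


def mboxrd_unquote(body):
    """MBOXRD: strip exactly one '>' from lines matching '>+From '."""
    return re.sub(r"^>(>*From )", r"\1", body, flags=re.MULTILINE)


body = "From a friend\n>From an old quote\nplain text\n"
print(mboxrd_quote(body))
print(mboxo_quote(body))
```

MBOXRD quoting is reversible because every level of '>' is shifted by one. With MBOXO both "From " lines in the sample end up as ">From ", so the reader can no longer tell which one was quoted in the original.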
|
|
msg163812 - (view) |
Author: Petri Lehtinen (petri.lehtinen) *  |
Date: 2012-06-24 17:41 |
It seems to me that "^From " is the correct way to match the start of each message. This is also what the qmail manual page (linked in the previous message) says. So closing as invalid. |
|
|
msg163872 - (view) |
Author: valera (wally1980) |
Date: 2012-06-24 23:03 |
Hello Petri, the qmail manpage does not sound like a valid reference to me. I've pointed to the relevant RFC (which dictates the correct behaviour) as a reference; the Python mbox parser does not conform to it. Best regards, Valery Masiutsin |
|
|
msg163902 - (view) |
Author: Petri Lehtinen (petri.lehtinen) *  |
Date: 2012-06-25 06:15 |
Actually, you're right. Sorry for overlooking the RFC. But that said, the RFC itself refers to the same manpage as a reference that's "mostly authoritative for those variations that are otherwise only documented in anecdotal form". So I guess it's quite a good reference after all :) In Appendix A, RFC 4155 defines a set of rules for a "default" mbox format that maximizes interoperability between different mbox implementations. The important things in the RFC concerning this issue are: * There MUST be an empty line after each message. * The RFC does not specify any escape syntax for message body lines starting with "From ". It says: "Recipient systems are expected to parse full separator lines as they are documented above." Because the RFC states that there must be an empty line after each message, and it aims for maximum interoperability, I think we can assume that there always is an empty line there. But looking for "\n\nFrom " is not enough for finding the starting points of messages. We should actually parse the whole separator line which consists of "From ", an email address (addr-spec in RFC 2822), a timestamp (in UNIX ctime format without timezone), and a newline character. I think this should be the default mode for reading mbox files. See #13698 for adding support for other formats. |
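Parsing the whole separator line could look roughly like the sketch below. It is a simplification under stated assumptions: the address check is a crude stand-in for a real RFC 2822 addr-spec parser, the timestamp is validated with time.strptime in the default C locale, and is_separator_line is a hypothetical helper, not an existing mailbox API.

```python
import re
import time

# "From " + something that looks like an address + the rest of the line.
# Assumption: \S+@\S+ is a deliberately loose approximation of addr-spec.
_SEPARATOR = re.compile(r"From (\S+@\S+) (.+)$")

# UNIX ctime layout without timezone, e.g. "Thu Mar 31 12:04:00 2011".
_CTIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def is_separator_line(line):
    """Return True if line looks like a full mbox separator line."""
    match = _SEPARATOR.match(line.rstrip("\r\n"))
    if match is None:
        return False
    try:
        time.strptime(match.group(2), _CTIME_FORMAT)
    except ValueError:
        return False
    return True


print(is_separator_line("From valera@localhost Thu Mar 31 12:04:00 2011\n"))
print(is_separator_line("From your dear friend\n"))
```

A body line such as "From your dear friend" fails both the address and the timestamp check, so it would not be taken for the start of a message even without any quoting.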
|
|
msg164636 - (view) |
Author: Petri Lehtinen (petri.lehtinen) *  |
Date: 2012-07-04 04:24 |
Some thoughts on doing "clever tricks" to enhance mbox parsing: http://www.jwz.org/doc/content-length.html |
|
|