Issue 13698: Mailbox module should support other mbox formats in addition to mboxo

Created on 2012-01-02 21:46 by endolith, last changed 2022-04-11 14:57 by admin.

Messages (8)
msg150478 - (view) Author: (endolith) Date: 2012-01-02 21:46
The documentation states: "Several variations of the mbox format exist to address perceived shortcomings in the original. In the interest of compatibility, mbox implements the original format, which is sometimes referred to as mboxo." http://docs.python.org/dev/library/mailbox.html#mailbox.mbox But this format is fundamentally broken, corrupting lines that start with "From ", and I can't find any justification for using it in Python. In fact, all four links included in that section argue against this format. If only one mbox format is used, it should be mboxrd. Otherwise, include support for all the variants, with mboxrd as the default.
msg150479 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-01-02 21:55
Well, supporting the other variants would be good (I'll review any proposed patches), but I think the default will have to stay mboxo for backward compatibility reasons (unless the consensus is to go through the warning/deprecation cycle to change it). As a new feature, this could only go into 3.3 or later.
msg159625 - (view) Author: (endolith) Date: 2012-04-29 16:26
Ok. I'm not sure what backwards compatibility issues would exist, though. The only difference is that mboxrd converts

"\nFrom " → "\n>From "
"\n>From " → "\n>>From "

making the conversion reversible, while mboxo does

"\nFrom " → "\n>From "
"\n>From " → "\n>From " (no change)

which is ambiguous, and both get converted back to "\nFrom " when converting back to text, corrupting the original message. mboxrd is essentially a bugfix for mboxo rather than a fundamentally different format.
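A minimal sketch of the two quoting rules described in this message may help make the difference concrete. The function names are hypothetical and are not part of the mailbox module; the mboxo reader here is assumed to strip one ">" from ">From " lines, as the message describes.

```python
import re

# Hypothetical helpers for illustration only; not part of the mailbox module.
_RD_QUOTE = re.compile(r"^(>*From )", re.MULTILINE)
_RD_UNQUOTE = re.compile(r"^>(>*From )", re.MULTILINE)


def mboxo_quote(body: str) -> str:
    """mboxo: prefix '>' only to lines that start with 'From '."""
    return re.sub(r"^From ", ">From ", body, flags=re.MULTILINE)


def mboxo_unquote(body: str) -> str:
    """Assumed mboxo reader: strip one '>' from lines starting with '>From '."""
    return re.sub(r"^>From ", "From ", body, flags=re.MULTILINE)


def mboxrd_quote(body: str) -> str:
    """mboxrd: prefix '>' to any line matching '>*From '."""
    return _RD_QUOTE.sub(r">\1", body)


def mboxrd_unquote(body: str) -> str:
    """mboxrd: strip one '>' from any line matching '>+From '."""
    return _RD_UNQUOTE.sub(r"\1", body)


original = "From here on\n>From a quoted reply\nplain line\n"

# mboxrd round-trips; mboxo collapses both lines to "From ...".
print(mboxrd_unquote(mboxrd_quote(original)) == original)
print(mboxo_unquote(mboxo_quote(original)) == original)
```

Under these assumptions the mboxrd round trip returns the original text, while the mboxo round trip cannot tell the two lines apart once they are quoted.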
msg159629 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-04-29 16:55
If that's really the only difference we might indeed be able to treat it as a bug fix. I'd have to look at a proposed patch to be sure.
msg163359 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-21 18:51
I'm a little concerned about backwards compatibility. Someone might get upset if extra >'s start appearing in the messages when they read the mailbox contents with an application that uses the mboxo format.

A little analysis on the possible corruptions that happen with these formats:

- When the mailbox is both read and written using the mboxo format, lines starting with "From " are changed to ">From ".

- When the mailbox is both read and written using the mboxrd format, no corruption happens.

- If the mailbox is written using the mboxo format and read using the mboxrd format, lines that were meant to start with ">From " are changed to "From ". So we essentially get a slightly different corruption.

- If the mailbox is written using the mboxrd format and read using the mboxo format, lines that were meant to start with ">From " are changed to ">>From ". This is a new type of corruption.
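The four cases above can be reproduced with a small per-line model. This is only a sketch: the helper names are hypothetical, and it assumes (as the analysis appears to) that an mboxo reader performs no unquoting at all.

```python
import re

# Hypothetical per-line model of the writers and readers; illustration only.

def write_mboxo(line: str) -> str:
    return ">" + line if line.startswith("From ") else line


def read_mboxo(line: str) -> str:
    # Assumption: the mboxo reader returns lines unchanged.
    return line


def write_mboxrd(line: str) -> str:
    return ">" + line if re.match(r">*From ", line) else line


def read_mboxrd(line: str) -> str:
    return line[1:] if re.match(r">+From ", line) else line


WRITERS = {"mboxo": write_mboxo, "mboxrd": write_mboxrd}
READERS = {"mboxo": read_mboxo, "mboxrd": read_mboxrd}


def round_trip(written_as: str, read_as: str, line: str) -> str:
    """Write a body line with one variant and read it back with another."""
    return READERS[read_as](WRITERS[written_as](line))


for written_as in WRITERS:
    for read_as in READERS:
        for line in ("From a", ">From b"):
            result = round_trip(written_as, read_as, line)
            status = "ok" if result == line else "corrupted"
            print(f"write {written_as:6} read {read_as:6} {line!r} -> {result!r} ({status})")
```

In this model only the mboxrd/mboxrd combination leaves both test lines intact.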
msg163904 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-25 06:21
The default mode for reading mbox files should also be modified a bit to maximize the support for different implementations. See #11728. I think we should still use the mboxo format by default when writing, and the "default" format of RFC 4155 when reading. We could then add a "format" parameter to the mbox constructor to alter the writing and/or reading behavior to match a specific mbox format. According to RFC 4155, the best reference for different mbox formats is http://qmail.org./man/man5/mbox.html.
msg163975 - (view) Author: (endolith) Date: 2012-06-25 14:44
> - If the mailbox is written using the mboxrd format and read using
>   the mboxo format, lines that were meant to start with ">From "
>   are changed to ">>From ". This is a new type of corruption.

Well, yes. So the choices are:

mboxrd as default: Sometimes results in corruption
mboxo as default: Always results in corruption

Is there a way to reliably detect the format of the file and produce an error if it seems to be reading it wrong?

If not, maybe just include a function that guesses the format so the correct option can be found easily? If there are consecutive ">" quoted lines, like this, for instance:

>This is the body.
>>From my point of view
>there are 3 lines.

then it was probably encoded with mboxrd? If instead you find:

>This is the body.
>From my point of view
>there are 3 lines.

then it was probably encoded with mboxo?
msg164002 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-06-25 18:59
endolith wrote:
> > - If the mailbox is written using the mboxrd format and read using
> >   the mboxo format, lines that were meant to start with ">From "
> >   are changed to ">>From ". This is a new type of corruption.
>
> Well, yes. So the choices are:
>
> mboxrd as default: Sometimes results in corruption
> mboxo as default: Always results in corruption

I don't think so. Assuming that mboxo (the current default) was used to write the mailbox, both formats sometimes result in corruption.

mboxo as default: "From " lines get written (and subsequently read) as ">From ".

mboxrd as default: ">From " lines were written as ">From " but are read as "From ".

Furthermore, if Python's mailbox module is used to write the mbox file and other software that only supports mboxo (e.g. mutt) is used to read it, having mboxrd as the default would cause ">From " lines to be written as ">>From ". These lines would then be read as ">>From " by the reading software.

So, I'd like to keep the default as is, and add a parameter to change to mboxrd when it's OK for the use case at hand. We should also clearly document that mboxrd is recommended as it never corrupts data if used for both reading and writing.

> Is there a way to reliably detect the format of the file and produce
> an error if it seems to be reading it wrong?
>
> If not, maybe just include a function that guesses the format so the
> correct option can be found easily? If there are consecutive ">"
> quoted lines, like this, for instance:
>
> >This is the body.
> >>From my point of view
> >there are 3 lines.
>
> then it was probably encoded with mboxrd? If instead you find:
>
> >This is the body.
> >From my point of view
> >there are 3 lines.
>
> then it was probably encoded with mboxo?

It's not possible to automatically detect the format. Guessing like you suggested is too fragile. It might work in some situations, but wouldn't work in others.
If it was possible to detect the format by guessing, I'm sure RFC 4155 would mention that, as it aims for the best possible outcome for reading any of the formats, without knowing which format is actually in use.
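
For reference, the guessing heuristic discussed in the last two messages could be sketched roughly as follows. The function name and the voting scheme are hypothetical, and, as noted above, a guess of this kind is fragile and should not be treated as a reliable detector.

```python
import re
from typing import List, Optional

_QUOTE_DEPTH = re.compile(r"^(>*)")
_FROM_LINE = re.compile(r"^(>+)From ")


def guess_mbox_variant(body_lines: List[str]) -> Optional[str]:
    """Hypothetical heuristic: compare the quote depth of each '>From '-style
    line with that of its quoted neighbours. One level deeper hints at
    mboxrd, the same level hints at mboxo. Returns None on no or tied evidence.
    """
    votes = {"mboxo": 0, "mboxrd": 0}
    for i, line in enumerate(body_lines):
        match = _FROM_LINE.match(line)
        if not match:
            continue
        depth = len(match.group(1))
        neighbours = body_lines[max(i - 1, 0):i] + body_lines[i + 1:i + 2]
        for other in neighbours:
            if _FROM_LINE.match(other):
                continue  # another escaped line tells us nothing
            other_depth = len(_QUOTE_DEPTH.match(other).group(1))
            if other_depth == 0:
                continue  # neighbour is not part of a quoted block
            if depth == other_depth + 1:
                votes["mboxrd"] += 1
            elif depth == other_depth:
                votes["mboxo"] += 1
    if votes["mboxo"] == votes["mboxrd"]:
        return None
    return max(votes, key=votes.get)


rd_sample = [">This is the body.", ">>From my point of view", ">there are 3 lines."]
o_sample = [">This is the body.", ">From my point of view", ">there are 3 lines."]
print(guess_mbox_variant(rd_sample), guess_mbox_variant(o_sample))
```

The fragility is easy to trigger: a mailbox written as mboxo that happens to contain a doubly quoted "From " line inside a singly quoted block would be guessed as mboxrd by this sketch.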
History
Date User Action Args
2022-04-11 14:57:25 admin set github: 57907
2012-06-25 18:59:03 petri.lehtinen set messages: +
2012-06-25 14:44:54 endolith set messages: +
2012-06-25 06:21:59 petri.lehtinen set nosy: + barry; messages: +; components: + email
2012-06-21 18:51:48 petri.lehtinen set messages: +
2012-06-21 10:52:19 petri.lehtinen set nosy: + petri.lehtinen; versions: + Python 3.4, - Python 3.3
2012-04-29 16:55:11 r.david.murray set messages: +
2012-04-29 16:26:40 endolith set messages: +
2012-01-02 21:55:03 r.david.murray set versions: - Python 2.7; type: behavior -> enhancement; nosy: + r.david.murray; title: Mailbox module should not use mboxo format -> Mailbox module should support other mbox formats in addition to mboxo; messages: +; stage: needs patch
2012-01-02 21:47:36 endolith set title: Should not use mboxo format -> Mailbox module should not use mboxo format
2012-01-02 21:46:27 endolith create