[Python-Dev] Mailbox module - timings and functionality changes
Steve Holden steve at holdenweb.com
Tue Jun 29 23:02:14 CEST 2010
Guido van Rossum wrote:
> It should probably be opened in binary mode. Binary files do have a
> .readline() method (returning a bytes object), and bytes objects have
> a .startswith() method. The tell positions computed this way are even
> compatible with those used by the text file. So you could do it this
> way:
>
> - open binary stream
> - compute TOC by reading through it using .readline() and .tell()
> - rewind (don't close)

Because closing is inefficient, or because it breaks the algorithm?

> - wrap the binary stream in a text stream

"wrap" how? The ultimate destiny of the text is twofold:
- To be stored as some kind of LOB in a database, and
- Therefrom to be reconstituted and parsed into email.Message objects.
Is the wrapping a one-off operation or a software layer? Sorry, being a bit dense here, I know.
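
As a rough illustration of the "wrap" step being asked about: in Python 3's io module, io.TextIOWrapper is an object layered on top of an open binary stream, decoding as it reads, rather than a one-off conversion of the data. A closed stream would have to be reopened before it could be wrapped. The sketch below is only a minimal example under assumptions that are not from the thread: an in-memory io.BytesIO stands in for the mailbox file, and UTF-8 is picked arbitrarily as the encoding.

```python
import io

# Hypothetical stand-in for a mailbox file opened in binary mode.
raw = io.BytesIO("From a@example.com\nSubject: caf\u00e9\n".encode("utf-8"))

# Pass 1: work on bytes directly, as when computing a table of contents.
first = raw.readline()      # a bytes object
raw.seek(0)                 # rewind; the stream stays open

# Pass 2: layer a text stream over the same, still-open binary stream.
# newline="" leaves line endings untranslated.
text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
lines = text.readlines()    # decoded str objects

# The wrapper is only a layer: detach() hands back the underlying
# binary stream, still open, and leaves the wrapper unusable.
underlying = text.detach()
```

Because the wrapper and the binary stream share one underlying file, nothing is copied or converted up front; decoding happens only as text is read through the wrapper.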

regards
 Steve

> - use that for the rest of the code
>
> --Guido
>
> On Tue, Jun 29, 2010 at 10:54 AM, Steve Holden <steve at holdenweb.com> wrote:
>> A.M. Kuchling wrote:
>>> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>>>> I will leave the profiler output to speak for itself, since I can find
>>>> nothing much to say about it except that there's a hell of a lot of
>>>> decoding going on inside mailbox.iterkeys().
>>>
>>> The problem is actually in _generate_toc(), which is reading through
>>> the entire file to figure out where all the 'From' lines that start
>>> messages are located. TextIOWrapper()'s tell() method seems to be
>>> very slow, so one help is to only call tell() when necessary; patch:
>>>
>>> -> svn diff Lib/
>>> Index: Lib/mailbox.py
>>> ===================================================================
>>> --- Lib/mailbox.py (revision 82346)
>>> +++ Lib/mailbox.py (working copy)
>>> @@ -775,13 +775,14 @@
>>>          starts, stops = [], []
>>>          self._file.seek(0)
>>>          while True:
>>> -            line_pos = self._file.tell()
>>>              line = self._file.readline()
>>>              if line.startswith('From '):
>>> +                line_pos = self._file.tell()
>>>                  if len(stops) < len(starts):
>>>                      stops.append(line_pos - len(os.linesep))
>>>                  starts.append(line_pos)
>>>              elif not line:
>>> +                line_pos = self._file.tell()
>>>                  stops.append(line_pos)
>>>                  break
>>>          self._toc = dict(enumerate(zip(starts, stops)))
>>>
>>> But should mailboxes really be opened in a UTF-8 encoding, or should
>>> they be treated as 7-bit text? I'll have to think about this.
>>
>> Neither! You can't open them as 7-bit text, because real-world email
>> does contain bytes whose ordinal value exceeds 127. You can't open them
>> using a text encoding because theoretically there might be ASCII headers
>> that indicate that parts of the content are in specific character sets
>> or encodings.
>>
>> If only we had a data structure that easily allowed us to manipulate
>> 8-bit characters ...
>>
>> regards
>>  Steve

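
The binary-mode scan that Guido outlines could look roughly like the sketch below. It is an illustration only, neither the patch quoted above nor the code in mailbox.py: build_toc, read_message and SAMPLE are hypothetical names, email.message_from_bytes assumes Python 3.2 or later, and, unlike the quoted code, the stop offsets here include the blank separator line instead of subtracting len(os.linesep).

```python
import email
import io


def build_toc(stream):
    """Map message index -> (start, stop) byte offsets in an mbox-style stream."""
    starts, stops = [], []
    stream.seek(0)
    while True:
        pos = stream.tell()        # offset of the line about to be read
        line = stream.readline()   # bytes, since the stream is binary
        if line.startswith(b"From "):
            if len(stops) < len(starts):
                stops.append(pos)  # previous message ends where this one starts
            starts.append(pos)
        elif not line:             # end of file
            if len(stops) < len(starts):
                stops.append(pos)
            break
    return dict(enumerate(zip(starts, stops)))


def read_message(stream, toc, key):
    """Read one message by offset and parse it, skipping the 'From ' line."""
    start, stop = toc[key]
    stream.seek(start)
    raw = stream.read(stop - start)
    _from_line, _, rest = raw.partition(b"\n")
    return email.message_from_bytes(rest)


# Made-up two-message mailbox; the first body contains a byte above 127.
SAMPLE = (
    b"From alice@example.com Tue Jun 29 10:00:00 2010\n"
    b"Subject: first\n"
    b"\n"
    b"caf\xe9 body\n"
    b"\n"
    b"From bob@example.com Tue Jun 29 11:00:00 2010\n"
    b"Subject: second\n"
    b"\n"
    b"second body\n"
)

stream = io.BytesIO(SAMPLE)
toc = build_toc(stream)
subjects = [read_message(stream, toc, key)["Subject"] for key in sorted(toc)]
```

Working on bytes throughout sidesteps both objections raised in the quoted text: no decoding happens during the scan, and bytes above 127 pass through untouched until the email package interprets each message's own headers.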

Steve Holden           +1 571 484 6266   +1 800 494 3119
See Python Video!       http://python.mirocommunity.org/
Holden Web LLC                 http://www.holdenweb.com/
UPCOMING EVENTS:        http://holdenweb.eventbrite.com/
"All I want for my birthday is another birthday" -
     Ian Dury, 1942-2000