[Python-Dev] Mailbox module - timings and functionality changes
Guido van Rossum guido at python.org
Tue Jun 29 21:26:31 CEST 2010
It should probably be opened in binary mode. Binary files do have a .readline() method (returning a bytes object), and bytes objects have a .startswith() method. The tell positions computed this way are even compatible with those used by the text file. So you could do it this way:
- open binary stream
- compute TOC by reading through it using .readline() and .tell()
- rewind (don't close)
- wrap the binary stream in a text stream
- use that for the rest of the code
--Guido
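
The steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual mailbox.py code: `build_toc` is a hypothetical helper, the sample data and the latin-1 encoding are made up for the demo, and each message's stop offset is simply taken as the start of the next 'From ' line (the separator handling in the real module differs).

```python
import io

def build_toc(binary_file):
    """Return {key: (start, stop)} byte offsets, one entry per message.

    Hypothetical helper: scans a binary stream with readline()/tell().
    """
    starts, stops = [], []
    binary_file.seek(0)
    while True:
        line_pos = binary_file.tell()      # offset of the line about to be read
        line = binary_file.readline()      # bytes, not str
        if line.startswith(b"From "):
            if len(stops) < len(starts):
                stops.append(line_pos)     # previous message ends here
            starts.append(line_pos)
        elif not line:                     # b"" means end of file
            stops.append(line_pos)
            break
    return dict(enumerate(zip(starts, stops)))

# Made-up two-message mailbox containing one non-ASCII byte (0xE9).
data = (b"From alice@example.com Tue Jun 29 10:00:00 2010\n"
        b"Subject: one\n"
        b"\n"
        b"caf\xe9\n"
        b"\n"
        b"From bob@example.com Tue Jun 29 11:00:00 2010\n"
        b"Subject: two\n"
        b"\n"
        b"body\n")

stream = io.BytesIO(data)                  # stands in for open(path, "rb")
toc = build_toc(stream)                    # compute the TOC on the binary stream

stream.seek(0)                             # rewind, don't close
# Wrap the same binary stream in a text stream; latin-1 is an assumption
# chosen only because it can decode any byte value.
text = io.TextIOWrapper(stream, encoding="latin-1", newline="")

text.seek(toc[1][0])                       # reuse a byte offset from the TOC
second_from_line = text.readline()
```

Because tell() is called on the binary stream, it avoids the decoder bookkeeping that makes TextIOWrapper.tell() slow in the profile quoted below.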
On Tue, Jun 29, 2010 at 10:54 AM, Steve Holden <steve at holdenweb.com> wrote:
> A.M. Kuchling wrote:
>> On Tue, Jun 29, 2010 at 11:40:50AM -0400, Steve Holden wrote:
>>> I will leave the profiler output to speak for itself, since I can find
>>> nothing much to say about it except that there's a hell of a lot of
>>> decoding going on inside mailbox.iterkeys().
>>
>> The problem is actually in _generate_toc(), which is reading through the
>> entire file to figure out where all the 'From' lines that start messages
>> are located. TextIOWrapper()'s tell() method seems to be very slow, so
>> one help is to only call tell() when necessary; patch:
>>
>> -> svn diff Lib/
>> Index: Lib/mailbox.py
>> ===================================================================
>> --- Lib/mailbox.py (revision 82346)
>> +++ Lib/mailbox.py (working copy)
>> @@ -775,13 +775,14 @@
>>          starts, stops = [], []
>>          self.file.seek(0)
>>          while True:
>> -            linepos = self.file.tell()
>>              line = self.file.readline()
>>              if line.startswith('From '):
>> +                linepos = self.file.tell()
>>                  if len(stops) < len(starts):
>>                      stops.append(linepos - len(os.linesep))
>>                  starts.append(linepos)
>>              elif not line:
>> +                linepos = self.file.tell()
>>                  stops.append(linepos)
>>                  break
>>          self.toc = dict(enumerate(zip(starts, stops)))
>>
>> But should mailboxes really be opened in a UTF-8 encoding, or should
>> they be treated as 7-bit text? I'll have to think about this.
>
> Neither! You can't open them as 7-bit text, because real-world email does
> contain bytes whose ordinal value exceeds 127. You can't open them using a
> text encoding because theoretically there might be ASCII headers that
> indicate that parts of the content are in specific character sets or
> encodings.
>
> If only we had a data structure that easily allowed us to manipulate 8-bit
> characters ...
>
> regards
>  Steve
> --
> Steve Holden +1 571 484 6266 +1 800 494 3119
> See Python Video! http://python.mirocommunity.org/
> Holden Web LLC http://www.holdenweb.com/
> UPCOMING EVENTS: http://holdenweb.eventbrite.com/
> "All I want for my birthday is another birthday" - Ian Dury, 1942-2000
-- --Guido van Rossum (python.org/~guido)
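
Steve's point in the quoted message, that mailbox data is neither 7-bit text nor safely decodable with a single text encoding, can be illustrated with a small sketch. The header line below is made up for the example; only standard bytes methods are used.

```python
# Hypothetical header line containing one byte above 127 (0xE9).
raw = b"Subject: caf\xe9 menu\n"

# Treating the line as 7-bit text fails as soon as an 8-bit byte appears.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# A bytes object still supports the operations the scanning code needs,
# without committing to any character set.
is_subject = raw.startswith(b"Subject:")
name, _, value = raw.partition(b": ")
```

Here the undecoded value stays available as bytes, so a later step can pick a character set based on what the message headers declare.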