[Python-Dev] Prefetching on buffered IO files
Guido van Rossum guido at python.org
Tue Sep 28 02:39:45 CEST 2010
On Mon, Sep 27, 2010 at 3:41 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> While trying to solve #3873 (poor performance of pickle on file objects, due to the overhead of calling read() with very small values), it occurred to me that the prefetching facilities offered by BufferedIOBase are not flexible and efficient enough.
I haven't read the whole bug but there seem to be lots of different smaller issues there, right? It seems that one (unfortunate) constraint is that reading pickles cannot use buffered I/O (at least not on a non-seekable file) because the API has been documented to leave the file positioned right after the last byte of the pickled data, right?
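For concreteness, a tiny self-contained example (not taken from the bug report) of why that positioning constraint exists: pickles are routinely written back to back on a single stream, and each load() must leave the stream exactly at the end of the pickle it consumed for the next load() to work.

    import io
    import pickle

    buf = io.BytesIO()
    pickle.dump({"a": 1}, buf)   # two pickles, back to back on one stream
    pickle.dump([2, 3], buf)
    buf.seek(0)

    print(pickle.load(buf))      # {'a': 1}
    print(pickle.load(buf))      # [2, 3]; this only works because the first
                                 # load() stopped exactly at the end of the
                                 # first pickle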
> Indeed, if you use seek() and read(), 1) you limit yourself to seekable files, and 2) performance can be hampered by very bad seek() performance (this is true of GzipFile).
Ow... I've always assumed that seek() is essentially free, because that's how a typical OS kernel implements it. If seek() is bad on GzipFile, how hard would it be to fix this?
How common is the use case where you need to read a gzipped pickle and you need to leave the unzipped stream positioned exactly at the end of the pickle?
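For reference, a rough sketch of the seek()-and-read() pattern in question (the chunk size, the function name and the consume callback are all illustrative, not from the actual patch):

    CHUNK = 4096  # illustrative read-ahead size

    def read_with_rewind(f, consume):
        """Do one large read instead of many tiny read() calls, let `consume`
        use part of the data, then seek back so the stream is positioned
        right after the bytes that were actually consumed."""
        start = f.tell()
        data = f.read(CHUNK)
        used = consume(data)    # number of bytes the unpickler really needed
        f.seek(start + used)    # requires a seekable file; on a read-mode
                                # GzipFile a backward seek rewinds and
                                # re-decompresses from the start of the stream
        return data[:used]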
> If instead you use peek() and read(), the situation is better, but you end up doing multiple copies of data; also, you must call read() to advance the file pointer even though you don't care about the results.
Have you measured how bad the situation is if you do implement it this way?
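For what it's worth, that emulation would look roughly like this (names are illustrative), which is where the throwaway read() and the extra copies come from:

    def prefetch_with_peek(f, skip, minread):
        """Emulate "skip, then read at least minread bytes" using only the
        existing BufferedReader API."""
        if skip:
            f.read(skip)        # bytes are copied out and discarded just to
                                # advance the file pointer
        data = f.peek(minread)  # copy #1: peek() copies out of the internal
                                # buffer, and may return more or fewer bytes
                                # than requested (at most one raw read)
        n = min(len(data), minread)
        f.read(n)               # copy #2: the same bytes are copied out
                                # again, solely to advance the position
        return data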
> So I would propose adding the following method to BufferedIOBase:
>
> prefetch(self, buffer, skip, minread)
>
> Skip `skip` bytes from the stream. Then, try to read at least `minread`
> bytes and write them into `buffer`. The file pointer is advanced by at most
> `skip + minread`, or less if the end of file was reached. The total number
> of bytes written in `buffer` is returned, which can be more than `minread`
> if additional bytes could be prefetched (but, of course, cannot be more
> than `len(buffer)`).
>
> Arguments:
> - `buffer`: a writable buffer (e.g. bytearray)
> - `skip`: number of bytes to skip (must be >= 0)
> - `minread`: number of bytes to read (must be >= 0 and <= len(buffer))
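For concreteness, the contract above can be emulated (roughly, and purely as an illustration; a real implementation would work directly on the internal buffer) on top of the existing read()/readinto()/peek() methods. It is written as a plain function taking `self` only for brevity:

    def prefetch(self, buffer, skip, minread):
        assert skip >= 0 and 0 <= minread <= len(buffer)
        if skip:
            self.read(skip)                          # consume the skipped bytes
        view = memoryview(buffer)
        filled = 0
        while filled < minread:                      # consume at least minread
            n = self.readinto(view[filled:minread])  # bytes into the buffer
            if not n:                                # end of file reached early
                break
            filled += n
        if filled == minread and filled < len(buffer):
            extra = self.peek()[:len(buffer) - filled]
            view[filled:filled + len(extra)] = extra   # already-buffered bytes
            filled += len(extra)                       # are handed out without
                                                       # advancing the position
        return filled            # can exceed minread, but never len(buffer)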
I like the idea of an API that combines seek and read into a mutable buffer. However, the semantics of this call seem really weird: there is no direct relationship between where it leaves the stream position and how much data it reads into the buffer. Can you explain how exactly this will help solve the gzipped pickle performance problem?
> Also, the BufferedIOBase ABC can then provide default implementations of read(), readinto() and peek(), simply by calling prefetch(). (How read1() can fit into the picture is not obvious.)
> What do you think?
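For illustration, two of those defaults could be phrased in terms of prefetch() roughly like this (method bodies shown outside a class for brevity; DEFAULT_BUFFER_SIZE is an assumed constant, and read1() is indeed left aside):

    DEFAULT_BUFFER_SIZE = 8192   # assumed default for the sketch

    def read(self, size):
        # minread == len(buffer): every byte returned has also been consumed
        buf = bytearray(size)
        n = self.prefetch(buf, 0, size)
        return bytes(buf[:n])

    def peek(self, size=0):
        # minread == 0: the position never moves, but whatever could be
        # prefetched is handed back
        buf = bytearray(max(size, DEFAULT_BUFFER_SIZE))
        n = self.prefetch(buf, 0, 0)
        return bytes(buf[:n])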
Move to python-ideas?
-- 
--Guido van Rossum (python.org/~guido)