Issue 9971: Optimize BufferedReader.readinto - Python tracker
Created on 2010-09-28 15:19 by stutzbach, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (30)
Author: Daniel Stutzbach (stutzbach)
Date: 2010-09-28 15:19
The readinto() method is intended to offer better performance than read() by allowing the caller to read into a preallocated buffer rather than constantly allocating and deallocating buffers.
However, bufferediobase_readinto() calls read(), so the extra allocations and deallocations happen anyway. On a related note, buffered_readinto() has a comment reading "TODO: use raw.readinto() instead!" which should be explored.
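For context, a minimal sketch of the intended usage difference (illustrative only; 'data.bin' and process() are placeholders, not part of this issue):

    def process(view):
        pass  # placeholder for real work on each chunk

    buf = bytearray(64 * 1024)

    # read(): a fresh bytes object is allocated for every chunk.
    with open('data.bin', 'rb') as f:
        while True:
            chunk = f.read(len(buf))
            if not chunk:
                break
            process(chunk)

    # readinto(): the same preallocated buffer is reused for every chunk.
    with open('data.bin', 'rb') as f:
        while True:
            n = f.readinto(buf)
            if not n:
                break
            process(memoryview(buf)[:n])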
I can write a patch for this, but it will probably be a while before I get to it. Anyone else who feels inspired should feel free to write one. :-)
Author: John O'Connor (jcon)
Date: 2011-05-04 23:06
I am new to the community but hoping to start contributing or at least following issues and learning :)
I'm looking at bufferediobase_readinto(). What I haven't yet figured out is why .readinto() is (currently) implemented at this layer of the hierarchy. You have to have a raw read buffer available to read from and I'm not sure how one would acquire that from here (without calling .read() or something that has been overridden and knows about the raw buffer).
I feel like bufferediobase_readinto() should return unsupported. Also, readinto() is, in theory, lower level than read(); if read() isn't implemented at this layer, why is readinto()?
With a little direction, I would be interested in helping with a patch.
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2011-05-04 23:09
> I'm looking at bufferediobase_readinto(). What I haven't yet figured out is why .readinto() is (currently) implemented at this layer of the hierarchy. You have to have a raw read buffer available to read from and I'm not sure how one would acquire that from here (without calling .read() or something that has been overridden and knows about the raw buffer).
Why is that? You can, as the BufferedIOBase implementation does, just call read() and stick it into the buffer.
> I feel like bufferediobase_readinto() should return unsupported. Also, readinto() is, in theory, lower level than read(); if read() isn't implemented at this layer, why is readinto()?
To provide a simple implementation for unsophisticated subclasses.
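For instance (an illustrative sketch, not code from this issue), a subclass that only defines read() still gets a working readinto() from BufferedIOBase, because the base implementation just calls read() and copies the result, roughly along the lines of b[:n] = self.read(len(b)):

    import io

    class PayloadReader(io.BufferedIOBase):
        """Unsophisticated subclass: only read() is overridden."""
        def __init__(self, payload):
            self._payload = payload
        def readable(self):
            return True
        def read(self, size=-1):
            if size is None or size < 0:
                size = len(self._payload)
            data, self._payload = self._payload[:size], self._payload[size:]
            return data

    r = PayloadReader(b'hello world')
    buf = bytearray(5)
    print(r.readinto(buf), buf)   # -> 5 bytearray(b'hello')
    # The inherited readinto() allocated a temporary bytes object via read(),
    # which is exactly the overhead this issue is about.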
Author: Daniel Stutzbach (stutzbach)
Date: 2011-05-05 00:35
Looking at this again, I agree with John. For BufferedIOBase, read() is abstract while readinto() is concrete. That seems backward, and, indeed, it's the opposite of RawIOBase, where readinto() is abstract and read() is concrete.
Unfortunately, this code has shipped, so changing which methods are abstract may not be practical. On the other hand, the documentation doesn't mention which methods are abstract versus concrete.
All of that said, we can freely change the C implementation of BufferedReader, which is a concrete class. That would allow us to cut out the extra allocations/deallocations, even if we can't clean up the abstract vs concrete method issue. Basically, this would require greatly expanding buffered_readinto() in bufferedio.c to use _bufferedreader_raw_read() and related functions.
As I think about this more... I'm not sure how much performance there is to gain here in practice. It seems like any time I'd want to use readinto(), it's because I want to do my own buffering, in which case why would I use a BufferedReader? I'm thinking that BufferedIOBase only provides a readinto() method for completeness, so it can be used as a drop-in replacement for an unbuffered file object.
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2011-05-05 04:46
I don't see the problem. You're free to override readinto() and read() in subclasses. readinto() is just implemented in BufferedIOBase as a convenience.
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-05 19:06
> As I think about this more... I'm not sure how much performance there is to gain here in practice. It seems like any time I'd want to use readinto(), it's because I want to do my own buffering, in which case why would I use a BufferedReader?
The difference between buffered I/O and raw I/O isn't only the presence of a buffer. For example read(N) on a buffered object will loop until N bytes are satisfied (or EOF is reached), while read(N) on a raw I/O object will give you whatever the system call returns.
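A small illustration of that difference (a sketch only, using a socketpair because a short read is easy to provoke there; none of this is from the patch):

    import socket, threading, time

    def read_20_bytes(buffering):
        a, b = socket.socketpair()
        def producer():
            b.sendall(b'x' * 10)
            time.sleep(0.1)
            b.sendall(b'y' * 10)
            b.close()
        threading.Thread(target=producer).start()
        f = a.makefile('rb', buffering=buffering)
        n = len(f.read(20))
        f.close()
        a.close()
        return n

    print(read_20_bytes(0))    # raw I/O: typically 10 -- one recv() worth of data
    print(read_20_bytes(-1))   # buffered: 20 -- read() loops until satisfied or EOF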
I would add that third-party buffering would be better helped by the prefetch() method I proposed on python-ideas: http://mail.python.org/pipermail/python-ideas/2010-September/008179.html
readinto() is really about minimizing copies and allocations when you already have storage allocated for the result. (note that TextIOWrapper could have used readinto(), but instead opted for the slightly useless read1())
Author: John O'Connor (jcon)
Date: 2011-05-05 20:58
Attached is a draft patch for buffered_readinto(); patchcheck removed some whitespace as well.
Daniel, I agree; BufferedReader.readinto seemingly defeats the purpose of using a BufferedReader to begin with. Though, given the difference Antoine pointed out, it makes sense (more of an accumulator).
Interesting thread about prefetch(). +1
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-05 21:28
Thank you for the patch. A couple of comments:
- The whole slow path (which loops calling _bufferedreader_raw_read()) should be surrounded by calls to ENTER_BUFFERED() and LEAVE_BUFFERED().
- Other things:

    if (!PyArg_ParseTuple(args, "O:readinto", &buffer))
        return NULL;
    if (PyObject_GetBuffer(buffer, &buf, PyBUF_CONTIG) == -1)
        return NULL;

  You should use the "w*" typecode instead (see the readinto implementation in Modules/_io/fileio.c for an example).

    if (n == -2) {
        Py_INCREF(Py_None);
        res = Py_None;
    }

  If written is > 0, this should return the number of bytes instead (mimicking _bufferedreader_read_generic). Adding a test for that would be nice too (in Lib/test/test_io.py).
Author: John O'Connor (jcon)
Date: 2011-05-05 22:41
Thanks for the feedback. I made the changes, PTAL.
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-06 15:43
    if (n <= 0) {
        if (written > 0 || n == 0)
            break;
        if (n == -2) {
            Py_INCREF(Py_None);
            res = Py_None;
        }
        goto end;
    }
I think the logic is flawed: if n = -1 and written is > 0, the error will be silenced. Other than that, the patch looks good to me.
Author: John O'Connor (jcon)
Date: 2011-05-06 17:04
Good catch. v3 attached.
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-07 09:01
Oops... It hadn't jumped out at me earlier, but the patch is actually problematic performance-wise. The reason is that it doesn't buffer data at all, so small readintos become slower (they have to go through raw I/O every time):
$ ./python -m timeit -s "f=open('LICENSE', 'rb'); b = bytearray(4)"
"f.seek(0)" "while f.readinto(b): pass"
-> without patch: 2.53 msec per loop
-> with patch: 3.37 msec per loop
$ ./python -m timeit -s "f=open('LICENSE', 'rb'); b = bytearray(128)"
"f.seek(0)" "while f.readinto(b): pass"
-> without patch: 90.3 usec per loop
-> with patch: 103 usec per loop
The patch does make large reads faster, as expected:
$ ./python -m timeit -s "f=open('LICENSE', 'rb'); b = bytearray(4096)"
"f.seek(0)" "while f.readinto(b): pass"
-> without patch: 13.2 usec per loop
-> with patch: 6.71 usec per loop
(that's a good reminder for the future: when optimizing something, always try to measure the "improvement" :-))
One solution would be to refactor _bufferedreader_read_generic() to take an existing buffer, and use that.
Author: John O'Connor (jcon)
Date: 2011-05-07 21:43
It seems to me that the point of using readinto() is to avoid double-buffering (and the extra alloc/free that comes with read()). The slowness of using a small buffer size seems like only a natural and expected consequence.
The trade-off of accommodating a small buffer size (by buffering behind the scenes anyway) would likely slow the more common cases which use a decent buffer size. I am wondering if an effort to accommodate both uses would be appropriate: possibly by not double-buffering when, for readinto(b), len(b) > buffer_size/2 (arbitrary, but seems feasible), and copying directly as the patch does now; otherwise, fill the buffer up for subsequent reads and copy len(b) to the user buffer. There is probably a good equilibrium for when it makes more/less sense to bypass the internal buffer.
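A rough sketch of that dispatch, in Python for clarity (the class and helper names here are made up for illustration and are not the patch's):

    import io

    class ThresholdReader:
        """Bypass the internal buffer for large readintos, double-buffer small ones."""
        def __init__(self, raw, buffer_size=8192):
            self.raw = raw                  # underlying raw I/O object
            self.buffer_size = buffer_size
            self._buf = b''                 # internal read-ahead buffer

        def readinto(self, b):
            out = memoryview(b)
            # Drain whatever is already buffered.
            n = min(len(out), len(self._buf))
            out[:n] = self._buf[:n]
            self._buf = self._buf[n:]
            remaining = len(out) - n
            if remaining == 0:
                return n
            if remaining > self.buffer_size // 2:
                # Large request: read straight into the caller's memory.
                return n + (self.raw.readinto(out[n:]) or 0)
            # Small request: refill the internal buffer, then copy out.
            self._buf = self.raw.read(self.buffer_size) or b''
            m = min(remaining, len(self._buf))
            out[n:n + m] = self._buf[:m]
            self._buf = self._buf[m:]
            return n + m

    reader = ThresholdReader(io.BytesIO(b'x' * 100000), buffer_size=4096)
    dest = bytearray(10000)
    print(reader.readinto(dest))   # -> 10000, served directly from the raw object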
Changing _bufferedreader_read_generic() would require some of the other helpers to be changed as well but it may be the way to go.
Let me know what you think of the above. I will experiment a bit. Unfortunately I am a bit busy this weekend but cannot wait to work on this again.
Author: John O'Connor (jcon)
Date: 2011-05-07 23:50
FWIW, it seems Java does something similar: it writes directly into the caller's buffer if the number of outstanding bytes needed (after emptying the internal buffer) is greater than the internal buffer length.
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-08 15:49
> The trade-off of accommodating a small buffer size (by buffering behind the scenes anyway) would likely slow the more common cases which use a decent buffer size. I am wondering if an effort to accommodate both uses would be appropriate: possibly by not double-buffering when, for readinto(b), len(b) > buffer_size/2 (arbitrary, but seems feasible), and copying directly as the patch does now; otherwise, fill the buffer up for subsequent reads and copy len(b) to the user buffer. There is probably a good equilibrium for when it makes more/less sense to bypass the internal buffer.
Yes, it sounds reasonable. I think the best thing to do is to experiment and run some measurements.
Author: John O'Connor (jcon)
Date: 2011-05-08 23:13
I experimented with a bunch of different options.
All benchmarks performed with:
$ for i in 1 4 128 256 1024 2048 4069 8192; do
echo -n "buffer_size=${i} ";
./python -m timeit -s "f=open('LICENSE','rb');b=bytearray(${i})"
"f.seek(0)" "while f.readinto(b): pass";
done
with io.DEFAULT_BUFFER_SIZE = 8192
Before patch
buffer_size=1 100 loops, best of 3: 10.4 msec per loop
buffer_size=4 100 loops, best of 3: 2.67 msec per loop
buffer_size=128 10000 loops, best of 3: 102 usec per loop
buffer_size=256 10000 loops, best of 3: 54.9 usec per loop
buffer_size=1024 10000 loops, best of 3: 26.9 usec per loop
buffer_size=2048 10000 loops, best of 3: 20.3 usec per loop
buffer_size=4069 100000 loops, best of 3: 16.3 usec per loop
buffer_size=8192 100000 loops, best of 3: 11.1 usec per loop
Always read into caller's buffer
buffer_size=1 100 loops, best of 3: 14 msec per loop
buffer_size=4 100 loops, best of 3: 4.02 msec per loop
buffer_size=128 10000 loops, best of 3: 114 usec per loop
buffer_size=256 10000 loops, best of 3: 63.7 usec per loop
buffer_size=1024 100000 loops, best of 3: 19.4 usec per loop
buffer_size=2048 100000 loops, best of 3: 11.2 usec per loop
buffer_size=4069 100000 loops, best of 3: 8.12 usec per loop
buffer_size=8192 100000 loops, best of 3: 5.79 usec per loop
Read into caller's buffer if Java-like bound of internal buffer size is smaller
buffer_size=1 100 loops, best of 3: 5.01 msec per loop
buffer_size=4 1000 loops, best of 3: 1.27 msec per loop
buffer_size=128 10000 loops, best of 3: 46.9 usec per loop
buffer_size=256 10000 loops, best of 3: 29.5 usec per loop
buffer_size=1024 100000 loops, best of 3: 12.9 usec per loop
buffer_size=2048 100000 loops, best of 3: 10.8 usec per loop
buffer_size=4069 100000 loops, best of 3: 9.06 usec per loop
buffer_size=8192 100000 loops, best of 3: 5.78 usec per loop
Using bound = buffer_size / 2
buffer_size=1 100 loops, best of 3: 6.04 msec per loop
buffer_size=4 1000 loops, best of 3: 1.34 msec per loop
buffer_size=128 10000 loops, best of 3: 49.4 usec per loop
buffer_size=256 10000 loops, best of 3: 29.5 usec per loop
buffer_size=1024 100000 loops, best of 3: 13.2 usec per loop
buffer_size=2048 100000 loops, best of 3: 10.7 usec per loop
buffer_size=4069 100000 loops, best of 3: 8.66 usec per loop
buffer_size=8192 100000 loops, best of 3: 6.1 usec per loop
Using bound = buffer_size / 4
buffer_size=1 100 loops, best of 3: 5.45 msec per loop
buffer_size=4 1000 loops, best of 3: 1.34 msec per loop
buffer_size=128 10000 loops, best of 3: 49.6 usec per loop
buffer_size=256 10000 loops, best of 3: 28.8 usec per loop
buffer_size=1024 100000 loops, best of 3: 13.1 usec per loop
buffer_size=2048 100000 loops, best of 3: 12.8 usec per loop
buffer_size=4069 100000 loops, best of 3: 8.42 usec per loop
buffer_size=8192 100000 loops, best of 3: 5.93 usec per loop
Always use internal buffer
buffer_size=1 100 loops, best of 3: 4.53 msec per loop
buffer_size=4 1000 loops, best of 3: 1.14 msec per loop
buffer_size=128 10000 loops, best of 3: 45 usec per loop
buffer_size=256 10000 loops, best of 3: 26.9 usec per loop
buffer_size=1024 100000 loops, best of 3: 12.9 usec per loop
buffer_size=2048 100000 loops, best of 3: 10.2 usec per loop
buffer_size=4069 100000 loops, best of 3: 9.15 usec per loop
buffer_size=8192 100000 loops, best of 3: 8.42 usec per loop
Use read() instead
$ for i in 1 4 128 256 1024 2048 4069 8192; do
echo -n "size=${i} ";
./python -m timeit -s "f=open('LICENSE','rb');" "f.seek(0)"
"while f.read(${i}): pass";
done
size=1 100 loops, best of 3: 2.56 msec per loop
size=4 1000 loops, best of 3: 709 usec per loop
size=128 10000 loops, best of 3: 29.7 usec per loop
size=256 100000 loops, best of 3: 18.6 usec per loop
size=1024 100000 loops, best of 3: 13.3 usec per loop
size=2048 100000 loops, best of 3: 10.8 usec per loop
size=4069 100000 loops, best of 3: 10.1 usec per loop
size=8192 100000 loops, best of 3: 6.41 usec per loop
Based on the above I think you are right about using the internal buffer regardless (revision attached). You pay a price with larger buffer sizes, but on balance it seems to be a much better general-purpose solution. The Java-like solution is decent as well; it is only slightly slower for small reads and optimal for larger buffer sizes. Personally, I would opt for the performance with larger buffer sizes, but I can only speculate as to what the general user would do. We could always note the trade-off in the docs.
Interestingly enough it seems like using read() is a better idea if you use a smaller size (how is that right?).
I attached the aforementioned revisions.
Author: STINNER Victor (vstinner) *
Date: 2011-05-08 23:19
It would be nice if you could also patch _pyio. I sometimes read _pyio to check how _io is implemented.
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-08 23:21
> It would be nice if you could also patch _pyio. I sometimes read _pyio to check how _io is implemented.
Well, _pyio implements the same APIs but it doesn't necessarily use the same algorithms.
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-10 09:26
> Based on the above I think you are right about using the internal buffer regardless (revision attached). You pay a price with larger buffer sizes, but on balance it seems to be a much better general-purpose solution. The Java-like solution is decent as well; it is only slightly slower for small reads and optimal for larger buffer sizes.
Thanks for the measurements! The way the loop is written is a bit quirky; can't the flow be made more sequential:
    if (n > 0) /* short circuit */
        goto drain;
    if (n == 0 || (n == -2 && written > 0))
        break;
    if (n == -2) {
        Py_INCREF(Py_None);
        res = Py_None;
    }
    goto end;
Also, it seems you don't have the unlocked fast path for small reads anymore (when they can be serviced in whole from the buffer).
> Interestingly enough it seems like using read() is a better idea if you use a smaller size (how is that right?).
That's interesting indeed. Perhaps that is due to the overhead of getting a Py_buffer for the readinto() argument. It would be nice to optimize it, but it's probably outside the scope of this issue.
(it also shows that our allocator does quite well with small objects)
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-10 09:36
By the way, the Java-like version actually seems quite interesting.
Author: STINNER Victor (vstinner) *
Date: 2011-05-10 11:01
    Py_SAFE_DOWNCAST(READAHEAD(self), Py_off_t, Py_ssize_t)
Why downcast the size? Can't you store the size in a Py_off_t? I suppose that sizeof(Py_off_t) >= sizeof(Py_ssize_t).
Author: John O'Connor (jcon)
Date: 2011-05-10 19:06
Victor: AFAIK it's not actually downcasting. The safe downcast just uses an assertion when debugging is enabled. I chose to use it because it seems to be a convention in the file.
Antoine: You say quirky, I say elegant :) Though I have no problem changing it. Also, I did think about leaving the fast path the way it was; I thought what I have now might be simpler/more readable. On second thought, I will put that specific code in front of the lock.
I do feel, for some fundamental reason, that readinto() should always be faster than read(). I may look into the buffer argument overhead you mention.
Also, while we're at it, would it be worthwhile for me to make a patch for the prefetch() method you proposed? Should a separate issue be created for that? I know there was no definitive answer in the email thread, but it may be fun to experiment with as well.
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-10 19:18
> Also, while we're at it, would it be worthwhile for me to make a patch for the prefetch() method you proposed? Should a separate issue be created for that? I know there was no definitive answer in the email thread, but it may be fun to experiment with as well.
It would be worthwhile and I am certainly in favour of it, although I can't guarantee that it will go in in the end. (Yes, a separate issue should definitely be created.)
Author: John O'Connor (jcon)
Date: 2011-05-10 19:38
No problem for me either way.
I created a separate issue to track that.
Author: STINNER Victor (vstinner) *
Date: 2011-05-11 08:51
> Victor: AFAIK it's not actually downcasting.
On 32-bit Linux, size_t is 32 bits and off_t is 64 bits. If the file size is 4 GB, the downcast may truncate the size to 0 bytes. It would be safer to use the off_t type for the n variable in buffered_readinto(), and maybe cast to size_t on the call to memcpy. At memcpy it is safe, because the maximum possible value of n is PY_SSIZE_T_MAX (2^31-1 on a 32-bit system), which fits in a size_t.
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-11 10:05
> On 32-bit Linux, size_t is 32 bits and off_t is 64 bits. If the file size is 4 GB, the downcast may truncate the size to 0 bytes.
We are not talking about the file size here.
Author: John O'Connor (jcon)
Date: 2011-05-11 20:59
I've attached the latest changes based on feedback (-v5.patch)
$ for i in 1 4 128 256 1024 2048 4069 8192 16384; do
echo -n "buffer_size=$i ";
./python -m timeit -s "f=open('LICENSE','rb');b=bytearray($i)" "f.seek(0)" "while f.readinto(b): pass";
done
buffer_size=1 100 loops, best of 3: 3.96 msec per loop
buffer_size=4 1000 loops, best of 3: 1.12 msec per loop
buffer_size=128 10000 loops, best of 3: 40.1 usec per loop
buffer_size=256 10000 loops, best of 3: 24.1 usec per loop
buffer_size=1024 100000 loops, best of 3: 12.2 usec per loop
buffer_size=2048 100000 loops, best of 3: 10.4 usec per loop
buffer_size=4069 100000 loops, best of 3: 9.52 usec per loop
buffer_size=8192 100000 loops, best of 3: 6.04 usec per loop
buffer_size=16384 100000 loops, best of 3: 4.8 usec per loop
Author: Roundup Robot (python-dev)
Date: 2011-05-11 23:59
New changeset a1d77c6f4ec1 by Antoine Pitrou in branch 'default': Issue #9971: Write an optimized implementation of BufferedReader.readinto(). http://hg.python.org/cpython/rev/a1d77c6f4ec1
Author: Antoine Pitrou (pitrou) *
Date: 2011-05-12 00:02
I've committed a minimally modified version of the patch, thank you!
Author: STINNER Victor (vstinner) *
Date: 2011-05-12 08:37
You don't want to backport the optimization to at least 3.2?
History
Date                 User               Action  Args
2022-04-11 14:57:07  admin              set     github: 54180
2011-05-12 08:37:51  vstinner           set     messages: +
2011-05-12 00:02:33  pitrou             set     status: open -> closed; resolution: fixed; messages: +; stage: patch review -> resolved
2011-05-11 23:59:50  python-dev         set     nosy: + python-dev; messages: +
2011-05-11 20:59:49  jcon               set     files: + issue9971-v5.patch; messages: +
2011-05-11 10:05:07  pitrou             set     messages: +
2011-05-11 08:51:40  vstinner           set     messages: +
2011-05-10 20:39:27  jcon               set     files: - unnamed
2011-05-10 19:38:53  jcon               set     files: + unnamed; messages: +
2011-05-10 19:18:51  pitrou             set     messages: +
2011-05-10 19:06:28  jcon               set     messages: +
2011-05-10 11:01:50  vstinner           set     messages: +
2011-05-10 09:36:04  pitrou             set     messages: +
2011-05-10 09:26:12  pitrou             set     messages: +
2011-05-10 02:32:01  nirai              set     nosy: + nirai
2011-05-08 23:21:40  pitrou             set     messages: +
2011-05-08 23:19:43  vstinner           set     nosy: + vstinner; messages: +
2011-05-08 23:13:46  jcon               set     files: + issue9971-like-java.patch
2011-05-08 23:13:05  jcon               set     files: + issue9971.patch; messages: +
2011-05-08 15:49:23  pitrou             set     messages: +
2011-05-08 00:19:47  jcon               set     files: - buffered_readinto2.patch
2011-05-08 00:19:34  jcon               set     files: - buffered_readinto.patch
2011-05-07 23:50:00  jcon               set     messages: +
2011-05-07 21:43:43  jcon               set     messages: +
2011-05-07 10:00:44  pitrou             set     messages: -
2011-05-07 10:00:39  pitrou             set     messages: -
2011-05-07 10:00:29  pitrou             set     messages: +
2011-05-07 09:23:37  pitrou             set     messages: +
2011-05-07 09:01:26  pitrou             set     messages: +; stage: needs patch -> patch review
2011-05-06 17:04:32  jcon               set     files: + buffered_readinto.patch; messages: +
2011-05-06 15:43:24  pitrou             set     messages: +
2011-05-05 22:41:46  jcon               set     files: + buffered_readinto2.patch; messages: +
2011-05-05 21:28:34  pitrou             set     messages: +
2011-05-05 20:58:03  jcon               set     files: + buffered_readinto.patch; keywords: + patch; messages: +
2011-05-05 19:06:14  pitrou             set     messages: +
2011-05-05 04:46:05  benjamin.peterson  set     messages: +
2011-05-05 00:35:34  stutzbach          set     messages: +
2011-05-04 23:09:55  benjamin.peterson  set     messages: +
2011-05-04 23:06:17  jcon               set     nosy: + jcon; messages: +
2011-05-02 19:54:08  pitrou             set     versions: + Python 3.3, - Python 3.2
2010-09-29 17:56:40  daniel.urban       set     nosy: + daniel.urban
2010-09-28 15:19:01  stutzbach          create