msg198075 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-09-19 14:22 |
There are some classes in gzip, bz2, lzma, and zipfile modules which implement buffered reader interface. They read chunks of data from underlied file object, decompress it, save in internal buffer, and provide common methods to read from this buffer. Maintaining of duplicated code is cumbersome and error prone. Proposed preliminary patch moves common code into new private class _io2._BufferedReaderMixin. If the proposition will be accepted in general, I'm going to write C version and move it into the io module. Perhaps even then merge it with io.BufferedIOBase. The idea is that all buffered reading functions (read(), read1(), readline(), peek(), etc) can be expressed in the term of one function which returns raw unbuffered data. Subclasses need define only one such function and will got all buffered reader interface. In case of mentioned above classes this functions reads and decompresses a chunk of data from underlied file. The HTTPResponse class perhaps will benefit too (). |
|
|
msg198077 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-09-19 14:33 |
Here are benchmark script and its results. |
|
|
msg198078 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2013-09-19 14:47 |
See for a more flexible primitive. |
|
|
msg198082 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-09-19 15:26 |
This primitive doesn't not well fit a case of compressed streams. A chunk of compressed data read from underlied file object can be uncompressed to unpredictable large data. We can't limit the size of buffer. Another point is that buffer interface is not very appropriate for Python implementation. And we want left as much Python code in gzip, bz2, lzma and zipfile as possible. Copying from bytes into buffer and back will just waste resources. |
|
|
msg198143 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2013-09-20 12:32 |
If you want this, I think it should be somehow folded into existing classes (for example BufferedIOBase). Yet another implementation of readline() isn't really a good idea. |
|
|
msg233747 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2015-01-09 12:22 |
For what it’s worth, it would be better if compressed streams did limit the amount of data they decompressed, so that they are not susceptible to decompression bombs; see Issue 15955. But having a flexible-sized buffer could be useful in other cases. I haven’t looked closely at the code, but I wonder if there is much difference from the existing BufferedReader. Perhaps just that the underlying raw stream in this case can deliver data in arbitrary-sized chunks, but BufferedReader expects its raw stream to deliver data in limited-sized chunks? If you exposed the buffer it could be useful to do many things more efficiently: * readline() with custom newline or end-of-record codes, solving Issue 1152248, Issue 17083 * scan the buffer using string operations or regular expressions etc, e.g. to skip whitespace, read a run of unescaped symbols * tentatively read data to see if a keyword is present, but roll back if the data doesn’t match the keyword |
|
|
msg233806 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2015-01-10 08:57 |
Parts of the patch here actually do the same thing as my LZMAFile patch for Issue 15955. I wish I had looked at the patch earlier! The difference is I used a proposed max_length parameter for the decompressor rather than unlimited decompression, and I used the existing BufferedReader class rather than implementing a new custom one. The changes for the “gzip” module could probably be merged with my GzipFile patch at Issue 15955 and made to use BufferedReader. |
|
|
msg240695 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2015-04-13 18:18 |
Is it still relevant, now that #23529 has been committed? |
|
|
msg241403 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2015-04-18 08:22 |
The LZMA, gzip and bzip modules now all use BufferedReader, so Serhiy’s patch is no longer relevant for them. Serhiy’s patch also changed the zipfile module, which may be still relevant. On the other hand, perhaps it would be more ideal to use BufferedReader for the zipfile module as well. |
|
|