Issue 5804: Add an 'offset' argument to zlib.decompress (original) (raw)

Issue5804

Created on 2009-04-21 09:51 by kristjan.jonsson, last changed 2022-04-11 14:56 by admin.

Files
File name Uploaded Description Edit
zlibpatch.patch kristjan.jonsson,2009-04-21 09:51
zlibpatch.patch kristjan.jonsson,2009-04-21 16:13
zlibpatch2.patch kristjan.jonsson,2009-04-23 19:18
Messages (16)
msg86227 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-21 09:51
It can be useful to get whatever remaining data there exists in the decompression stream for further processing. This is currently possible by creating a zlib.decompressobj and accessing its unused_data member, but that is quite cumbersome. The attached patch adds a 'tail' keyword argument to zlib.decompress which causes the result to be a 2-tuple, with the first part containing the decompressed data, and the second containing any remaining data
msg86240 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-21 15:43
As usual the patch should come with some unit tests. Also, it seems you don't test the return value of PyString_FromStringAndSize before building your tuple.
msg86241 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-21 15:46
Unittests coming up. It is not necessary to test the return value, it is done by Py_BuildValue(). I think it is common to do it this way, but if You think explicit testing is better, then that's no problem.
msg86244 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-21 15:57
> It is not necessary to test the return value, it is done by > Py_BuildValue(). Sorry, it's ok then.
msg86247 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-21 16:13
New patch, with explicit error test and tests in test_zlib.py, both for the "tail" parameter and the hitherto untested "unused_data" attribute. I'm not completely happy with the name of the argument, "tail", any suggestions? Also, since I made compress a "keywords" method, should I apply that to the rest of the functions?
msg86284 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-22 10:26
Note, I thought of the "tail" argument to reflect the "unused_data" member of the decompressobj(). But a more useful way is actually this: result, offset = zlib.decompress(buffer, offset=0) if offset<len(buffer): result2, offset = zlib.decompress(buffer, offset=offset) This avoids data copying as much as possible. The presence of a non- default offset argument, would trigger the "tuple" return value. We could add the same argument to decompressobj.decompress, and add "unused_offset" member there for symmetry. Any thoughts?
msg86373 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-04-23 19:18
I've added a new patch, which instead uses an "offset" parameter. This is more natural, perhaps, and also avoids data copying. Complete with tests and documentation update.
msg87550 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-05-10 21:05
Overall, this looks good. Some mostly minor comments in this review. http://codereview.appspot.com/63060/diff/1/2 File Doc/library/zlib.rst (right): http://codereview.appspot.com/63060/diff/1/2#newcode136 Line 136: When specified, it will cause the function's return value to be a (uncompressed, offset) how about this wording: ..."to be a tuple of (uncompressed, offset), with the"... http://codereview.appspot.com/63060/diff/1/2#newcode142 Line 142: Added the *offset* argument missing . at the end of the sentence. Also, mention that this function now accepts keyword arguments. http://codereview.appspot.com/63060/diff/1/3 File Lib/test/test_zlib.py (right): http://codereview.appspot.com/63060/diff/1/3#newcode110 Line 110: c = zlib.compress(a)+zlib.compress(b) please surround the + with spaces for readability and consistency with other code in the file.\ http://codereview.appspot.com/63060/diff/1/4 File Modules/zlibmodule.c (right): http://codereview.appspot.com/63060/diff/1/4#newcode193 Line 193: "the start position in the string to start decompresion and, if spceified,\n" typo: specified. http://codereview.appspot.com/63060/diff/1/4#newcode296 Line 296: offset = length-zst.avail_in+offset; leave spaces around your - and + operators and this becomes easier to read. http://codereview.appspot.com/63060
msg87551 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2009-05-10 21:14
Maybe I'm missing something, but isn't the offset parameter just another way of spelling buffer(input, offset)? I like the avoiding of copying, just wondering if having a magic parameter to get a tuple is really better than (say) result, used = zlib.decompress2(source) if used<len(source): result2, used = zlib.decompress2(buffer(source, used)) This changes two things. Rather than a magic parameter it adds a new function which always returns tuples (meaning that you don't need magic to make sure users don't accidentally pass offset=None and get a single result when they wanted a tuple), and uses the built in buffer facility to avoid copying rather than relying on delayed slicing.
msg87567 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2009-05-11 08:56
Well, yes and no. I agree that it is annoying to have to add another parameter to get a returning tuple. But you have to weigh that against adding another conveinence API. On the other hand, I think the offset argument is quite convenient. If we were to add a "decompress2" function, I would like to keep the offset because it removes the extra overhead of having to create a temporary buffer object, at no extra cost. Also, note that your example code becomes more complex, for a loop. You would need to: results = [] used =0 while used<len(source): r, u = zlib.decompress2(buffer(source, used)) used += u results.append(r) as opposed to: r, used = zlib.decompress2(source, used) results.append(r) You have to make sure you add the buffer's hidden offset to the output offset. There is a third option. The "offset" could be a list containg the offset in its 0th position, a sort of "byref" passing, but that is not in the general Python spirit, is it? This would give us the loop: o = [0] while len(source)>o[0] results.append(zlib.decompress(source, offset=offsetlist))
msg87568 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2009-05-11 09:02
Well, I think its relatively uncommon to be doing such a loop with a static buffer anyway - often you'll instead be reading from disk or a network stream; if we could make those cases simpler and avoid copying that would be great. Anyhow, no strong opinions here, just saw it going by before added my 2c. -Rob
msg110993 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-21 02:47
This has been reviewed see , can we get this into 3.2?
msg111017 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-21 07:25
The function returns different kind of data depending on the value of the last parameter. I don't like it at all. Why is decompressobj not enough?
msg113415 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2010-08-09 13:53
decompressobj is indeed "enough". But if you are doing a lot of this (decompressing chunks), then using the unused_data, as it is, involves a lot of copying. If there were a "unused_data_pos" or some equivalent, then it would be possible to continue decommpressing with buffer(olddata, unused_data_pos), without copying the source data. The point of the "offset" keyword argument was to have a way to get this end position also without having an explicit decompressobj. I agree that having the result type change to a tuple is not good. So, a suggestion: 1) Add the unused_data_pos to the decompressobj. 2) (opotional) Add the automatic retrieval of this with a decompress_ex method, for convenience, making it usable without creating a decompressobj 3) (optional) Add the "offset" kw argument to decompress and (decompress_ex) making the creation of a temporary buffer() object unnecessary.
msg229045 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-10-10 23:24
Moving this from commit review back to no selection, since there doesn't yet seem to be an agreement on an API.
msg234040 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-01-14 22:27
A different test case for “unused_data” attribute was added in 2012 for Issue 16350, so that part is no longer needed. If this feature goes ahead, it might be nice to also update the bzip and LZMA modules for consistency. In Python 3, the equivalent of the buffer() option would look like this, assuming the memory view is in byte format: zlib.decompress(memoryview(source)[unused_offset:]) Another option would be to copy some paint from the struct.unpack_from() bikeshed: [data, offset] = zlib.decompress_from(buffer, offset)
History
Date User Action Args
2022-04-11 14:56:48 admin set github: 50054
2015-01-14 22:27:13 martin.panter set nosy: + martin.pantermessages: +
2014-10-10 23:24:51 r.david.murray set versions: + Python 3.5, - Python 3.2nosy: + r.david.murraymessages: + stage: commit review ->
2014-02-03 19:14:31 BreamoreBoy set nosy: - BreamoreBoy
2012-01-26 13:07:45 nadeem.vawda set nosy: + nadeem.vawda
2010-08-09 13:53:55 kristjan.jonsson set messages: +
2010-07-21 07:25:33 amaury.forgeotdarc set nosy: + amaury.forgeotdarcmessages: +
2010-07-21 02:47:37 BreamoreBoy set versions: + Python 3.2, - Python 3.1, Python 2.7nosy: + BreamoreBoymessages: + stage: patch review -> commit review
2009-05-11 09:02:04 rbcollins set messages: +
2009-05-11 08:56:24 kristjan.jonsson set messages: +
2009-05-10 21:14:40 rbcollins set nosy: + rbcollinsmessages: +
2009-05-10 21:05:29 gregory.p.smith set messages: +
2009-05-10 20:49:45 gregory.p.smith set nosy: + gregory.p.smith
2009-05-08 09:45:40 kristjan.jonsson set title: Add a "tail" argument to zlib.decompress -> Add an 'offset' argument to zlib.decompress
2009-04-23 19🔞40 kristjan.jonsson set files: + zlibpatch2.patchmessages: +
2009-04-22 10:26:02 kristjan.jonsson set messages: +
2009-04-21 16:13:14 kristjan.jonsson set files: + zlibpatch.patchmessages: +
2009-04-21 15:57:20 pitrou set messages: +
2009-04-21 15:46:42 kristjan.jonsson set messages: +
2009-04-21 15:43:15 pitrou set priority: normalversions: + Python 3.1nosy: + pitroumessages: + stage: patch review
2009-04-21 09:51:41 kristjan.jonsson create