msg266747 - (view) |
Author: Vladimir Mihailenco (Vladimir Mihailenco) |
Date: 2016-05-31 09:41 |
Consider following code: ``` import zlib hello = b'hello' # Compress data using deflate and shared/predefined dict. co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello) data = co.compress(hello) + co.flush() # Decompress deflate by providing same dict. do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello) data = do.decompress(data) print(data) ``` Decompression fails with "zlib.error: Error -3 while decompressing data: invalid distance too far back". My original task was to decompress data I get from Golang library - https://golang.org/pkg/compress/flate/#NewWriterDict |
|
|
msg266804 - (view) |
Author: Xiang Zhang (xiang.zhang) *  |
Date: 2016-06-01 07:48 |
This seems related with the difference between zlib format and raw format. When you do raw inflate, you have to inflateSetDictionary before any inflate. You cannot rely on the first inflate to return Z_NEED_DICT and then do inflateSetDictionary. Although I have the solution with experiment, but there is nothing can be found in the official zlib documentation. It is rather vague. I'll make more research and submit a patch later.. Right now, you can use zlib format (positive wbits) if it can achieve your goal. It's behaviour is right. |
|
|
msg266814 - (view) |
Author: Xiang Zhang (xiang.zhang) *  |
Date: 2016-06-01 11:02 |
I think my conclusion in last message is right. A similar bug in CPAN can be found in https://rt.cpan.org/Public/Bug/Display.html?id=36046. And I can find in zlib's changelog inflateSetDictionary starts to support raw inflate from 1.2.2.1. Patch attached now allows Python to decompress raw deflate format with a dictionary. Hope some core devs are willing to review. |
|
|
msg266819 - (view) |
Author: Xiang Zhang (xiang.zhang) *  |
Date: 2016-06-01 14:59 |
Forget to add test. Upload a new one with test included. |
|
|
msg266871 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2016-06-02 03:59 |
Thanks for the patches. Maybe Gregory knows more about this than I do (I never used zdict). But I think I have figured it out today, so I left some review comments :) I’m not sure whether this really is a bug fix, although it would have little if any impact on existing code. It seems zdict was implemented with just the non-raw zlib format in mind, so supporting raw deflate mode is a new feature. If we were to add full support of inflateSetDictionary() for raw deflate mode, I imagine it would be a new decompressobj.set_dictionary() method. I did some quick tests with specifying an unnecessary zdict buffer to the decompressor (when zdict not used for compression), and it doesn’t _seem_ to break anything, but I am not 100% sure. Notes from my learning what zdict means: Zdict seems to be a bunch of user data to setup or train the (de)compressor with. It was added by Issue 14684. (Not a Python dictionary, nor any special zlib dictionary format. The Python documentation is confusing, and the Go documentation made more sense, assuming it is equivalent.) See the documentation of in/deflateSetDictionary() in <http://www.zlib.net/manual.html> or zlib.h. Looking through the history at <https://github.com/madler/zlib> can also be useful. |
|
|
msg266876 - (view) |
Author: Xiang Zhang (xiang.zhang) *  |
Date: 2016-06-02 05:59 |
Thanks for your review, Martin. :) I have replied with after thinking and try. Yes, this is more like a feature request. But for the end user, this behaviour is quite like a bug. > If we were to add full support of inflateSetDictionary() for raw deflate mode, I imagine it would be a new decompressobj.set_dictionary() method. I don't support adding a new set_dictionary() method. Right now decompressobj handles raw deflate format, zlib format and gzip format. Adding a set_dictionary() method for raw deflate format seems weird. If we are going to do so, I suggest reorganize the zlib like PHP and nodejs do, separating different formats into different objects. > I did some quick tests with specifying an unnecessary zdict buffer to the decompressor (when zdict not used for compression), and it doesn’t _seem_ to break anything, but I am not 100% sure. This patch only affects raw inflate. You have to set wbits negative. If you are decompressing with an unnecessary zdict, I think it ought to emit an error. With Python background, zlib dictionary is really hard to understand. I think manual doesn't tell much. I understand why zdict is needed until reading https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/. |
|
|
msg267391 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2016-06-05 06:21 |
Here is a patch which I propose to use that factors out a set_inflate_zdict() function. I think this will help with maintainence, e.g. if someone changes the UINT_MAX checking, but forgets that there are two places than need changing. Let me know if that is okay with you :) |
|
|
msg267405 - (view) |
Author: Xiang Zhang (xiang.zhang) *  |
Date: 2016-06-05 09:05 |
It looks good. I don't think that is a matter since usually I will do a search before change. But of course extracting them to a subroutine helps. |
|
|
msg267416 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2016-06-05 12:36 |
New changeset d91b951e676f by Martin Panter in branch '3.5': Issue #27164: Allow decompressing raw Deflate streams with predefined zdict https://hg.python.org/cpython/rev/d91b951e676f New changeset 470954641f3b by Martin Panter in branch 'default': Issue #27164: Merge raw Deflate zdict support from 3.5 https://hg.python.org/cpython/rev/470954641f3b |
|
|