Refactor compression code to expand URL support by dhimmel · Pull Request #14576 · pandas-dev/pandas
Conversation (77) · Commits (2) · Checks (0) · Files changed
Can you add tests for the new features you added? (the compressed sources from URLs)
```diff
@@ -276,7 +276,7 @@ def file_path_to_url(path):
 # ZipFile is not a context manager for <= 2.6
 # must be tuple index here since 2.6 doesn't use namedtuple for version_info
-if sys.version_info[1] <= 6:
+if compat.PY2 and sys.version_info[1] <= 6:
```
We don't support Python 2.6 anymore, so I think this check can be removed.
```python
f = _get_handle(f, 'r', encoding=self.encoding,
                compression=self.compression,
                memory_map=self.memory_map)
self.handles.append(f)
```
This line (adding each `f` to `handles`) will cause problems, as only the handles we opened ourselves should be added, not handles passed by the user (previously those were only added inside the `if` statements).
You can see the effect in the failure of the `test_file_handles` test.
@jorisvandenbossche I attempted a fix in 2f43c39, but it's not very elegant -- the logic is repeated in `_get_handle` and `PythonParser.__init__`.
What do you think is the best solution? In what situation do we not end up opening a handle ourselves? Is it a problem that in PY3 we pass `path_or_buf` through `io.TextIOWrapper` before adding it to `handles`?
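For context on the PY3 wrapping question, here is a minimal, hypothetical sketch (the buffer and variable names are illustrative, not from the PR) showing why wrapping a caller's buffer in `io.TextIOWrapper` creates a new object that pandas, not the user, would own:

```python
import io

# A caller-provided binary buffer: pandas did not open this,
# so pandas should not close it.
user_buf = io.BytesIO(b"a,b\n1,2\n")

# Wrapping for text decoding creates a NEW object. Closing the
# wrapper would also close the underlying buffer, which is why
# blindly tracking it in `handles` can clobber a user-owned buffer.
wrapper = io.TextIOWrapper(user_buf, encoding="utf-8")
text = wrapper.read()
print(text)

# Detaching releases the underlying buffer before the wrapper is
# discarded, leaving the user's buffer open.
wrapper.detach()
assert not user_buf.closed
```

The sketch suggests the distinction that matters for `handles`: objects pandas created (the wrapper) must be closed, while caller-supplied objects (the buffer) must be preserved.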
```diff
@@ -285,53 +285,84 @@ def ZipFile(*args, **kwargs):
 ZipFile = zipfile.ZipFile

-def _get_handle(path, mode, encoding=None, compression=None, memory_map=False):
+def _get_handle(source, mode, encoding=None, compression=None,
     """Gets file handle for given path and mode.
```
Call this `path_or_buf`, in line with spellings elsewhere.
dhimmel added a commit to dhimmel/pandas that referenced this pull request
@dhimmel yes that should be more general
dhimmel added a commit to dhimmel/pandas that referenced this pull request
Create compressed versions of the salary dataset for testing pandas-dev#14576.
Rename `salary.table.csv` to `salaries.tsv` because the dataset is tab rather than comma delimited. Remove the word "table" because it's implied by the extension. Rename `salary.table.gz` to `salaries.tsv.gz`, since compressed files should append to, not strip, the original extension.
Created the new files by running the following commands:

```shell
cd pandas/io/tests/parser/data
bzip2 --keep salaries.tsv
xz --keep salaries.tsv
zip salaries.tsv.zip salaries.tsv
```
dhimmel added a commit to dhimmel/pandas that referenced this pull request
```python
url = self.base_url + '.zip'
url_table = read_table(url, compression="zip", engine="python")
tm.assert_frame_equal(url_table, self.local_table)

def test_compressed_urls(self):
```
@jorisvandenbossche, to try to reduce the repetition of the testing code, I made this commit (28373f1), which uses nose test generators. However, the nose docs state:

> Please note that method generators are not supported in unittest.TestCase subclasses.

As a result, the tests yielded by `test_compressed_urls` aren't actually running. I'm attempting to run the test using:

```shell
nosetests --verbosity=3 pandas/io/tests/parser/test_network.py:TestCompressedUrl.test_compressed_urls
```

Do you know the best way to proceed? I'm new to nose and pretty new to testing, so any advice will be appreciated.
We run yielded tests in computation and io/pickle; this will be much easier when we switch to pytest.
see those for some examples
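As a sketch of the pattern being discussed (the function names and the compression list here are illustrative): nose only collects generator tests defined outside a `unittest.TestCase` subclass, where each yielded `(callable, argument)` pair becomes a separate test case.

```python
# Module-level nose-style generator test: nose runs each yielded
# (callable, argument) pair as an individual test.

def check_compression(compression):
    # Placeholder assertion; a real test would read a compressed URL
    # with this compression and compare against the local table.
    assert compression in ('gzip', 'bz2', 'xz', 'zip')

def test_compressed_urls():
    for compression in ('gzip', 'bz2', 'xz', 'zip'):
        yield check_compression, compression

# Inside a unittest.TestCase subclass, this generator would be
# silently skipped -- the behavior reported above.
```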
Current coverage is 85.29% (diff: 88.40%)

```
@@            master   #14576   diff @@
Files            144      144
Lines          50980    50943    -37
Methods            0        0
Messages           0        0
Branches           0        0
- Hits       43470    43453    -17
- Misses      7510     7490    -20
Partials           0        0
```

Powered by Codecov. Last update 14e4815...b33a500
dhimmel added a commit to dhimmel/pandas that referenced this pull request
```python
from io import TextIOWrapper
return TextIOWrapper(path_or_buf, encoding=encoding)
elif compression:
```
Can this be `infer` here? (IIRC we have an inference function based on the file extension.) Is this guaranteed to be a valid compression? (It's fine, though I think it's easy to handle here anyhow.)
Currently, this function (`_get_handle`) cannot consume `compression='infer'`. However, it would be an easy addition:

```python
if compression == 'infer':
    assert is_path
    compression = infer_compression(path_or_buf)
```

One issue I'm having is that `_get_handle` was undocumented before this PR, so it's a bit hard for me to know what to implement. @jreback or @jorisvandenbossche: can you comment on the tentative docstring and fill in the missing details, so I know exactly what to implement?
@dhimmel What is still missing in your current docstring?
Adding `compression='infer'` here only makes sense if it would be used, of course. Where does the inference of the compression happen now? Maybe it is already inferred before it is passed to this function?
> Where does the inference of the compression happen now?

@jorisvandenbossche, inference currently happens in pandas/io/parsers.py.
I don't think `compression == 'infer'` will ever make it this far with the current design. @jreback, does this answer your question?
@dhimmel that's fine. I just want to validate somewhere that `compression in [None, ......]`, or raise a `ValueError`.
```python
def test_url_gz_infer(self):
    url = 'https://s3.amazonaws.com/pandas-test/salary.table.gz'
    url_table = read_table(url, compression="infer", engine="python")

class TestCompressedUrl:
```
Inherit from `object`.
Ideally, if you can use some of this code in the parser routines, that would be great (it may not be completely possible, but I want to consolidate some of that duplication).
Regarding common.py#L239-L244, I think there's a problem. Currently, when `compression="infer"` and the `path_or_buf` is a URL, `get_filepath_or_buffer` uses the Content-Encoding header of the response to infer the compression type. However, I don't actually think compressed files on the network will provide a Content-Encoding header matching their compression type. Instead, I think we want to infer the compression of a URL from its name, just as we do with paths.
For example, https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/io/tests/parser/data/salaries.csv.bz2 would infer `bz2` compression because it ends with `.bz2`. @jreback and @jorisvandenbossche, agree?
I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.
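A minimal sketch of extension-based inference for URLs (the helper name and mapping are illustrative assumptions, not the PR's actual implementation):

```python
import os
from urllib.parse import urlparse

# Extension-to-compression mapping, mirroring the table discussed
# in this PR.
_EXT_TO_COMPRESSION = {'.gz': 'gzip', '.bz2': 'bz2',
                       '.zip': 'zip', '.xz': 'xz'}

def infer_compression_from_url(url):
    """Infer compression from the extension of a URL's path component."""
    path = urlparse(url).path
    _, ext = os.path.splitext(path)
    return _EXT_TO_COMPRESSION.get(ext)  # None means no compression

url = ('https://raw.githubusercontent.com/pandas-dev/pandas/master/'
       'pandas/io/tests/parser/data/salaries.csv.bz2')
print(infer_compression_from_url(url))  # bz2
```

Parsing the URL first means query strings and fragments don't confuse the extension lookup, which plain `os.path.splitext` on the raw URL could.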
```diff
-f = lzma.LZMAFile(path, mode)
+f = lzma.LZMAFile(path_or_buf, mode)
 # Unrecognized Compression
```
Ah OK, we are doing the validation; never mind then. (I assume we have some tests that hit this path.)
> assume we have some tests that hit this path

Actually, I don't think there were any tests for invalid compression (regardless of URL or not), so I added one in f2ce8f8.
```python
# in Python 3, convert BytesIO or fileobjects passed with an encoding
elif compat.PY3 and isinstance(f, compat.BytesIO):
    from io import TextIOWrapper
add_handle = (
```
Maybe this could be a utility function in io/common (unless it's not used anywhere else).
I think `add_handle` is only used here -- I created it for this PR, so nowhere else depends on it.
I'm still a little unsure about the purpose of `self.handles` -- should it only contain file handles? Right now we're adding handles that aren't files (for example, in the case of `compat.PY3 and isinstance(f, compat.BytesIO)`).
`self.handles` is to allow things to be closed if WE (meaning pandas) opened them, e.g. when you wrap with `TextIOWrapper` we NEED to close them. Otherwise we try to preserve open file handles (that were passed to us).
Another way to do this would be to handle it inside `_get_handle` and return from that function the handles that should be added. IMO that would be more maintainable, as right now this boolean expression depends on the content of the `_get_handle` function, so better to do it there.
```python
'gzip': '.gz',
'bz2': '.bz2',
'zip': '.zip',
'xz': '.xz',
```
Is there a test for invalid compression specified?
Did you want a specific test for invalid compression specified for a URL? As of the latest commit (bff0f3e), compression inference has been consolidated and invalid compression raises an error before compression is attempted.
```python
content_encoding = req.headers.get('Content-Encoding', None)
if content_encoding == 'gzip':
    # Override compression based on Content-Encoding header
    compression = 'gzip'
```
> i think inferring based on the file name is usually enough - but u can also use the Content if it's available
@jreback, `Content-Encoding` is not a versatile way to infer compression -- gzip is the only compression encoding that both pandas and the Content-Encoding specification support. I've modified it so inference on a URL uses the extension, just as a path would. However, if `Content-Encoding` is `gzip`, then compression is overridden. I don't expect this situation to arise very often. Does this make sense, or should we just totally ignore `Content-Encoding`?
Yeah, I agree with your assessment. Since this was here, we can leave it. But I agree the filename inference is better / more standard.
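The resolution logic just agreed on can be sketched as a tiny helper (the function name is hypothetical; the real code lives inline in `get_filepath_or_buffer`): prefer extension-based inference, but let a gzip Content-Encoding header override it.

```python
def resolve_compression(inferred_from_extension, content_encoding):
    """Prefer the extension-inferred compression, but let a gzip
    Content-Encoding header override it (the one encoding both
    pandas and the header specification support)."""
    if content_encoding == 'gzip':
        return 'gzip'
    return inferred_from_extension

# Extension says bz2, no Content-Encoding header: keep bz2.
print(resolve_compression('bz2', None))   # gives 'bz2'
# No extension hint, but the server declares gzip: use gzip.
print(resolve_compression(None, 'gzip'))  # gives 'gzip'
```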
I'm getting failing builds in AppVeyor in Python 2 due to `ImportError: No module named backports`. This occurs in `compat/__init__.py` at:

```python
def import_lzma():
    """ import the backported lzma library
    or raise ImportError if not available """
    from backports import lzma
    return lzma
```

Any idea why `backports` is not available? How should I proceed?
Update: I think I resolved this in af3c2bc.
dhimmel added a commit to dhimmel/pandas that referenced this pull request
dhimmel added a commit to dhimmel/pandas that referenced this pull request
This was referenced Dec 13, 2016
Assuming the tests pass for 640dc1b, I don't think there are any more unaddressed comments.
Can you rebase? There was a problem with AppVeyor which I just fixed.
@jreback rebased with appveyor now passing.
Can you open issues for those to keep track of them.
@jorisvandenbossche: I created #14874 and #14875.
We can just squash your additional commits into one commit, and then commit those two commits from the two authors to master.
@jorisvandenbossche do you want me to squash all of my commits? If so, what should go in the commit message? Should I add the issues we're closing to the commit message?
dhimmel changed the title: WIP: Consolidate compression code → Refactor compression code to expand URL support
@dhimmel you can squash, or it will get squashed on the merge. In this case, if you'd leave the original commit and squash all of your subsequent ones into a single (or possibly multiple, logically oriented) commits, that would be great. We could in this case cherry-pick them (rather than completely squashing) to preserve authorship (I think when we squash ALL, the authorship becomes the last author's). If we only have 2 commits, a merge is easy.
> In this case, if you'd leave the original commit and squash all of your subsequent ones into a single (or possibly multiple, logically oriented) commits, that would be great.
I'm going to squash all of my commits. Grouping them would be pretty burdensome, and the intermediate commits were often not functional. I created a branch on dhimmel/pandas to preserve the individual commits for reference.
Just to list things here, `git log master.. --oneline` gives the following:
```
640dc1b Simplify TextIOWrapper usage
4774b75 Use list to store new handles
94aa0cb Track new handles in _get_handle #14576 (comment) #14576 (comment)
376dce3 Close bz2 file object in Python 2
b88d1b8 Remove compressed URL c-engine tests
b34c2f6 Move _infer_compression to io/common.py
d27e57d Fix "ImportError: No module named backports"
97a1c25 Test compressed URLs using c engine
ecee74f Revert to old error msg. Remove unused line
7c9fe8b Skip xz test if no lzma
492e615 Remove unused imports
bf9fe97 Add invalid compression test
7940ab7 Fix PEP8 errors
05332fd Update S3 get_filepath_or_buffer compression
24ea605 Fix get_filepath_or_buffer and _get_handle
2dc1f2f Clean up docstring
10652a0 Move compression inference to io/parsers
8d24bcf Delete unused function _wrap_compressed
83b2bc5 Infer compression from URL extension
c2e6e5b Improve error message, fix partial scope
9977da2 Fix nose test generator bug. Better error messages
7ff1247 Inheret from object for TestCompressedUrl
30e1caf Fix local salaries.csv path
6230dfb Update salaries.csv URL post #14587
6530f46 Enable nose execution of test generators
a441448 Broken: use method generators to DRY tests
25114c3 Tests for bz2, xz & zip URLs
89e45f4 Revert changes to zip error messages
29b7f74 Change error message in test_zip
3aba6cb Replicate original handles appending
101a344 Impove _get_handle docstring
58ffbce Rename source to path_or_buf and flake8
f0901fb Remove ZipFile support for Python 2.6
2c78178 Improve readability
592cf92 Fix ZipFile bug in Python 3.6+
74caf72 combine compression code into one
```
36 commits total in this PR. Wow!
dhimmel added a commit to dhimmel/pandas that referenced this pull request
Mahmoud Lababidi and others added 2 commits
> if we only have 2 commits a merge is easy.

Great, we're now at 2 commits. I think a merge makes sense, because it is nice to see the diff of both commits combined; alone, the changesets would be incomplete.
Let me know if there's anything more for me to do.
@jreback You can use the github interface as well to "rebase and merge" instead of squashing, which will just apply the commits in this PR to master. And so the two commits now in the PR are perfect!
@jorisvandenbossche right, but this is the exception to the rule. I will have a look in a bit.
@dhimmel going to merge, and can followup on comments / issues in another PR.
jreback pushed a commit that referenced this pull request
thanks @dhimmel very nice!
In your xref'd PR, please add a whatsnew entry that covers all of the issues that were closed (for 0.20.0).
dhimmel added a commit to dhimmel/pandas that referenced this pull request
dhimmel added a commit to dhimmel/pandas that referenced this pull request
dhimmel added a commit to dhimmel/pandas that referenced this pull request
jreback pushed a commit that referenced this pull request
…th C engine (GH14874)
Follow up on #14576, which refactored compression code to expand URL support. Fixes up some small remaining issues and adds a what's new entry. - [x] Closes #14874
Author: Daniel Himmelstein daniel.himmelstein@gmail.com
Closes #14880 from dhimmel/whats-new and squashes the following commits:
```
e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring
```
ischurov pushed a commit to ischurov/pandas that referenced this pull request
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request
Labels: IO (issues that don't fit into a more specific label)