Refactor compression code to expand URL support by dhimmel · Pull Request #14576 · pandas-dev/pandas


Conversation (77) · Commits (2) · Checks (0) · Files changed

Conversation


dhimmel

jorisvandenbossche


Can you add tests for the new features you added (the compressed sources from URLs)?

@@ -276,7 +276,7 @@ def file_path_to_url(path):
# ZipFile is not a context manager for <= 2.6
# must be tuple index here since 2.6 doesn't use namedtuple for version_info
if sys.version_info[1] <= 6:
if compat.PY2 and sys.version_info[1] <= 6:


We don't support python 2.6 anymore, so I think this check can be removed


f = _get_handle(f, 'r', encoding=self.encoding,
compression=self.compression,
memory_map=self.memory_map)
self.handles.append(f)


This line (adding each f to handles) will give problems, as only the handles we opened ourselves should be added, not handles passed by the user (previously those were only added inside the if statements)
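The distinction the review draws (pandas should only close handles it opened itself, never handles the user passed in) can be sketched with a hypothetical helper; the names below are illustrative, not the actual pandas internals:

```python
# Hypothetical sketch of the handle-tracking pattern the review asks for:
# only handles that we open ourselves go into ``handles`` for later closing;
# caller-owned buffers are passed through untracked.
import io


def open_or_passthrough(path_or_buf, handles):
    """Return a readable handle, recording it in ``handles`` only if we opened it."""
    if isinstance(path_or_buf, str):
        f = open(path_or_buf, "r")
        handles.append(f)  # we opened it, so we are responsible for closing it
        return f
    return path_or_buf  # caller-owned handle: do not track or close


handles = []
user_buf = io.StringIO("a,b\n1,2\n")
f = open_or_passthrough(user_buf, handles)
# Closing everything in ``handles`` leaves the user's buffer untouched.
for h in handles:
    h.close()
```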


You can see the effect in the failure of the test_file_handles test


@jorisvandenbossche I attempted a fix in 2f43c39, but it's not very elegant -- logic is repeated in _get_handle and PythonParser.__init__.

What do you think is the best solution? In what situation do we not end up opening a handle ourselves? Is it a problem that in PY3 we pass path_or_buf through io.TextIOWrapper before adding it to handles?

jreback

@@ -285,53 +285,84 @@ def ZipFile(*args, **kwargs):
ZipFile = zipfile.ZipFile
def _get_handle(path, mode, encoding=None, compression=None, memory_map=False):
"""Gets file handle for given path and mode.
def _get_handle(source, mode, encoding=None, compression=None,


Call this path_or_buf, in line with the spelling used elsewhere.


dhimmel added a commit to dhimmel/pandas that referenced this pull request

Nov 3, 2016

@dhimmel

@dhimmel

@dhimmel

@jreback

@dhimmel yes that should be more general

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Nov 4, 2016

@dhimmel

Create compressed versions of the salary dataset for testing pandas-dev#14576.

Rename salary.table.csv to salaries.tsv because the dataset is tab rather than comma delimited. Remove the word "table" because it's implied by the extension. Rename salary.table.gz to salaries.tsv.gz, since compressed files should append to, not strip, the original extension.

Created new files by running the following commands:

    cd pandas/io/tests/parser/data
    bzip2 --keep salaries.tsv
    xz --keep salaries.tsv
    zip salaries.tsv.zip salaries.tsv

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Nov 10, 2016

@dhimmel

dhimmel

url = self.base_url + '.zip'
url_table = read_table(url, compression="zip", engine="python")
tm.assert_frame_equal(url_table, self.local_table)
def test_compressed_urls(self):


@jorisvandenbossche, to try and reduce the repetition of the testing code, I made this commit (28373f1) which uses nose test generators. However, the nose doc states:

Please note that method generators are not supported in unittest.TestCase subclasses.

As a result, the tests yielded by test_compressed_urls aren't actually running. I'm attempting to run the test using:

nosetests --verbosity=3 pandas/io/tests/parser/test_network.py:TestCompressedUrl.test_compressed_urls

Do you know the best way to proceed? I'm new to nose and pretty new to testing, so any advice will be appreciated.


We run yielded tests in computation and io/pickle.

This will be much easier when we switch to pytest.


See those for some examples.


@codecov-io

Current coverage is 85.29% (diff: 88.40%)

Merging #14576 into master will increase coverage by 0.02%

@@           master   #14576   diff @@

Files           144      144
Lines         50980    50943    -37
Methods           0        0
Messages          0        0
Branches          0        0

Powered by Codecov. Last update 14e4815...b33a500

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Nov 11, 2016

@dhimmel

jreback

from io import TextIOWrapper
return TextIOWrapper(path_or_buf, encoding=encoding)
elif compression:


Can this be 'infer' here? (IIRC we have an inference function based on the file extension.) Is this guaranteed to be a valid compression? (It's fine, though I think it's easy to handle here anyhow.)


Currently, this function (_get_handle) cannot consume compression=infer. However, it would be an easy addition:

    if compression == 'infer':
        assert is_path
        compression = infer_compression(path_or_buf)

One issue I'm having is that _get_handle was undocumented before this PR. So it's a bit hard for me to know what to implement. @jreback or @jorisvandenbossche: can you comment on the tentative docstring and fill in the missing details, so I know exactly what to implement?


@dhimmel What is still missing in your current docstring?

Adding compression='infer' here only makes sense if it would be used, of course. Where does the inference of the compression happen now? Maybe it is already inferred before it is passed to this function?


Where does the inference of the compression happen now?

@jorisvandenbossche, inference currently happens in pandas/io/parsers.py.

I don't think compression == 'infer' will ever make it this far with the current design. @jreback, does this answer your question?


@dhimmel that's fine. I just want to validate somewhere that compression is in [None, ......] or raise a ValueError.

jreback

def test_url_gz_infer(self):
url = 'https://s3.amazonaws.com/pandas-test/salary.table.gz'
url_table = read_table(url, compression="infer", engine="python")
class TestCompressedUrl:


inherit from object

@jreback

Ideally, if you can use some of this code in the parser routines, that would be great (may not be completely possible, but I want to consolidate some of that duplication).

@dhimmel

Regarding common.py#L239-L244, I think there's a problem. Currently, when compression="infer" and path_or_buf is a URL, get_filepath_or_buffer is using the Content-Encoding header of the request to infer the compression type. However, I don't actually think compressed files on the network will provide a Content-Encoding header of their compression type. Instead I think we want to infer the compression of a URL from its name, just as we do with paths.

For example, https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/io/tests/parser/data/salaries.csv.bz2 would infer bz2 compression because it ends with .bz2. @jreback and @jorisvandenbossche, agree?
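Extension-based inference as proposed here can be sketched as follows; `infer_compression_from_name` is a hypothetical helper (not the actual pandas API), and the extension mapping mirrors the compression-to-extension dict shown in the diff:

```python
# Hypothetical sketch: map a known file extension at the end of a path or
# URL to a compression name, returning None when nothing matches.
_compression_by_extension = {
    '.gz': 'gzip',
    '.bz2': 'bz2',
    '.zip': 'zip',
    '.xz': 'xz',
}


def infer_compression_from_name(path_or_url):
    """Return the inferred compression, or None if no known extension matches."""
    for ext, compression in _compression_by_extension.items():
        if path_or_url.endswith(ext):
            return compression
    return None
```

Under this scheme a URL ending in .bz2 would infer bz2 compression exactly as a local path would.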

@jreback

I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.

jreback

f = lzma.LZMAFile(path, mode)
f = lzma.LZMAFile(path_or_buf, mode)
# Unrecognized Compression


Ah OK, we are doing the validation, never mind then. (I assume we have some tests that hit this path.)


assume we have some tests that hit this path

Actually, I don't think there were any tests for invalid compression (regardless of URL or not), so I added one in f2ce8f8.

jreback

# in Python 3, convert BytesIO or fileobjects passed with an encoding
elif compat.PY3 and isinstance(f, compat.BytesIO):
from io import TextIOWrapper
add_handle = (


Maybe this could be a utility function in io/common (unless it's not used anywhere else).


I think add_handle is only used here -- I created it for this PR, so nowhere else depends on it.

I'm still a little unsure about the purpose of self.handles. Should it only contain file handles? Right now we're adding handles that aren't files (for example, in the case of compat.PY3 and isinstance(f, compat.BytesIO)).


self.handles is there to allow things to be closed if WE (meaning pandas) opened them, e.g. when you wrap with TextIOWrapper we NEED to close it. Otherwise we try to preserve open file handles (that were passed to us).


Another way would be to handle this inside _get_handle and have that function return the handles that should be added. IMO that would be more maintainable: right now this boolean expression depends on the internals of the _get_handle function, so it's better to do it there.

jreback

'gzip': '.gz',
'bz2': '.bz2',
'zip': '.zip',
'xz': '.xz',


is there a test for invalid compression specified?


Did you want a specific test for invalid compression specified for a URL? As of the latest commit (bff0f3e), compression inference has been consolidated and invalid compression raises an error before compression is attempted.

dhimmel

content_encoding = req.headers.get('Content-Encoding', None)
if content_encoding == 'gzip':
# Override compression based on Content-Encoding header
compression = 'gzip'


I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.

@jreback, Content-Encoding is not a versatile way to infer compression -- gzip is the only compression encoding that pandas and the Content-Encoding specification both support. I've modified it so inference on a URL uses the extension, just as a path would. However, if Content-Encoding is gzip, then compression is overridden. I don't expect this situation to arise very often. Does this make sense or should we just totally ignore Content-Encoding?
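The behavior described, inferring from the extension but letting a gzip Content-Encoding header win, can be sketched as below; `resolve_compression` is a hypothetical helper and `headers` stands in for the response headers, so nothing here is the actual pandas code:

```python
# Hypothetical sketch: extension-based inference with a gzip
# Content-Encoding override, mirroring the behavior described above.
def resolve_compression(url, headers):
    """Infer compression from the URL extension, overriding with gzip
    when the server declares Content-Encoding: gzip."""
    compression = None
    for ext, comp in (('.gz', 'gzip'), ('.bz2', 'bz2'),
                      ('.zip', 'zip'), ('.xz', 'xz')):
        if url.endswith(ext):
            compression = comp
            break
    if headers.get('Content-Encoding') == 'gzip':
        compression = 'gzip'  # the header takes precedence, per the discussion
    return compression
```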


Yeah, I agree with your assessment. Since this was here, we can leave it. But I agree the filename inference is better / more standard.

@dhimmel

I'm getting failing builds in AppVeyor in Python 2 due to ImportError: No module named backports. This occurs in compat/__init__.py at:

def import_lzma():
    """ import the backported lzma library
    or raise ImportError if not available """
    from backports import lzma
    return lzma

Any idea why backports is not available? How should I proceed?

Update: I think I resolved this in af3c2bc.

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Dec 4, 2016

@dhimmel

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Dec 13, 2016

@dhimmel

This was referenced

Dec 13, 2016

@dhimmel

Assuming the tests pass for 640dc1b, I don't think there are any more unaddressed comments.

Can you rebase? There was a problem with AppVeyor which I just fixed.

@jreback, rebased with AppVeyor now passing.

Can you open issues for those to keep track of them.

@jorisvandenbossche: I created #14874 and #14875.

We can just squash your additional commits into one commit, and then commit those two commits of two authors to master.

@jorisvandenbossche do you want me to squash all of my commits? If so, what should go in the commit message? Should I add the issues we're closing to the commit message?

@dhimmel dhimmel changed the title WIP: Consolidate compression code → Refactor compression code to expand URL support

Dec 13, 2016

@jreback

@dhimmel you can squash, or it will get squashed on the merge. In this case, if you'd leave the original commit and squash all of your subsequent ones into a single (or possibly multiple, logically oriented) commits, that would be great. We could in this case cherry-pick them (rather than completely squashing) to preserve authorship (as I think when we squash ALL, the authorship becomes the last author's). If we only have 2 commits, a merge is easy.

@dhimmel

In this case, if you'd leave the original commit and squash all of your subsequent ones into a single (or possibly multiple, logically oriented) commits, that would be great.

I'm going to squash all of my commits. Grouping them will be pretty burdensome and the intermediate commits were often not functional. I created a branch on dhimmel/pandas to preserve the individual commits for reference.

Just to list things here, git log master.. --oneline gives the following:

640dc1b Simplify TextIOWrapper usage
4774b75 Use list to store new handles
94aa0cb Track new handles in _get_handle #14576 (comment) #14576 (comment)
376dce3 Close bz2 file object in Python 2
b88d1b8 Remove compressed URL c-engine tests
b34c2f6 Move _infer_compression to io/common.py
d27e57d Fix "ImportError: No module named backports"
97a1c25 Test compressed URLs using c engine
ecee74f Revert to old error msg. Remove unused line
7c9fe8b Skip xz test if no lzma
492e615 Remove unused imports
bf9fe97 Add invalid compression test
7940ab7 Fix PEP8 errors
05332fd Update S3 get_filepath_or_buffer compression
24ea605 Fix get_filepath_or_buffer and _get_handle
2dc1f2f Clean up docstring
10652a0 Move compression inference to io/parsers
8d24bcf Delete unused function _wrap_compressed
83b2bc5 Infer compression from URL extension
c2e6e5b Improve error message, fix partial scope
9977da2 Fix nose test generator bug. Better error messages
7ff1247 Inheret from object for TestCompressedUrl
30e1caf Fix local salaries.csv path
6230dfb Update salaries.csv URL post #14587
6530f46 Enable nose execution of test generators
a441448 Broken: use method generators to DRY tests
25114c3 Tests for bz2, xz & zip URLs
89e45f4 Revert changes to zip error messages
29b7f74 Change error message in test_zip
3aba6cb Replicate original handles appending
101a344 Impove _get_handle docstring
58ffbce Rename source to path_or_buf and flake8
f0901fb Remove ZipFile support for Python 2.6
2c78178 Improve readability
592cf92 Fix ZipFile bug in Python 3.6+
74caf72 combine compression code into one

36 commits total in this PR. Wow!

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Dec 13, 2016

@dhimmel

Mahmoud Lababidi and others added 2 commits

December 13, 2016 13:34

@dhimmel

@dhimmel

@dhimmel

if we only have 2 commits a merge is easy.

Great, we're now at 2 commits. I think a merge makes sense, because it is nice to see the diff of both commits combined; alone, each changeset would be incomplete.

Let me know if there's anything more for me to do.

@jorisvandenbossche

@jreback You can use the github interface as well to "rebase and merge" instead of squashing, which will just apply the commits in this PR to master. And so the two commits now in the PR are perfect!

@jreback

@jorisvandenbossche right, but this is the exception to the rule. I will have a look in a bit.

@dhimmel going to merge, and can followup on comments / issues in another PR.

jreback pushed a commit that referenced this pull request

Dec 13, 2016

@jreback

@jreback

thanks @dhimmel very nice!

In your xref PR, please add a whatsnew entry that covers all of the issues that were closed (for 0.20.0).

@jorisvandenbossche

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Dec 14, 2016

@dhimmel

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Dec 14, 2016

@dhimmel

dhimmel added a commit to dhimmel/pandas that referenced this pull request

Dec 15, 2016

@dhimmel

jreback pushed a commit that referenced this pull request

Dec 17, 2016

@dhimmel @jreback

…th C engine (GH14874)

Follow up on #14576, which refactored compression code to expand URL support. Fixes up some small remaining issues and adds a what's new entry. - [x] Closes #14874

Author: Daniel Himmelstein daniel.himmelstein@gmail.com

Closes #14880 from dhimmel/whats-new and squashes the following commits:

e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring

ischurov pushed a commit to ischurov/pandas that referenced this pull request

Dec 19, 2016

@ischurov

ischurov pushed a commit to ischurov/pandas that referenced this pull request

Dec 19, 2016

@dhimmel @ischurov

ischurov pushed a commit to ischurov/pandas that referenced this pull request

Dec 19, 2016

@dhimmel @ischurov


ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request

Dec 26, 2016

@dhimmel @ShaharBental


Labels

IO Data

IO issues that don't fit into a more specific label