DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874) by dhimmel · Pull Request #14880 · pandas-dev/pandas


Conversation


dhimmel

jorisvandenbossche

url = ('https://github.com/pandas-dev/pandas/raw/master/' +
'pandas/io/tests/parser/data/salaries.csv.bz2')
pd.read_table(url, compression='infer') # default, infer compression
pd.read_table(url, compression='xz') # explicitly specify compression


This will give an error, as the url is bz2, is that the intention?


No, this is a mistake... they should both be bz2 or xz.


Can you also add .head() or so? Otherwise the output will get a bit long for just demonstrating the compression keyword.


Or assign to df = without showing its content; that's also a possibility.


Didn't realize this code would get executed. It now assigns to df and then uses head to display the first two rows; done in cb40a41.
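For reference, the revised docs example presumably looks something like the following (a sketch only; the exact wording committed in cb40a41 may differ, and the URL is the one from the snippet above):

import pandas as pd

url = ('https://github.com/pandas-dev/pandas/raw/master/'
       'pandas/io/tests/parser/data/salaries.csv.bz2')
df = pd.read_table(url, compression='infer')   # default: inferred from '.bz2'
# df = pd.read_table(url, compression='bz2')   # or name the compression explicitly
df.head(2)  # show only the first two rows so the rendered output stays short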

'handle')
content = source.read()
source.close()
from pandas.compat import StringIO


pandas.compat is already imported as compat


Fixed those two issues. Was a nice opportunity for me to try out --fixup. We can do the --autosquash when the PR looks ready to go.

jreback

@@ -44,6 +44,19 @@ fixed-width text files, and :func:`read_excel` for parsing Excel files.
pd.read_fwf(StringIO(data)).dtypes
pd.read_fwf(StringIO(data), dtype={'a':'float64', 'b':'object'}).dtypes
Reading dataframes from URLs, in :func:`read_csv` or :func:`read_table`, now


I would make this a sub-section.

jreback

Reading dataframes from URLs, in :func:`read_csv` or :func:`read_table`, now
supports additional compression methods (`xz`, `bz2`, `zip`). Previously, only
`gzip` compression was supported. By default, compression of URLs and paths are
now both inferred using their file extensions.


can you add all of the issues that were closed here, e.g. (:issue:`....`, :issue:`...`)

@jreback

@dhimmel

Okay I cleaned up the commit history with

git rebase --interactive --autosquash 30025d82564fc27fbab58fbd791009e5b77a23db
git push --force

Assuming tests pass, I think this PR is ready. @jreback / @jorisvandenbossche -- you may want to double-check the what's new, since I made some additional changes in 3f4cd45.

jorisvandenbossche

@jorisvandenbossche changed the title from "Make refactored compression code release ready" to "DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874)"

Dec 15, 2016

jorisvandenbossche


Only some remaining lint:

pandas/io/tests/parser/compression.py:11:1: F401 'pandas.compat' imported but unused

Get file handle for given path/buffer and mode.
Get the compression method for filepath_or_buffer. If compression='infer',
the inferred compression method is returned. Otherwise, the input
compression method is returned unchanged, unless it's invalid, in which case


PEP8: line too long (see the bottom third of the Travis test log).
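For context, the inference behaviour described in the docstring above boils down to roughly the following. This is an illustrative sketch only, not the actual helper in pandas/io/common.py; the function name and extension map are assumptions.

# Illustrative sketch of extension-based compression inference
# (not the real pandas `_infer_compression`; names are hypothetical).
_extension_to_compression = {'.gz': 'gzip', '.bz2': 'bz2',
                             '.zip': 'zip', '.xz': 'xz'}

def infer_compression_sketch(filepath_or_buffer, compression):
    """Return the compression method for filepath_or_buffer.

    If compression is 'infer', guess it from the file extension and return
    None when nothing matches; otherwise validate the value and pass it
    through unchanged, raising ValueError for unrecognized methods.
    """
    valid = (None, 'gzip', 'bz2', 'zip', 'xz')
    if compression == 'infer':
        if not isinstance(filepath_or_buffer, str):
            return None  # cannot infer from a buffer
        for ext, method in _extension_to_compression.items():
            if filepath_or_buffer.endswith(ext):
                return method
        return None
    if compression not in valid:
        raise ValueError('Unrecognized compression: {!r}'.format(compression))
    return compression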

@codecov-io

Current coverage is 84.55% (diff: 100%)

Merging #14880 into master will decrease coverage by 0.75%

@@            master    #14880    diff @@
  Files          144       144
  Lines        51015     51043     +28
  Methods          0         0
  Messages         0         0
  Branches         0         0

Powered by Codecov. Last update d1b1720...e1b5d42

@dhimmel


Reference corresponding issues in What's New.

Change code example to use string formatting for improved modularity.

Add what's new id

@dhimmel

@jorisvandenbossche, I fixed the PEP8 issues. I had to rebase to resolve a conflict that arose in doc/source/whatsnew/v0.20.0.txt and did some squashing/rearranging. As a result, the chronological commit order GitHub shows isn't the actual commit order.

Also there may be an issue with S3 tests failing. I was going to wait for the latest build and see if the issues are still there.

@jorisvandenbossche

It's indeed failing. What is the reason for this? It seems you didn't change anything related to that?

did some squashing/rearranging

Just as a personal recommendation, I would not put too much effort into it. Previously we asked people to squash their PRs, but now the GitHub interface (and our own tooling) has improved, and we squash on merge. Your previous PR was just a special case, as there were two commit authors to retain.

@dhimmel


It's indeed failing. What is the reason for this? It seems you didn't change anything related to that?

Some S3 tests were still expecting bz2 failure in Python 2. Fixed in 8568aed.

Previously we asked people to squash their PRs, but now the github interface (and our own tooling) has improved, and we squash on merge.

Instead of rebasing, is it acceptable to merge in the current master? This way we don't keep destroying references to old commit hashes.

jreback

dataframes from URLs in :func:`read_csv` or :func:`read_table` now supports
additional compression methods: ``xz``, ``bz2``, and ``zip`` (:issue:`14570`).
Previously, only ``gzip`` compression was supported. By default, compression of
URLs and paths are now both inferred using their file extensions. Additionally,


The compression code

paths are not inferred using (remove both)

Additionally, support for bz2 compression in the Python 2 C engine improved.


Addressed comments 1 and 3 in e1b5d42. @jreback, I didn't change:

By default, compression of URLs and paths are now both inferred using their file extensions.

Previously, compression of paths was by default inferred from their extension, but not URLs. Now both are inferred by their extension. Am I missing something?
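To illustrate the behaviour being described, a hedged sketch (the local file name is a placeholder, and the URL is the one used in the docs example):

import pandas as pd

# Local paths: compression was already inferred from the extension before
# this PR, so '.bz2' triggers bz2 decompression by default.
df_local = pd.read_csv('salaries.csv.bz2')  # placeholder file name

# URLs: extension-based inference is the new part; previously only gzip
# compression was supported for URLs.
url = ('https://github.com/pandas-dev/pandas/raw/master/'
       'pandas/io/tests/parser/data/salaries.csv.bz2')
df_remote = pd.read_csv(url)  # compression='infer' is the default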

jreback

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Compression code was refactored (:issue:`12688`). As a result, reading
dataframes from URLs in :func:`read_csv` or :func:`read_table` now supports


If there are any other issues that were closed by this, please list them as well.


Rechecked... they're all already listed.


ischurov pushed a commit to ischurov/pandas that referenced this pull request

Dec 19, 2016

@dhimmel @ischurov

…th C engine (GH14874)

Follow up on pandas-dev#14576, which refactored compression code to expand URL support. Fixes up some small remaining issues and adds a what's new entry. - [x] Closes pandas-dev#14874

Author: Daniel Himmelstein daniel.himmelstein@gmail.com

Closes pandas-dev#14880 from dhimmel/whats-new and squashes the following commits:

e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring

ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request

Dec 26, 2016

@dhimmel @ShaharBental


Labels

IO Data: IO issues that don't fit into a more specific label