COMPAT: reading json with lines=True from s3, xref #17200 by alph486 · Pull Request #17201 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation81 Commits22 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

alph486

Attempt to decode the bytes array with encoding passed to the call.

@gfyoung

@alph486 : Thanks for the PR! You will need to write tests for us to merge this. I would also take a look at what @TomAugspurger suggested in terms of patching this.

@alph486

Is there a particular way you guys go about handling s3 in your tests? I see some files in a bucket somewhere, but idk if there are any jsonl type files or how to add one. For that matter, I don't see any tests for read_json where the file is in s3.

@pep8speaks

Hello @alph486! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 27, 2017 at 00:04 Hours UTC

@alph486

@gfyoung After going back to clean this up I realized jsonl from URL based file also suffers from the same issue. I have a cleaner fix that I think works in both cases.

Can you let me know how I should proceed for getting a JSONL file uploaded to the public s3 bucket for testing this in the test suite? I don't have write access.

@gfyoung

Can you let me know how I should proceed for getting a JSONL file uploaded to the public s3 bucket for testing this in the test suite? I don't have write access.

@TomAugspurger @jreback : thoughts?

@jreback

if you put the file on a gist somewhere @TomAugspurger or I can upload to the bucket.

@jreback jreback changed the titleFix for #17200 COMPAT: reading json with lines=True from s3, xref #17200

Aug 8, 2017

@alph486

@jreback @gfyoung Thanks for the assist here. I'll wrap this up soon; just after I get my current project completed that requires this change.

@alph486

@alph486

@gfyoung

@alph486 : Are you still waiting on the updated URL to patch the failing tests?

@alph486

@gfyoung yeah, i just need a target s3 url to put in my code

@gfyoung

@TomAugspurger

@alph486

@jreback

yeah I don't think I have this either. @TomAugspurger IIRC you can login using the Continuum dashboard.

@jreback

if that works we should prob change it :>

@TomAugspurger

@alph486

sorry to keep nudging here, but any progress from IT @TomAugspurger ? Id love to try to a patched official version out so we can use our feature in production.

@TomAugspurger

No word yet.

Really, we should be using https://github.com/spulec/moto to mock all our S3 calls. Now that s3fs (and boto3) are handling all the network stuff, we don't need to be testing it. Just our text parsing. I'll open an issue about that (and you can look at fixing it if you're interested :)

#17325

@alph486

I can try to implement my tests using moto... we'll see if that works Thanks, Kevin

On Aug 24, 2017, 6:04 AM -0500, Tom Augspurger ***@***.***>, wrote: No word yet. Really, we should be using https://github.com/spulec/moto to mock all our S3 calls. Now that s3fs (and boto3) are handling all the network stuff, we don't need to be testing it. Just our text parsing. I'll open an issue about that (and you can look at fixing it if you're interested :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

@TomAugspurger

Thanks, I'd be totally fine with just using moto for your test for now, and following up moving most of the other S3 tests to use moto.

You'll need to edit one of the ci files to include moto. Probably ci/requirements-3.6.run

@alph486

Ah, thanks for that. I hope to get to trying that locally today, I'll update here when I know. Thanks, Kevin

On Aug 24, 2017, 8:28 AM -0500, Tom Augspurger ***@***.***>, wrote: Thanks, I'd be totally fine with just using moto for your test for now, and following up moving most of the other S3 tests to use moto. You'll need to edit one of the ci files to include moto. Probably ci/requirements-3.6.run — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

@alph486

Any guidance on which requirements files I should change for both release packaging, development, and CI? (I currently have ci/requirements-3.6.run and requirements_dev.txt). I just have to add s3fs to some things and moto where appropriate.

@jreback

you shouldn't add s3fs anywhere

@alph486

Unless I'm misunderstanding: you cannot run any of the test suites that require pandas.io.s3 locally unless s3fs is part of requirements_dev.txt no?

Edit:

For example from my attempt of running pytest pandas/tests/io/json/test_pandas.py -v:


pandas/tests/io/json/test_pandas.py:996:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/io/json/json.py:324: in read_json
    encoding=encoding)
pandas/io/common.py:208: in get_filepath_or_buffer
    from pandas.io import s3
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """ s3 support for remote file interactivity """
    from pandas import compat
    try:
        import s3fs
        from botocore.exceptions import NoCredentialsError
    except:
>       raise ImportError("The s3fs library is required to handle s3 files")
E       ImportError: The s3fs library is required to handle s3 files

pandas/io/s3.py:7: ImportError
=====================================

Kevin Kuhl added 2 commits

September 22, 2017 09:37

@TomAugspurger

@alph486

@TomAugspurger Yeah, but part of those unused imports are a backwards way of me leveraging the s3_resource that arent necessarily just cleanup. I replied to a comment of yours earlier on about using s3_resource. I'd need to move it and potentially refactor some of that code to use it in my module. Can you check that out and lmk your thoughts?

@TomAugspurger

@TomAugspurger

@alph486 sorry I missed your comment earlier. I'll push a commit in a minute that refactors the s3_resource stuff.

@alph486

Ok thanks, sorry about it taking so long. I'm not super confident in here.

@TomAugspurger

@TomAugspurger

OK pushed two commits. One gets you back up to date with master, and 6979fb8 is the actual changes. I added a pandas/tests/io/conftest.py with the s3_resource that both places use now.

If you git pull to test locally, you'll have to rebuild the C extensions (`python setup.py build_ext --inplace). Hopefully everything is OK though.

@alph486

@jreback are you okay with the changes made per your review?

@TomAugspurger

Should be all good. ping on green @alph486

jreback

@@ -1,2 +1,3 @@
xarray==0.9.1
pandas-gbq

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needds to be reverted

@@ -5,4 +5,4 @@ cython
pytest>=3.1.0
pytest-cov
flake8

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this, s3fs is NOT a requirement for dev; we should be robust to not having this installed

@@ -26,3 +26,4 @@ sqlalchemy
bottleneck
pymysql

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is ok

"""
If PY3 and/or isinstance(json, bytes)
"""
if isinstance(json, bytes):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 line comment

@@ -0,0 +1,2 @@
{"a": 1, "b": 2}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this file?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, ok you have to have this named .json otherwise it won't be picked up by setup.py (IOW the install test will fail).

@pytest.fixture(scope='module')
def tips_file():
return os.path.join(tm.get_data_path(), 'tips.csv')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool

jreback

@@ -0,0 +1,2 @@
{"a": 1, "b": 2}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, ok you have to have this named .json otherwise it won't be picked up by setup.py (IOW the install test will fail).

@jreback

@jreback

@jreback

@jreback jreback added the Compat

pandas objects compatability with Numpy or Python functions

label

Nov 27, 2017

jreback

@jreback

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request

Dec 8, 2017

@TomAugspurger

Labels

Compat

pandas objects compatability with Numpy or Python functions

IO JSON

read_json, to_json, json_normalize