COMPAT: reading json with lines=True from s3, xref #17200 by alph486 · Pull Request #17201 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation81 Commits22 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Attempt to decode the bytes array with encoding passed to the call.

@alph486 : Thanks for the PR! You will need to write tests for us to merge this. I would also take a look at what @TomAugspurger suggested in terms of patching this.

Is there a particular way you guys go about handling s3 in your tests? I see some files in a bucket somewhere, but idk if there are any jsonl type files or how to add one. For that matter, I don't see any tests for read_json where the file is in s3.

Hello @alph486! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 27, 2017 at 00:04 Hours UTC

@gfyoung After going back to clean this up I realized jsonl from URL based file also suffers from the same issue. I have a cleaner fix that I think works in both cases.

Can you let me know how I should proceed for getting a JSONL file uploaded to the public s3 bucket for testing this in the test suite? I don't have write access.

Can you let me know how I should proceed for getting a JSONL file uploaded to the public s3 bucket for testing this in the test suite? I don't have write access.

@TomAugspurger @jreback : thoughts?

if you put the file on a gist somewhere @TomAugspurger or I can upload to the bucket.

jreback changed the title~~Fix for #17200~~ COMPAT: reading json with lines=True from s3, xref #17200

Aug 8, 2017

@jreback @gfyoung Thanks for the assist here. I'll wrap this up soon; just after I get my current project completed that requires this change.

@alph486 : Are you still waiting on the updated URL to patch the failing tests?

@gfyoung yeah, i just need a target s3 url to put in my code

yeah I don't think I have this either. @TomAugspurger IIRC you can login using the Continuum dashboard.

if that works we should prob change it :>

sorry to keep nudging here, but any progress from IT @TomAugspurger ? Id love to try to a patched official version out so we can use our feature in production.

No word yet.

Really, we should be using https://github.com/spulec/moto to mock all our S3 calls. Now that s3fs (and boto3) are handling all the network stuff, we don't need to be testing it. Just our text parsing. I'll open an issue about that (and you can look at fixing it if you're interested :)

#17325

I can try to implement my tests using moto... we'll see if that works Thanks, Kevin

…

On Aug 24, 2017, 6:04 AM -0500, Tom Augspurger ***@***.***>, wrote: No word yet. Really, we should be using https://github.com/spulec/moto to mock all our S3 calls. Now that s3fs (and boto3) are handling all the network stuff, we don't need to be testing it. Just our text parsing. I'll open an issue about that (and you can look at fixing it if you're interested :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Thanks, I'd be totally fine with just using moto for your test for now, and following up moving most of the other S3 tests to use moto.

You'll need to edit one of the ci files to include moto. Probably ci/requirements-3.6.run

Ah, thanks for that. I hope to get to trying that locally today, I'll update here when I know. Thanks, Kevin

…

On Aug 24, 2017, 8:28 AM -0500, Tom Augspurger ***@***.***>, wrote: Thanks, I'd be totally fine with just using moto for your test for now, and following up moving most of the other S3 tests to use moto. You'll need to edit one of the ci files to include moto. Probably ci/requirements-3.6.run — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Any guidance on which requirements files I should change for both release packaging, development, and CI? (I currently have ci/requirements-3.6.run and requirements_dev.txt). I just have to add s3fs to some things and moto where appropriate.

you shouldn't add s3fs anywhere

Unless I'm misunderstanding: you cannot run any of the test suites that require pandas.io.s3 locally unless s3fs is part of requirements_dev.txt no?

Edit:

For example from my attempt of running pytest pandas/tests/io/json/test_pandas.py -v:


pandas/tests/io/json/test_pandas.py:996:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/io/json/json.py:324: in read_json
    encoding=encoding)
pandas/io/common.py:208: in get_filepath_or_buffer
    from pandas.io import s3
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """ s3 support for remote file interactivity """
    from pandas import compat
    try:
        import s3fs
        from botocore.exceptions import NoCredentialsError
    except:
>       raise ImportError("The s3fs library is required to handle s3 files")
E       ImportError: The s3fs library is required to handle s3 files

pandas/io/s3.py:7: ImportError
=====================================

Kevin Kuhl added 2 commits

September 22, 2017 09:37

@TomAugspurger Yeah, but part of those unused imports are a backwards way of me leveraging the s3_resource that arent necessarily just cleanup. I replied to a comment of yours earlier on about using s3_resource. I'd need to move it and potentially refactor some of that code to use it in my module. Can you check that out and lmk your thoughts?

@alph486 sorry I missed your comment earlier. I'll push a commit in a minute that refactors the s3_resource stuff.

Ok thanks, sorry about it taking so long. I'm not super confident in here.

OK pushed two commits. One gets you back up to date with master, and 6979fb8 is the actual changes. I added a pandas/tests/io/conftest.py with the s3_resource that both places use now.

If you git pull to test locally, you'll have to rebuild the C extensions (`python setup.py build_ext --inplace). Hopefully everything is OK though.

@jreback are you okay with the changes made per your review?

Should be all good. ping on green @alph486