BUG: GH11786 Thread safety issue with read_csv by jdeschenes · Pull Request #11790 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation30 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jdeschenes

closes #11786

Fixed an issue with thread safety when calling read_csv with a StringIO object.

The issue was caused by a misplaced PyGilSate_Ensure()

@jreback

@jdeschenes gr8! can you add in the example from the issue as a smoke test. (e.g. just have it run), then read in with a single trhead and compare.

and pls add a release note when you are satisified.

@jdeschenes

Alright, I did this quickly as I don't have time to work on this right now. How long until next release?

@jreback

@jdeschenes oh have a while.....when you have a chance...thanks!

@jreback

@jdeschenes if you can have a look at this again would be great.

@jdeschenes

jreback

files = [BytesIO(b) for b in bytes]
# Read all files in many threads
pool = ThreadPool(8)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert that the read in values match a single threaded reader. (e.g. compare frames)

@mrocklin

Thank you both for keeping up on this.

@jreback

@jdeschenes IIRC this issue is repro with actual files. Is that not the case? is it only StringIO/BytesIO. are they not thread-safe?

@jreback

@jdeschenes

Hi @jreback,

the issue is solely reproducible with StringIO. The root cause of this bug is in function buffer_rd_bytes in
pandas/src/parser/io.c. This function is only used when a StringIO/BytesIO is passed to the read_csv function.

The function was calling Py_XDECREF before ensuring that the thread had the GIL. This behavior could not be seen before since the GIL was always locked throughout the read_csv function call.

I am not aware of any issues when reading from disk and this pull request will not fix any problem related to this.

I think that the release notes should be kept as is.

Let me know what you think.

@jreback

ok, can you add a test that validates the issue that reading from a disk with multiple threads is ok (so we don't regress).

jreback

- Bug in vectorized ``DateOffset`` when ``n`` parameter is ``0`` (:issue:`11370`)
- Compat for numpy 1.11 w.r.t. ``NaT`` comparison changes (:issue:`12049`)
- Bug in ``read_csv`` when reading from a StringIO in threads (:issue:`11790`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks around StringIO

@jreback

pls run git diff master | flake8 --diff as much PEP checking has been one on these files.

@mrocklin

FWIW using BytesIO has actual use cases in distributed computing, it isn't just a test case.

Many parallel storage systems won't give you access to the hard disk but will instead deliver a bunch of bytes. In this case the best way I've found to use pd.read_csv is to hand it a BytesIO object.

@jreback

@mrocklin oh of course. just covering the bases. I suspect people have tried multi-threading to read files as well :)

@jdeschenes

…tringIO object., pandas-dev#11786

The issue was caused by a misplaced PyGilSate_Ensure()

@jdeschenes

It would be very interesting to see if there is any benefit in using a ThreadPool for reading from a BytesIO. We are spending a lot of time into the GIL, thanks to the buffer_rd_bytes function. It should probably be benchmarked.

I have a suspicion that it doesn't help at all(It might be even a net loss).

I added the test for the file read. I didn't do it for the BytesIO. The code would effectively look a lot like what I did up top... Grabbing a list of BytesIO and processing them in a ThreadPool. I can take a look at this a bit later, if that is required.

@jreback

@jdeschenes thanks!

certainly would take addtl benchmarks / fixes!

jreback added a commit that referenced this pull request

Jan 19, 2016

@jreback

@kayvonr

Hey all - any estimate of when this will be go out in a production release? Encountering this bug very very frequently with 0.17.1, and would like to get back up to a newer version of pandas again soon

Thanks

@jreback

planning on a RC in about 2 weeks, so release should be roughly mid-feb or so