BUG: GH11786 Thread safety issue with read_csv by jdeschenes · Pull Request #11790 · pandas-dev/pandas

closes #11786

Fixed an issue with thread safety when calling read_csv with a StringIO object.

The issue was caused by a misplaced PyGILState_Ensure().

@jdeschenes great! Can you add the example from the issue as a smoke test (e.g. just have it run), then read it in with a single thread and compare?

and pls add a release note when you are satisfied.

Alright, I did this quickly as I don't have time to work on this right now. How long until next release?

@jdeschenes oh, we have a while... when you have a chance. Thanks!

@jdeschenes if you can have a look at this again, that would be great.

jreback

assert that the values read in match a single-threaded reader (e.g. compare frames).
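A minimal sketch of the kind of comparison being asked for, not the test that was merged. The sample data and the helper names `read_one` and `read_threaded` are assumptions made up for illustration; `pandas.testing.assert_frame_equal` is the comparison helper in recent pandas versions.

```python
import io
from multiprocessing.pool import ThreadPool

import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical sample data: 200 rows of three integer columns.
CSV_TEXT = "a,b,c\n" + "\n".join(f"{i},{i * 2},{i * 3}" for i in range(200))


def read_one(text):
    # Each call builds its own buffer, so threads never share a file object.
    return pd.read_csv(io.StringIO(text))


def read_threaded(texts, num_threads=4):
    # ThreadPool.map preserves input order, so results line up with texts.
    with ThreadPool(num_threads) as pool:
        return pool.map(read_one, texts)


expected = read_one(CSV_TEXT)              # single-threaded reference
for frame in read_threaded([CSV_TEXT] * 16):
    assert_frame_equal(frame, expected)    # raises on any mismatch
```

A thread-safety bug of this kind tends to show up as a crash or a corrupted frame only intermittently, so a passing run is a smoke test and not proof of correctness.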

Thank you both for keeping up on this.

@jdeschenes IIRC this issue is reproducible with actual files. Is that not the case? Is it only StringIO/BytesIO? Are they not thread-safe?

Hi @jreback,

the issue is solely reproducible with StringIO. The root cause of this bug is in the function buffer_rd_bytes in pandas/src/parser/io.c. This function is only used when a StringIO/BytesIO is passed to the read_csv function.

The function was calling Py_XDECREF before ensuring that the thread had the GIL. This behavior could not be seen before since the GIL was always locked throughout the read_csv function call.

I am not aware of any issues when reading from disk and this pull request will not fix any problem related to this.

I think that the release notes should be kept as is.

Let me know what you think.

ok, can you add a test that validates that reading from disk with multiple threads is ok (so we don't regress)?
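One way such a regression check could look, sketched under assumptions: the file contents, the helper names `write_sample_csv` and `read_file_threaded`, and the thread counts are all made up here and are not taken from the pull request.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool

import pandas as pd


def write_sample_csv(path, rows=500):
    # Hypothetical sample data, written once and read many times.
    frame = pd.DataFrame({"x": range(rows), "y": [r * 0.5 for r in range(rows)]})
    frame.to_csv(path, index=False)


def read_file_threaded(path, num_reads=8, num_threads=4):
    # Every task receives the path and lets pandas open its own handle.
    with ThreadPool(num_threads) as pool:
        return pool.map(pd.read_csv, [path] * num_reads)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    write_sample_csv(csv_path)
    baseline = pd.read_csv(csv_path)          # single-threaded reference
    threaded = read_file_threaded(csv_path)   # same file, several threads

all_match = all(frame.equals(baseline) for frame in threaded)
assert all_match
```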

jreback

use double backticks around StringIO

pls run git diff master | flake8 --diff, as much PEP checking has been done on these files.

FWIW using BytesIO has actual use cases in distributed computing, it isn't just a test case.

Many parallel storage systems won't give you access to the hard disk but will instead deliver a bunch of bytes. In this case the best way I've found to use pd.read_csv is to hand it a BytesIO object.
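A rough illustration of that pattern. `fetch_block` is a hypothetical stand-in for whatever storage client hands back raw bytes; it is not a real API, and the two-column payload is invented.

```python
import io

import pandas as pd


def fetch_block(block_id):
    # Hypothetical storage call: returns CSV bytes instead of a file path.
    header = b"id,value\n"
    rows = b"".join(b"%d,%d\n" % (block_id, n) for n in range(3))
    return header + rows


def block_to_frame(block_id):
    # Wrapping the payload in BytesIO gives read_csv a file-like object.
    return pd.read_csv(io.BytesIO(fetch_block(block_id)))


# Parse four blocks and stitch them into one frame.
combined = pd.concat([block_to_frame(b) for b in range(4)], ignore_index=True)
```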

@mrocklin oh of course. just covering the bases. I suspect people have tried multi-threading to read files as well :)

It would be very interesting to see if there is any benefit in using a ThreadPool for reading from a BytesIO. We are spending a lot of time holding the GIL, thanks to the buffer_rd_bytes function. It should probably be benchmarked.

I have a suspicion that it doesn't help at all (it might even be a net loss).
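A sketch of how such a benchmark might be set up, assuming synthetic payloads and arbitrary sizes; the helper names are invented for illustration. No result is claimed here, since the outcome depends on the pandas version, the data, and the machine.

```python
import io
import time
from multiprocessing.pool import ThreadPool

import pandas as pd


def make_payload(rows=20000):
    # Hypothetical CSV payload, encoded the way a byte store would deliver it.
    lines = ["a,b,c"] + [f"{i},{i + 1},{i + 2}" for i in range(rows)]
    return "\n".join(lines).encode("utf-8")


def parse(payload):
    return pd.read_csv(io.BytesIO(payload))


def time_serial(payloads):
    start = time.perf_counter()
    frames = [parse(p) for p in payloads]
    return time.perf_counter() - start, frames


def time_threaded(payloads, num_threads=4):
    # The pool is created before the clock starts, so only parsing is timed.
    with ThreadPool(num_threads) as pool:
        start = time.perf_counter()
        frames = pool.map(parse, payloads)
        elapsed = time.perf_counter() - start
    return elapsed, frames


payloads = [make_payload()] * 8
serial_seconds, serial_frames = time_serial(payloads)
threaded_seconds, threaded_frames = time_threaded(payloads)
print(f"serial:   {serial_seconds:.3f}s")
print(f"threaded: {threaded_seconds:.3f}s")
```

Repeating the measurement several times and varying the payload size would give a fairer picture than a single run.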

I added the test for the file read. I didn't do it for the BytesIO. The code would effectively look a lot like what I did up top... Grabbing a list of BytesIO and processing them in a ThreadPool. I can take a look at this a bit later, if that is required.

@jdeschenes thanks!

certainly would take additional benchmarks / fixes!

jreback added a commit that referenced this pull request

Jan 19, 2016

Hey all - any estimate of when this will go out in a production release? Encountering this bug very, very frequently with 0.17.1, and would like to get back up to a newer version of pandas again soon.

Thanks

planning on a RC in about 2 weeks, so release should be roughly mid-feb or so
