BUG: properly close files opened by parsers by agraboso · Pull Request #13940 · pandas-dev/pandas


Conversation


agraboso

Addresses most of the ResourceWarnings from #13932. There were three tests that opened files but did not close them. The rest were due to parsers (CParserWrapper and PythonParser) opening files or streams but never closing them. There was one case (the mmap one) that was tricky because a file handle went into it and immediately reached a refcount of zero.
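For context, here is a minimal sketch of that mmap pitfall (illustrative only, not the parser code itself):

import mmap

# Illustrative sketch: the file object `f` is only referenced locally, so it is
# garbage-collected (and closed, with a ResourceWarning) as soon as the function
# returns. The map itself keeps working, since mmap duplicates the descriptor.
def open_mapped(path):
    f = open(path, 'rb')
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Keeping the handle alive alongside the map avoids the premature close; the
# caller then closes both explicitly when done.
def open_mapped_safely(path):
    f = open(path, 'rb')
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ), f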

I am at a loss about writing new tests for this. Is it even possible?

@agraboso

Argh, there are still some ResourceWarnings left... Let me investigate.

jreback

@@ -972,3 +972,5 @@ Bug Fixes
- Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
- Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)
- Bug where files opened by parsers (used by `pd.read_csv`, e.g.) were not closed properly. (:issue:`13940`)


Double backticks around pd.read_csv. Put this next to the other csv bug fixes.


Say when these were not closed properly (e.g. nrows not None; when else?).


are we sure that on iteration (e.g. chunksize / iterator specified) we are closing properly? (e.g. I am not sure if it closes, then re-opens each iteration).

cc @gfyoung do you know?


It does not re-open after each iteration, AFAIK. You can see for yourself by writing a CSV file, opening it with read_csv and passing in chunksize, and then trying to delete the file in a separate window. You should be able to do so.


Indeed, when getting an iterator (chunksize or iterator specified) we do not close the file. When the iterator has been consumed and raises StopIteration, we close the file. If a user requests an iterator but does not consume it fully, s/he must call .close() on it — though we could define a __del__ and take care of that.
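As a rough illustration of that contract (a sketch with a placeholder file name, not code from this PR):

import pandas as pd

# With chunksize, read_csv returns an iterator and keeps the file open.
reader = pd.read_csv('data.csv', chunksize=1000)
for chunk in reader:
    pass  # the file is closed once the iterator raises StopIteration

# If the iterator is abandoned part-way through, close it explicitly.
reader = pd.read_csv('data.csv', chunksize=1000)
first_chunk = reader.get_chunk()
reader.close()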

@codecov-io

Current coverage is 85.31% (diff: 61.44%)

Merging #13940 into master will increase coverage by 0.02%

@@             master   #13940    diff @@
   Files          139      139
   Lines        50162    50370    +208
   Methods          0        0
   Messages         0        0
   Branches         0        0

Powered by Codecov. Last update 5df9123...3fa7d25

gfyoung

            self._engine._reader.close()
        except:
            pass
        self._engine.close()


Why can we get rid of this try-except block?


I shouldn't have removed it. But this was only used by CParserWrapper, so I have put it there now in c7e9c9c.
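For reference, the relocated guard might look roughly like this, as a close method on CParserWrapper (a sketch following the discussion, not the verbatim content of c7e9c9c):

    def close(self):
        # the underlying C reader may already be closed, or may never have been
        # fully initialized, so swallow errors from the low-level close
        try:
            self._reader.close()
        except:
            pass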

@agraboso

@jreback The mmap one has been fixed since the first commit; I was just making the point that it took me a while to understand what was wrong.

I have fixed the stata cases too.

jreback

    if read_len <= 0:
        # Iterator has finished, should never be here unless
        # we are reading the file incrementally
        try:


You don't need the try/except over everything here. Please narrow it down as much as possible.


I narrowed it down to the few places where an explicit ValueError might come from another StataReader method. It catches the cases that are in the tests, but it is possible for other lines to raise an error that is not caught, which would leave an open file handle.

@jreback

I don't want to just wrap things at a high level; these should be as granular as possible. These can be actual errors (or errors in the tests, it's hard to tell). Let uncaught ResourceWarnings show through (and then we can investigate them individually).

@agraboso

Notice that my proposal of wrapping most of StataReader.read in a try/except block had both a self.close() and a raise, so that we would still see all warnings and errors while making sure we do close the file.
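That is, the proposed shape was roughly the following (a sketch; _do_read is a hypothetical stand-in for the existing body of StataReader.read, not real API):

def read(self, nrows=None, **kwargs):
    try:
        data = self._do_read(nrows, **kwargs)  # hypothetical helper for the real body
    except Exception:
        self.close()  # make sure the file handle is released
        raise         # but still let the original error or warning surface
    return data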

jreback

@@ -253,6 +253,9 @@ def __init__(self, filepath_or_buffer, index=None, encoding='ISO-8859-1',
         self._read_header()
+
+    def close(self):
+        self.filepath_or_buffer.close()


is this always a filehandle-like?


It's either a file handle coming from L244 or a BytesIO object coming from L252. So it is always safe to close it.

@jreback

hah! this turned into a rabbit hole! lmk when you are ready.

@agraboso

hah! this turned into a rabbit hole!

Tell me about it! It's possible that other parsers leave open files too, but I'd leave that for another PR. I think this one is ready.

lmk when you are ready.

And it's all green!

@jreback

Thanks @agraboso, I merged. Left a couple of comments that might be relevant for the next PR :)

We need to try to fix the remaining ResourceWarnings, then raise if ANY are detected (it's actually pretty easy). In pandas.util.testing.TestCase.setUpClass you can set the warning to error. This way it's not set globally when testing_mode() is called (in case testing is just imported by a user and not actually run in the test suite).
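A minimal sketch of that setup, assuming the standard warnings module and Python 3 (where ResourceWarning exists):

import unittest
import warnings

class TestCase(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # escalate ResourceWarning to an error only while the suite runs, so
        # merely importing pandas.util.testing leaves global state untouched
        warnings.simplefilter('error', ResourceWarning)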

jreback

@@ -393,11 +393,15 @@ def _read(filepath_or_buffer, kwds):
         raise NotImplementedError("'nrows' and 'chunksize' cannot be used"
                                   " together yet.")
     elif nrows is not None:
-        return parser.read(nrows)
+        data = parser.read(nrows)


if you do

try:
    return parser.read(nrows)
finally:
    parser.close()

I think you can eliminate the .close() calls where you raise on an exception (this also needs changing at L402), and it's slightly more idiomatic.

That way exceptions will bubble up, but we will still close.
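Applied in context, the nrows branch of _read would then read roughly as follows (a sketch of the suggestion, not the merged diff):

elif nrows is not None:
    try:
        return parser.read(nrows)
    finally:
        parser.close()

The finally clause runs whether read succeeds or raises, so the handle is released on both paths without explicit .close() calls in the error branches.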