BUG: properly close files opened by parsers by agraboso · Pull Request #13940 · pandas-dev/pandas
Conversation (28) · Commits (14) · Checks (0) · Files changed
Conversation
- closes CLN: resource warnings #13932
- tests passed
- passes `git diff upstream/master | flake8 --diff`
- whatsnew entry
Addresses most of the `ResourceWarning`s from #13932. There were three tests that opened files but did not close them. The rest were due to parsers (`CParserWrapper` and `PythonParser`) opening files or streams but never closing them. There was one tricky case, the `mmap` one, because a file handle went into it and immediately reached a refcount of zero.

I am at a loss about writing new tests for this. Is it even possible?

Argh, there are still some `ResourceWarning`s left... Let me investigate.
@@ -972,3 +972,5 @@ Bug Fixes
- Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
- Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)
- Bug where files opened by parsers (used by `pd.read_csv`, e.g.) were not closed properly. (:issue:`13940`)
double backticks around `pd.read_csv`. Put this next to the other csv bug fixes.
say when these were not closed properly (e.g. nrows not None; when else?)
are we sure that on iteration (e.g. chunksize / iterator specified) we are closing properly? (e.g. I am not sure if it closes, then re-opens each iteration).
cc @gfyoung do you know?
It does not re-open after each iteration AFAIK. You can see for yourself by writing to a CSV file, opening it with `read_csv` and passing in `chunksize`. Try then deleting the file in a separate window. You should be able to do so.
Indeed, when getting an iterator (`chunksize` or `iterator` specified) we do not close the file. When the iterator has been consumed and raises `StopIteration`, we close the file. If a user requests an iterator but does not consume it fully, they must call `.close()` on it, though we could define a `__del__` and take care of that.
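A minimal sketch of that behavior using pandas' public API (nothing here is the PR's internal code; the temp-file setup is just for illustration):

```python
import os
import tempfile

import pandas as pd

# Write a small CSV, then read it back in chunks.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
pd.DataFrame({"a": range(10)}).to_csv(path, index=False)

# Fully consuming the iterator closes the file on StopIteration.
reader = pd.read_csv(path, chunksize=3)
chunks = [chunk for chunk in reader]

# A partially consumed iterator holds the file open until .close()
# is called explicitly: that is the caller's responsibility.
reader = pd.read_csv(path, chunksize=3)
first = next(reader)
reader.close()
```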
Current coverage is 85.31% (diff: 61.44%)

@@            master   #13940    diff @@
  Files          139      139
  Lines        50162    50370    +208
  Methods          0        0
  Messages         0        0
  Branches         0        0
- Hits         42785    42975    +190
- Misses        7377     7395     +18
  Partials         0        0

Powered by Codecov. Last update 5df9123...3fa7d25
    try:
        self._engine._reader.close()
    except:
        pass
    self._engine.close()
Why can we get rid of this `try`/`except` block?
I shouldn't have removed it. But this was only used by `CParserWrapper`, so I have put it there now in c7e9c9c.
@jreback The `mmap` one is fixed since the first commit; I was just making the point that it took me a while to understand what was wrong.
I have fixed the `stata` cases too:

- `StataWriter` was easy: I moved the opening of the file to the `write_file` method. The file is thus always opened and closed within it, whether errors are raised or not.
- `StataReader` did not close the file when raising an error. I fixed it by wrapping most of the logic in its `read` method in a `try/except` block. This even takes care of closing the file when a `StopIteration` is raised.
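The writer fix can be sketched like this (a hypothetical function, not pandas' actual `StataWriter`): opening inside the writing method means the handle is always released, even when an error is raised mid-write.

```python
import os
import tempfile

def write_rows(path, rows):
    # The file is opened and closed within this method; the context
    # manager closes it on success and on error alike.
    with open(path, "w") as fh:
        for row in rows:
            fh.write(",".join(map(str, row)) + "\n")

path = os.path.join(tempfile.mkdtemp(), "out.csv")
write_rows(path, [(1, 2), (3, 4)])
```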
    if read_len <= 0:
        # Iterator has finished, should never be here unless
        # we are reading the file incrementally
        try:
you don't need the try/except over everything here. pls narrow it down as much as possible.
I narrowed it down to the few places where an explicit `ValueError` might come from another `StataReader` method. It catches the cases that are in the tests, but it is possible for other lines to raise an error that is not caught, which would leave an open file handle.
I don't want to just wrap things at a high level; these should be as granular as possible. These can be actual errors (or errors in the tests, it's hard to tell). Let uncaught resource warnings show through (and then we can investigate these individually).
Notice that my proposal of wrapping most of `StataReader.read` in a `try/except` block had both a `self.close()` and a `raise`, so that we would still see all warnings and errors while making sure we do close the file.
@@ -253,6 +253,9 @@ def __init__(self, filepath_or_buffer, index=None, encoding='ISO-8859-1',
        self._read_header()

    def close(self):
        self.filepath_or_buffer.close()
is this always a filehandle-like?
It's either a file handle coming from L244 or a `BytesIO` object coming from L252. So it is always safe to close it.
hah! this turned into a rabbit hole! lmk when you are ready.
> hah! this turned into a rabbit hole!
Tell me about it! It's possible that other parsers leave open files too, but I'd leave that for another PR. I think this one is ready.
> lmk when you are ready.
And it's all green!
thanks @agraboso, I merged. Left a couple of comments that might be relevant for the next PR :)
we need to try to fix the remaining `ResourceWarning`s, then raise if ANY are detected (it's actually pretty easy). In `pandas.util.testing.TestCase.setUpClass` you can set the warning to error. This way it's not set globally when `testing_mode()` is called (in case `testing` is just imported by a user and not actually run in the test suite).
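A sketch of that idea, assuming a test base class; the class name here is illustrative, while the actual hook would live in `pandas.util.testing.TestCase.setUpClass`:

```python
import unittest
import warnings

class WarningsAsErrorsTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # From here on, an unclosed file fails the test suite
        # instead of merely printing a ResourceWarning. Because the
        # filter is set in the test base class, importing the
        # library alone does not change a user's warning settings.
        warnings.simplefilter("error", ResourceWarning)

    @classmethod
    def tearDownClass(cls):
        warnings.simplefilter("default", ResourceWarning)
```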
@@ -393,11 +393,15 @@ def _read(filepath_or_buffer, kwds):
        raise NotImplementedError("'nrows' and 'chunksize' cannot be used"
                                  " together yet.")
    elif nrows is not None:
-       return parser.read(nrows)
+       data = parser.read(nrows)
if you do

    try:
        return parser.read(nrows)
    finally:
        parser.close()

I think you can eliminate the `.close()` calls where you raise on an exception (also needs changing at L402). And it's slightly more idiomatic. That way exceptions will bubble up, but we will still close.
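A runnable sketch of that `try`/`finally` suggestion (`FakeParser` and `read_and_close` are hypothetical stand-ins, not the pandas internals): `close()` runs whether `read()` returns normally or raises, so the exception still propagates to the caller.

```python
class FakeParser:
    """Hypothetical stand-in for the internal parser object."""
    def __init__(self, fail=False):
        self.fail = fail
        self.closed = False

    def read(self, nrows):
        if self.fail:
            raise ValueError("bad input")
        return list(range(nrows))

    def close(self):
        self.closed = True

def read_and_close(parser, nrows):
    try:
        return parser.read(nrows)
    finally:
        parser.close()  # runs on both the return and the raise path
```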