CLN/FIX/PERF: Don't buffer entire Stata file into memory by akx · Pull Request #49228 · pandas-dev/pandas (original) (raw)
Fixes #48700
Closes #48922 (supersedes it)
Refs #9245
Refs #37639
Regressed in 6d1541e
- closes StataReader processes whole file before reading in chunks #48700
- Tests added and passed if fixing a bug or adding a new feature
- The existing tests for e.g. roundtripping zstandard-compressed Stata files test this code path.
- Added a test that checks e.g. a fp or BytesIO passed in is the same object the reader reads.
- Added a test that ensures a warning is raised if StataReader isn't being used as a context manager.
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.
This PR is a reworking of my earlier #48922, following @bashtage's suggestions. Unfortunately I couldn't quite make it fit in about 15 lines.
Compared to that PR, StataReader required some refactoring to have it behave as it has done before (see the discussion in the other PR) while also making it possible for it to raise a warning when data is being read without a with block being in use. (This does not take into account an user having code that doesn't use with, but that calls .close() by hand.)
The main refactoring is that all of the internal state of the reader object is now _private, and public bits are accessed via properties; this allows us to make sure the data is actually read when requested, even without reading happening in __init__. (Accessing those properties when the StataReader hasn't been __enter__ed will thus also raise a ResourceWarning.) As a bonus, the properties are now well-typed. Previously, e.g. reader.time_stamp had to rely on inference...
While renaming those properties, I noticed there was a bunch of repeated struct.unpack(...read...)[0] code that would have been black reformatted into very unwieldy lines following the addition of the underscores, so I decided to refactor the repeated code into smaller reading utility functions.
Finally, there's basically the same fix I made in #48922 – if the underlying stream is seekable, we can use it directly.
I also added a commit that removes the implicit automatic closing of the underlying stream, since it shouldn't be necessary anymore. If that should be removed from this PR, let me know.