ENH Enable streaming from S3 by stephen-hoover · Pull Request #11073 · pandas-dev/pandas

File reading from AWS S3: Modify the get_filepath_or_buffer function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the nrows argument) or chunked reading (e.g. through the chunksize argument) without needing to download the entire file first.
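
As a rough illustration of the two read patterns this is meant to help with, here is a minimal sketch using `nrows` and `chunksize`. It reads from an in-memory buffer as a stand-in for an S3 stream (the sample data is made up); with an S3 path the same arguments would apply, and the point of this change is that only the needed part of the file would be fetched.

```python
import io

import pandas as pd

# Made-up sample data standing in for a CSV stored on S3.
csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"

# Partial read: only the first two data rows are parsed.
head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Chunked read: the parser yields DataFrames of at most two rows each.
chunk_lengths = [
    len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
```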

I wasn't sure what the best place was to put the OnceThroughKey. (Suggestions for better names welcome.) I don't like putting an entire class inside a function like that, but this keeps the boto dependency contained.

Both the readline function and the change to next (so that it returns lines) were necessary to allow the Python engine to read uncompressed CSVs.
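
To show the general idea, here is a minimal sketch of a forward-only wrapper that adds `readline` and a line-returning `next` on top of an object that only offers `read(size)`. The class name `LineIterStream` and its `chunk_size` parameter are hypothetical and for illustration only; this is not the code in this PR, and it uses `io.BytesIO` as a stand-in for the boto key.

```python
import io


class LineIterStream(object):
    """Hypothetical wrapper: iterate by line over a forward-only byte stream."""

    def __init__(self, raw, chunk_size=8192):
        self._raw = raw  # any object with read(size) -> bytes
        self._chunk_size = chunk_size
        self._buffer = b""
        self._eof = False

    def readline(self):
        # Pull chunks until the buffer holds a full line or the source ends.
        while b"\n" not in self._buffer and not self._eof:
            chunk = self._raw.read(self._chunk_size)
            if chunk:
                self._buffer += chunk
            else:
                self._eof = True
        line, sep, rest = self._buffer.partition(b"\n")
        self._buffer = rest
        return line + sep  # empty bytes once the stream is exhausted

    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    next = __next__  # Python 2 spelling of the iterator protocol


# A small chunk size forces several reads of the underlying stream.
stream = LineIterStream(io.BytesIO(b"a,b\n1,2\n3,4"), chunk_size=4)
lines = list(stream)
```

Because the wrapper never calls `seek` or `tell`, it can only be consumed once, front to back.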

The Python 2 standard library's gzip module needs seek and tell methods on its input, so for gzip-compressed files on Python 2 I reverted to the old behavior of reading the entire file at once.

Partially addresses #11070.