Improvements for read_csv from AWS S3 · Issue #11070 · pandas-dev/pandas

Description

I frequently find myself interacting with CSV files stored in Amazon's S3 service, and have run into a few areas where I think small improvements in read_csv could be a big help.

- Allow streaming reads (#11073, "ENH Enable streaming from S3")
  This is the most important improvement for me. The current pandas code downloads the entire file from S3 before passing it to the parser. If I have a 6 GB file in S3, I shouldn't need to download the whole thing just to check the first few rows with the "nrows" keyword to read_csv, or to process the file one chunk at a time using "chunksize". We can iterate through a file on disk in these ways, but not currently through a file in S3.
- Infer compression type from S3 filenames (#11074, "ENH Add check for inferred compression before get_filepath_or_buffer")
  If an S3 filename ends with ".gz" or ".bz2", the parser should be able to infer the compression type, just as it does for a file on disk.
- Streaming bz2 reads, C parser bz2 reads (#11072, "ENH Enable bzip2 streaming for Python 3")
  Currently, the C parser refuses to read from open bz2-compressed file objects at all, and the Python parser decompresses the entire file before continuing, so a potentially large file again has to be read in full before any work is done.
- Recognize "s3n" (EDIT: and "s3a") (#11071, "ENH Recognize 's3n' and 's3a' as an S3 address")
  I've only run into this when using Spark, and I admit I don't fully understand the difference. It seems that S3 files can be accessed via "s3://" or "s3n://" ("S3 native"). It would be useful for pandas to recognize both. Some notes I found:
  https://wiki.apache.org/hadoop/AmazonS3
  http://notes.mindprince.in/2014/08/01/difference-between-s3-block-and-s3-native-filesystem-on-hadoop.html
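
To make the streaming items above concrete, here is a minimal sketch using only the Python 3 standard library. It is not the pandas implementation: `first_rows`, `iter_chunks` and `_OPENERS` are hypothetical names, the `csv` module stands in for the pandas parser, and `raw` is assumed to be any binary file-like object (for example, the body of an S3 response). The point is that decompression and parsing pull bytes lazily, so a call in the spirit of "nrows" or "chunksize" never needs the whole object.

```python
import bz2
import csv
import gzip
import io
from itertools import islice

# Hypothetical lookup table for this sketch. Both bz2.open and gzip.open
# accept an already-open binary file object and decompress it lazily.
_OPENERS = {"gzip": gzip.open, "bz2": bz2.open}


def first_rows(raw, nrows, compression):
    """Return the first `nrows` CSV rows (header included) of a compressed stream.

    Only as much of `raw` is read as the decompressor needs to produce
    those rows; the rest of the stream is left untouched.
    """
    opener = _OPENERS[compression]
    with opener(raw, mode="rt", encoding="utf-8", newline="") as text:
        return list(islice(csv.reader(text), nrows))


def iter_chunks(raw, chunksize, compression):
    """Yield lists of at most `chunksize` rows until the stream is exhausted."""
    opener = _OPENERS[compression]
    with opener(raw, mode="rt", encoding="utf-8", newline="") as text:
        reader = csv.reader(text)
        while True:
            chunk = list(islice(reader, chunksize))
            if not chunk:
                return
            yield chunk


if __name__ == "__main__":
    # io.BytesIO stands in for a remote object here.
    payload = bz2.compress(b"a,b\n1,2\n3,4\n5,6\n")
    print(first_rows(io.BytesIO(payload), 2, "bz2"))
```

Because bz2 data is organized in blocks, the reader still has to consume at least one full compressed block before it can return the first row, but that is far less than the whole file for a multi-gigabyte object.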

I will open PRs to address each of these.
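
For the two filename-based items (compression inference and the extra URL schemes), the intended behavior can be sketched as follows. This is an illustration under assumptions, not pandas code: `S3_SCHEMES`, `is_s3_url`, `split_s3_url` and `infer_compression` are hypothetical names, and the returned strings "gzip" and "bz2" are assumed to match the values of read_csv's "compression" keyword.

```python
import os
from urllib.parse import urlparse

# Hypothetical constants for this sketch.
S3_SCHEMES = {"s3", "s3n", "s3a"}
_EXTENSION_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2"}


def is_s3_url(url):
    """Return True if `url` uses one of the S3 schemes listed above."""
    return urlparse(url).scheme in S3_SCHEMES


def split_s3_url(url):
    """Split an S3 URL into (bucket, key), whichever of the schemes it uses."""
    parsed = urlparse(url)
    if parsed.scheme not in S3_SCHEMES:
        raise ValueError("not an S3 URL: %r" % (url,))
    return parsed.netloc, parsed.path.lstrip("/")


def infer_compression(url_or_path):
    """Infer the compression type from the file extension, or return None.

    Works the same way for a local path and for an S3 URL, because only
    the path component is inspected.
    """
    path = urlparse(url_or_path).path
    extension = os.path.splitext(path)[1].lower()
    return _EXTENSION_TO_COMPRESSION.get(extension)
```

Running the inference on the name before the object is fetched is what lets the same rule apply to both "s3://bucket/data.csv.gz" and a plain "data.csv.gz" on disk.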
