pd.DataFrame.to_csv('filename.zip') doesn't extract with a '.csv' extension

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'name': ['Raphael', 'Donatello'], 'mask': ['red', 'purple']})

When trying to create a compressed csv, these give odd results.

df.to_csv('out.csv', compression='zip')  # --> zip file named 'out.csv' containing csv file 'out.csv'
df.to_csv('out.zip')  # --> zip file named 'out.zip' containing csv file 'out.zip'
df.to_csv('out.csv.zip')  # --> zip file named 'out.csv.zip' containing csv file 'out.csv.zip'

This would be the desired behaviour if we had an 'arcname' parameter, like the one accepted by zipfile.ZipFile.write(filename, arcname) and zipfile.ZipFile.writestr(zinfo_or_arcname, data):

df.to_csv('out.zip', arcname='data.csv') # --> zip file named 'out.zip' containing csv file 'data.csv'
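
Until such a parameter exists, the same result can be approximated with the standard library. The sketch below is a workaround, not part of pandas: the helper name write_csv_zip is made up for illustration, and it relies only on zipfile.ZipFile.writestr and on to_csv() returning the CSV as a string when no path is given.

```python
import zipfile


def write_csv_zip(csv_text, zip_path, arcname):
    """Write csv_text into a new zip archive under the member name arcname.

    Hypothetical helper for illustration; not a pandas function.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # writestr takes the member name first, then the data to store.
        zf.writestr(arcname, csv_text)


# With pandas, to_csv() called without a path returns the CSV text:
#   write_csv_zip(df.to_csv(index=False), 'out.zip', 'data.csv')
```

This builds the whole CSV in memory before compressing it, which is fine for small frames but may not suit very large ones.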

Problem description

When pd.DataFrame.to_csv writes a zip-compressed file, the csv file inside the archive always gets the same name as the zip archive itself. This is a problem because the archive has a .zip extension, so the extracted file ends up named e.g. 'out.zip' when we want it to have a .csv extension.

Other compression methods meant for a single file, like 'bz2', 'gzip', and 'xz', do not have this problem because a file 'file.csv.gz', for instance, will automatically become 'file.csv' when decompressed.
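
To make the contrast concrete: single-file formats need no member name, because decompression tools by default derive the output name from the compressed file's name by dropping the suffix. The sketch below mimics that default naming with the standard library; the helper gunzip is a made-up name for illustration and is not the command-line tool itself.

```python
import gzip
import shutil


def gunzip(path):
    """Decompress path next to itself, dropping the '.gz' suffix.

    Hypothetical helper that imitates the default naming of the gunzip tool.
    """
    if not path.endswith(".gz"):
        raise ValueError("expected a path ending in '.gz'")
    target = path[:-len(".gz")]  # 'file.csv.gz' -> 'file.csv'
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

A zip archive has no equivalent rule, since each member carries its own stored name, which is why the name pandas chooses matters.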

This should be relatively easy to fix: add an arcname=None parameter to to_csv, pass it through pandas.io.formats.csvs.CSVFormatter to pandas.io.formats.csvs._get_handle, and use it instead of ZipFile.filename when it is provided.
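
A rough sketch of what the last step could look like is below. It is not pandas' actual code: the class name ArcnameZipFile and its arcname argument are assumptions made for illustration, modelled on a zipfile.ZipFile subclass whose write() method stores the data under a single member name.

```python
import zipfile


class ArcnameZipFile(zipfile.ZipFile):
    """Hypothetical zip write handle that accepts an optional arcname."""

    def __init__(self, file, mode="w", arcname=None,
                 compression=zipfile.ZIP_DEFLATED):
        super().__init__(file, mode, compression)
        self.arcname = arcname

    def write(self, data):
        # Use arcname when given; otherwise fall back to the archive's own
        # filename, which is the behaviour described in this report.
        self.writestr(self.arcname or self.filename, data)
```

Keeping arcname=None as the default would leave existing calls unchanged, so only callers that pass the new argument would see a different member name.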

Expected Output

See comments in Code Sample above for expected output.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-17-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.16.2
scipy: 1.2.1
pyarrow: 0.11.1
xarray: None
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.4.11
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None