pd.DataFrame.to_csv('filename.zip') doesn't extract with a '.csv' extension · Issue #26023 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
df = pd.DataFrame({'name': ['Raphael', 'Donatello'], 'mask': ['red', 'purple']})
When trying to create a compressed csv, these give odd results.
df.to_csv('out.csv', compression='zip')  # --> zip file named 'out.csv' containing csv file 'out.csv'
df.to_csv('out.zip')  # --> zip file named 'out.zip' containing csv file 'out.zip'
df.to_csv('out.csv.zip')  # --> zip file named 'out.csv.zip' containing csv file 'out.csv.zip'
This would be the desired behaviour if to_csv had an 'arcname' parameter, like the arcname argument of zipfile.ZipFile.writestr(arcname, data):
df.to_csv('out.zip', arcname='data.csv') # --> zip file named 'out.zip' containing csv file 'data.csv'
Problem description
When pd.DataFrame.to_csv creates a compressed zip file, the csv file inside the archive always gets the same name as the zip archive file itself. This is problematic because the archive has a .zip extension, but the csv file should have a .csv extension when it is extracted.
Other compression methods meant for a single file, like 'bz2', 'gzip', and 'xz', do not have this problem because a file 'file.csv.gz', for instance, automatically becomes 'file.csv' when decompressed.
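To illustrate the difference, here is a small standard-library sketch (no pandas involved). The csv text is made up for the example, and the member name 'out.zip' is chosen by hand only to mimic the behaviour reported above. A gzip stream is just the compressed bytes, and tools such as gunzip derive the output name by dropping the suffix, whereas a zip archive stores an explicit name for each member.

```python
import gzip
import io
import zipfile

# Hypothetical csv payload, standing in for the output of df.to_csv().
csv_text = "name,mask\nRaphael,red\nDonatello,purple\n"

# gzip: one compressed stream, so there is no member name to get wrong.
gz_bytes = gzip.compress(csv_text.encode("utf-8"))
round_tripped = gzip.decompress(gz_bytes).decode("utf-8")

# zip: every member carries a stored name, which is what gets extracted.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Mimic the reported behaviour: member named after the archive itself.
    zf.writestr("out.zip", csv_text)

with zipfile.ZipFile(buf) as zf:
    member_names = zf.namelist()

print(member_names)
```

Extracting such an archive yields a file called 'out.zip' that is really plain csv text, which is the confusing result described in this issue.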
This would be a relatively easy fix: add an arcname=None parameter to to_csv, pass it through pandas.io.formats.csvs.CSVFormatter to pandas.io.formats.csvs._get_handle, and use it instead of ZipFile.filename if provided.
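Until something like that exists, a possible workaround is to build the archive by hand with the standard-library zipfile module. This is only a sketch, not the proposed pandas change: the helper names write_csv_to_zip and read_member_names are hypothetical, and the helper takes the csv text as a string so that it does not depend on pandas internals.

```python
import zipfile


def write_csv_to_zip(csv_text, zip_path, arcname):
    """Create a zip archive at zip_path holding csv_text as member arcname.

    Hypothetical helper: arcname plays the role of the parameter
    suggested in this issue.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # writestr takes the member name first, then the data to store.
        zf.writestr(arcname, csv_text)


def read_member_names(zip_path):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()
```

Since df.to_csv() returns the csv as a string when no path is given, the call would look like write_csv_to_zip(df.to_csv(index=False), 'out.zip', 'data.csv'), giving a zip file named 'out.zip' containing csv file 'data.csv'.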
Expected Output
See comments in Code Sample above for expected output.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-17-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.16.2
scipy: 1.2.1
pyarrow: 0.11.1
xarray: None
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.4.11
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None