BUG: read_excel throws FileNotFoundError with s3fs objects · Issue #38788 · pandas-dev/pandas


Code Sample

I get a FileNotFoundError when running the following code:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
with fs.open('s3://bucket_name/filename.xlsx') as f:
    pd.read_excel(f)  # NOTE: pd.ExcelFile(f) throws same error

Traceback (most recent call last):
  File "", line 2, in
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1062, in __init__
    ext = inspect_excel_format(
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 938, in inspect_excel_format
    with get_handle(
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/common.py", line 648, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: '<File-like object S3FileSystem, bucket_name/filename.xlsx>'

Problem description

I should be able to pass the file-like object from s3fs to pd.read_excel or pd.ExcelFile. pandas 1.1.x allows this, but changes to pd.io.common.get_handle in 1.2 appear to have broken it. The simple workaround is to pass the S3 URI directly instead of opening it with s3fs first, but to my knowledge, support for s3fs objects in read_excel was not meant to be dropped in 1.2.
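A minimal sketch of the workaround described above, plus one alternative. The helper names `read_excel_from_uri` and `to_bytes_buffer` are hypothetical. The in-memory buffer route is an assumption (that a plain `io.BytesIO` is not affected by the regression), not something confirmed in this report.

```python
import io


def to_bytes_buffer(fileobj):
    """Copy an open binary file-like object into an in-memory buffer.

    Assumption: pandas accepts a plain BytesIO even where the s3fs
    object itself fails.
    """
    return io.BytesIO(fileobj.read())


def read_excel_from_uri(uri, **kwargs):
    """Workaround from the report: give pandas the S3 URI, not an s3fs object."""
    import pandas as pd  # lazy import; needs pandas plus s3fs installed

    # pandas opens the remote object itself when given a URI string.
    return pd.read_excel(uri, **kwargs)
```

Usage would look like `read_excel_from_uri('s3://bucket_name/filename.xlsx')`, or `pd.read_excel(to_bytes_buffer(f))` inside the `with fs.open(...)` block.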

My noob guess on what's going wrong

I'm new to contributing to open source projects, so I don't know exactly how to fix this, but it looks like the pd.io.common.get_handle function in 1.2 treats the s3fs object as a path to open (the traceback ends in open(handle, ioargs.mode)) rather than as an already-open file-like buffer. To solve this, I would think something similar to the need_text_wrapping boolean option from get_handle in 1.1.x needs to be added to 1.2's get_handle, in order to tell pandas that the s3fs object needs a TextIOWrapper rather than being opened like a local file.
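The guess above can be illustrated with a small stand-alone sketch. This is not pandas' actual `get_handle`. `FakeRemoteFile`, `path_first_handle`, and `buffer_first_handle` are invented for illustration, and whether the real s3fs object reaches `open()` through `__fspath__` is an assumption.

```python
import io
import os


class FakeRemoteFile(io.BytesIO):
    """Hypothetical stand-in for an already-open remote file object."""

    def __fspath__(self):
        # Assumption: the object can be converted to a path-like string.
        return "<File-like object FakeFS, bucket_name/filename.xlsx>"


def path_first_handle(path_or_buf, mode="rb"):
    """Checks for a path before checking for a buffer (shows the failure)."""
    if isinstance(path_or_buf, (str, os.PathLike)):
        # The string form of the object is not a real local path.
        return open(os.fspath(path_or_buf), mode)
    return path_or_buf


def buffer_first_handle(path_or_buf, mode="rb"):
    """Returns already-open buffers untouched; only real paths reach open()."""
    if hasattr(path_or_buf, "read"):
        return path_or_buf
    return open(os.fspath(path_or_buf), mode)


remote = FakeRemoteFile(b"spreadsheet bytes")

try:
    path_first_handle(remote)
except FileNotFoundError as exc:
    print("path-first check failed:", exc)

# The buffer-first variant hands back the same open object.
print(buffer_first_handle(remote) is remote)
```

The point of the sketch is only the order of the checks: testing for a readable buffer before treating the argument as a path avoids calling the built-in `open()` on an object that is already open.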

If someone could give me a little guidance on how to fix this, I'd be happy to give my first open-source contribution a go, but if that's not really how this works, I understand.

Expected Output

<class 'pandas.core.frame.DataFrame'>

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-197-generic
Version : #229-Ubuntu SMP Wed Nov 25 11:05:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 0.5.2
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None