ENH read_excel error when accessing AWS S3 URL · Issue #11447 · pandas-dev/pandas

Summary: read_excel cannot read a file from S3 using the same URL syntax that read_csv accepts. read_excel should support accessing S3 data in the same manner as read_csv.

read_excel fails with the following error:

import pandas as pd
df = pd.read_excel("s3://my-bucket/my_file.xlsx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python2.6/site-packages/pandas/io/excel.py", line 163, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/usr/local/lib64/python2.6/site-packages/pandas/io/excel.py", line 206, in __init__
    self.book = xlrd.open_workbook(io)
  File "/usr/local/lib/python2.6/site-packages/xlrd/__init__.py", line 394, in open_workbook
    f = open(filename, "rb")
IOError: [Errno 2] No such file or directory: 's3://my-bucket/my_file.xlsx'
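The last frame of the traceback shows the URL string being handed to the built-in open(), which treats it as a local path. The sketch below reproduces only that step, without pandas or xlrd. It is written for Python 3 (the report uses Python 2.6), assumes a POSIX system with no directory named "s3:" in the working directory, and the helper name try_open_as_local_path is made up for illustration.

```python
import errno


def try_open_as_local_path(path):
    """Open `path` as a local file; return the errno on failure, else None."""
    try:
        with open(path, "rb"):
            return None
    except OSError as exc:  # IOError is an alias of OSError on Python 3
        return exc.errno


# The URL is read as a relative path below a directory called "s3:",
# so the lookup fails the same way the xlrd call in the traceback does.
result = try_open_as_local_path("s3://my-bucket/my_file.xlsx")
print(result == errno.ENOENT)
```

Nothing in this path ever inspects the "s3://" scheme, which is consistent with the error message quoting the URL as if it were a file name.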

read_csv on the other hand is able to successfully read a csv file in the same S3 bucket using the same URL syntax:

import pandas as pd
df = pd.read_csv("s3://my-bucket/my_file.csv")
len(df.index)
1187

For the record, read_csv can also reach the xlsx file on S3, but it returns parse errors when attempting to tokenize the data:

import pandas as pd
df = pd.read_csv("s3://my-bucket/my_file.xlsx")
Exception pandas.parser.CParserError: CParserError('Error tokenizing data. C error: Expected 9 fields in line 210, saw 10\n',) in 'pandas.parser.TextReader._tokenize_rows' ignored
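A likely reason for the tokenizing error is that an .xlsx file is a ZIP container rather than delimited text, so a CSV parser is fed binary data. The sketch below illustrates the difference using only the standard library. The in-memory archive is a stand-in and not a valid workbook, and looks_like_zip is a hypothetical helper name.

```python
import io
import zipfile


def looks_like_zip(data):
    """Return True if the given bytes form a ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Build a tiny in-memory ZIP as a stand-in for an .xlsx file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")
archive_bytes = buf.getvalue()

# Plain delimited text, the kind of input read_csv expects.
csv_bytes = b"a,b,c\n1,2,3\n"

print(looks_like_zip(archive_bytes), looks_like_zip(csv_bytes))
```

So the fact that read_csv can fetch the object shows that the S3 transport works. The parse failure comes from the file format, not from the URL handling.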

read_excel successfully reads and parses a local copy of the xlsx file:

import pandas as pd
df = pd.read_excel("my_file.xlsx")
len(df.index)
221
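Since read_excel also accepts file-like objects, one possible workaround until S3 URLs are supported is to fetch the object's bytes separately and pass them in as an in-memory buffer. The following is a minimal sketch of that idea in Python 3, not how pandas does or should implement it. excel_source and fetch_s3_bytes are hypothetical names, and the fake fetcher stands in for whatever S3 client is actually used.

```python
import io
from urllib.parse import urlparse


def excel_source(path_or_url, fetch_s3_bytes):
    """Return a BytesIO for s3:// URLs, or the input unchanged otherwise.

    fetch_s3_bytes is a hypothetical callable (bucket, key) -> bytes
    supplied by the caller; it is not part of pandas.
    """
    parsed = urlparse(path_or_url)
    if parsed.scheme == "s3":
        bucket = parsed.netloc
        key = parsed.path.lstrip("/")
        return io.BytesIO(fetch_s3_bytes(bucket, key))
    return path_or_url


def fake_fetch(bucket, key):
    # Stand-in for a real S3 download, so the sketch runs offline.
    return ("%s/%s" % (bucket, key)).encode("ascii")


source = excel_source("s3://my-bucket/my_file.xlsx", fake_fetch)
# With a real fetcher in place: df = pd.read_excel(source)
```

Local paths pass through untouched, so the same call site would work for both the local copy and the S3 URL.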

Pandas version string and dependencies:

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.14.48-33.39.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.4
pip: 6.1.1
setuptools: 12.2
Cython: None
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None