BUG: some read_excel engines still load trailing blank cells · Issue #41167 · pandas-dev/pandas (original) (raw)


Code Sample, a copy-pastable example

pd.read_excel('trailing_blanks.xlsx').shape

Test files available here.

trailing_blanks

I loaded this sample spreadsheet in all five filetypes to see what shape is returned:

filetype pandas 1.2.x master
xls (3, 3) (3, 3)
xlsx (11, 6) (3, 6)
xlsm (11, 6) (3, 6)
xlsb (11, 6) (11, 6)
ods (3, 3) (3, 3)

Expected Output

(3, 3)

Problem description

Spreadsheet files can contain cells with no values, such as the formatted blank cells shown below. In the worst cases, the cell used to contain a value or formatting but currently contains neither and is thus invisible to the user of the spreadsheet application. The trailing rows of NaNs in xls[x|m] files were recently resolved in #39547 by @rhshadrach but the additional columns still remain. In a case of the size pictured here, it's a minor annoyance, but I've come across spreadsheets in the wild where one more cells were unintentionally created on row 2**20 or column 2**14, which can create a severe performance issues. To fix the performance issue, the empty rows/columns must be trimmed before the data is passed to TextParser.

I suspect that these bug reports are related:
#40569
#40976

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 1e91282
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-50-generic
Version : #56-Ubuntu SMP Mon Apr 12 17🔞36 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 1.3.0.dev0+1412.g1e912821c0
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.2
hypothesis : 6.1.1
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : 0.5.0
gcsfs : 0.7.1
matplotlib : 3.3.4
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : 1.0.8
s3fs : 0.5.2
scipy : 1.6.0
sqlalchemy : 1.3.23
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0