BUG: read_excel with openpyxl produces trailing rows of nan by rhshadrach · Pull Request #39547 · pandas-dev/pandas

…enpyxl_header

Conflicts: doc/source/whatsnew/v1.2.2.rst

Why is this not an upstream bug? (I'm not averse to fixing it in pandas, at least tactically.)

@jreback openpyxl docs:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file:

We could fix this issue by calling sheet.calculate_dimension(force=True) but this involves a 50% performance hit (measured on a frame with 10000 rows and 10 columns). The method used here involves a 2% performance hit. Ref: #39001 (comment)
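The cheaper approach can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in this PR: `trim_trailing_empty` is a hypothetical helper, rows are plain lists rather than openpyxl cells, and an empty cell is assumed to be represented by the empty string. It tracks the last row that holds data during a single pass, so no separate dimension scan is needed.

```python
from typing import List


def trim_trailing_empty(rows: List[list], empty="") -> List[list]:
    """Drop trailing empty cells from each row, then trailing empty rows."""
    data = []
    last_row_with_data = -1
    for row_number, row in enumerate(rows):
        trimmed = list(row)
        # Trim trailing empty cells within the row.
        while trimmed and trimmed[-1] == empty:
            trimmed.pop()
        if trimmed:
            last_row_with_data = row_number
        data.append(trimmed)
    # Keep everything up to and including the last row that has data.
    return data[: last_row_with_data + 1]
```

For example, `trim_trailing_empty([[1, ""], ["", ""], [2, 3], ["", ""]])` returns `[[1], [], [2, 3]]`: the empty row in the middle is kept, and only the trailing one is dropped.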

OK! Yeah, as long as it's robust, I'm OK with this.

data = data[: last_row_with_data + 1]

# With dimension reset, openpyxl no longer pads rows
max_width = max(len(data_row) for data_row in data)

With some xlsx files I get an error:

max() arg is an empty sequence

I am pretty sure this happens when the data object is empty.

I can't share the file, but I can tell you it starts with empty rows and empty columns.

Yep - thanks for catching this!
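One way to avoid the error is to pad only when at least one row survives trimming. The sketch below assumes the same list-of-lists shape as the snippet under review and that empty cells are empty strings; `pad_rows` is a hypothetical name for illustration, not a pandas function.

```python
from typing import List


def pad_rows(data: List[list], empty="") -> List[list]:
    """Pad every row to the width of the widest row; no-op on empty input."""
    if not data:
        # max() over an empty sequence raises ValueError, so return early.
        return data
    max_width = max(len(row) for row in data)
    return [row + [empty] * (max_width - len(row)) for row in data]
```

Passing `default=0` to `max()` would be another way to tolerate an empty sequence; the explicit guard just makes the empty-sheet case easier to see.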

…enpyxl_workbook

Conflicts: doc/source/whatsnew/v1.2.2.rst, pandas/tests/io/excel/test_openpyxl.py

into openpyxl_nans

Conflicts: doc/source/whatsnew/v1.2.2.rst, pandas/io/excel/_openpyxl.py, pandas/tests/io/excel/test_openpyxl.py

…enpyxl_nans

Conflicts: pandas/io/excel/_openpyxl.py, pandas/tests/io/excel/test_openpyxl.py

…enpyxl_nans

Conflicts: pandas/tests/io/excel/test_openpyxl.py

==================================== ERRORS ====================================
__________ ERROR collecting scripts/tests/test_validate_docstrings.py __________
ImportError while importing test module '/home/runner/work/pandas/pandas/scripts/tests/test_validate_docstrings.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/envs/pandas-dev/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
scripts/tests/test_validate_docstrings.py:5: in <module>
    import validate_docstrings
scripts/validate_docstrings.py:50: in <module>
    from numpydoc.validate import validate, Docstring  # isort:skip
E   ImportError: cannot import name 'Docstring' from 'numpydoc.validate' (/usr/share/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpydoc/validate.py)
=========================== short test summary info ============================
ERROR scripts/tests/test_validate_docstrings.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.54s ===============================

Hmm, is this a spurious issue on the CI / checks?