BUG: read_excel with openpyxl produces trailing rows of nan by rhshadrach · Pull Request #39547 · pandas-dev/pandas

rhshadrach

…enpyxl_header

Conflicts: doc/source/whatsnew/v1.2.2.rst

@rhshadrach

@jreback

why is this not an upstream bug? (not averse to fixing in pandas at least tactically).

@rhshadrach

@jreback openpyxl docs:

Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file:

We could fix this issue by calling sheet.calculate_dimension(force=True) but this involves a 50% performance hit (measured on a frame with 10000 rows and 10 columns). The method used here involves a 2% performance hit. Ref: #39001 (comment)
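To illustrate the cheaper approach mentioned above, here is a minimal sketch of trimming trailing empty rows in a single pass. It is a simplified stand-in rather than the PR's code: `trim_trailing_empty_rows` is a hypothetical helper, and it assumes the rows have already been converted to plain Python lists in which an empty cell is the empty string `""`.

```python
from typing import List


def trim_trailing_empty_rows(rows: List[list]) -> List[list]:
    """Drop every row after the last one that holds a non-empty cell.

    Hypothetical helper for illustration; assumes empty cells are "".
    """
    last_row_with_data = -1
    for row_number, row in enumerate(rows):
        # A row counts as data if at least one cell is not the empty string.
        if any(cell != "" for cell in row):
            last_row_with_data = row_number
    # If no row has data, the slice is rows[:0], i.e. an empty list.
    return rows[: last_row_with_data + 1]


rows = [["a", "b"], [1, 2], ["", ""], ["", ""]]
trimmed = trim_trailing_empty_rows(rows)
```

Only trailing rows are dropped; an empty row that sits between two rows with data is kept, since it may be meaningful in the sheet.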

@rhshadrach

@jreback

ok! yeah, as long as it's robust, ok with this

Barben360

data = data[: last_row_with_data + 1]
# With dimension reset, openpyxl no longer pads rows
max_width = max(len(data_row) for data_row in data)

With some xlsx files I get an error:

max() arg is an empty sequence

I am pretty sure this happens when the data object is empty.

I can't share the file, but I can tell you it starts with empty rows and empty columns.

Yep - thanks for catching this!
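For context, the reported error is what Python's built-in `max()` raises when given an empty iterable, which happens here if `data` is empty after trimming. The sketch below shows one plausible guard. It is not necessarily the exact change made in the PR: `pad_rows` is a hypothetical helper name, and representing an empty cell as `""` is an assumption.

```python
from typing import List


def pad_rows(data: List[list], empty_cell: str = "") -> List[list]:
    """Pad every row to the width of the longest row.

    Hypothetical helper for illustration; returns early on empty input so
    that max() is never called on an empty sequence.
    """
    if len(data) == 0:
        return data
    max_width = max(len(data_row) for data_row in data)
    return [
        data_row + (max_width - len(data_row)) * [empty_cell]
        for data_row in data
    ]


# Without a guard, an empty `data` triggers the error from the report.
try:
    max(len(data_row) for data_row in [])
    unguarded_error = None
except ValueError as err:
    unguarded_error = str(err)

padded = pad_rows([["a"], ["b", "c", "d"]])
padded_empty = pad_rows([])
```

The early return keeps the common path unchanged and only short-circuits when no row survives trimming.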

@rhshadrach

…enpyxl_workbook

Conflicts: doc/source/whatsnew/v1.2.2.rst, pandas/tests/io/excel/test_openpyxl.py

@rhshadrach

into openpyxl_nans

Conflicts: doc/source/whatsnew/v1.2.2.rst, pandas/io/excel/_openpyxl.py, pandas/tests/io/excel/test_openpyxl.py

@rhshadrach

…enpyxl_nans

Conflicts: pandas/io/excel/_openpyxl.py, pandas/tests/io/excel/test_openpyxl.py

@rhshadrach

@jreback

@simonjayhawkins

@rhshadrach

…enpyxl_nans

Conflicts: pandas/tests/io/excel/test_openpyxl.py

@rhshadrach

@rhshadrach

@simonjayhawkins

@jreback

==================================== ERRORS ====================================
__________ ERROR collecting scripts/tests/test_validate_docstrings.py __________
ImportError while importing test module '/home/runner/work/pandas/pandas/scripts/tests/test_validate_docstrings.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/envs/pandas-dev/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
scripts/tests/test_validate_docstrings.py:5: in <module>
    import validate_docstrings
scripts/validate_docstrings.py:50: in <module>
    from numpydoc.validate import validate, Docstring  # isort:skip
E   ImportError: cannot import name 'Docstring' from 'numpydoc.validate' (/usr/share/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpydoc/validate.py)
=========================== short test summary info ============================
ERROR scripts/tests/test_validate_docstrings.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.54s ===============================

hmm, is this a spurious issue on the CI / checks?

@jorisvandenbossche

numpydoc failure is also happening on master, so unrelated

@simonjayhawkins

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request

Feb 8, 2021

@rhshadrach @meeseeksmachine

jreback pushed a commit that referenced this pull request

Feb 8, 2021

@meeseeksmachine @rhshadrach

…ows of nan (#39679)

Co-authored-by: Richard Shadrach 45562402+rhshadrach@users.noreply.github.com

This was referenced

Apr 26, 2021