BUG: read_excel with openpyxl produces trailing rows of nan by rhshadrach · Pull Request #39547 · pandas-dev/pandas
…enpyxl_header
Conflicts: doc/source/whatsnew/v1.2.2.rst
why is this not an upstream bug? (not averse to fixing in pandas at least tactically).
Read-only mode relies on applications and libraries that created the file providing correct information about the worksheets, specifically the used part of it, known as the dimensions. Some applications set this incorrectly. You can check the apparent dimensions of a worksheet using ws.calculate_dimension(). If this returns a range that you know is incorrect, say A1:A1 then simply resetting the max_row and max_column attributes should allow you to work with the file:
We could fix this issue by calling sheet.calculate_dimension(force=True), but this involves a 50% performance hit (measured on a frame with 10000 rows and 10 columns). The method used here involves a 2% performance hit. Ref: #39001 (comment)
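For readers unfamiliar with the openpyxl side of this, here is a minimal sketch of the dimension-reset approach that the quoted documentation alludes to. It is not the code from this PR. The function names `rows_ignoring_stored_dimensions` and `read_first_sheet` are hypothetical, and `openpyxl` is imported lazily on the assumption that it is installed; `reset_dimensions()` and `iter_rows(values_only=True)` are methods of openpyxl's read-only worksheets.

```python
def rows_ignoring_stored_dimensions(ws):
    """Return every row of a read-only worksheet as a list of values.

    `ws` is assumed to be an openpyxl read-only worksheet. The dimensions
    recorded in the file are discarded first, so iteration is driven by the
    cells that are actually present rather than by possibly wrong metadata.
    """
    ws.reset_dimensions()
    return [list(row) for row in ws.iter_rows(values_only=True)]


def read_first_sheet(path):
    """Hypothetical convenience wrapper: read the first sheet of `path`."""
    # Lazy import: openpyxl is an optional dependency in this sketch.
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True)
    try:
        return rows_ignoring_stored_dimensions(wb.worksheets[0])
    finally:
        # Read-only workbooks keep the file handle open until closed.
        wb.close()
```

After a reset, rows may come back with different lengths and trailing empty rows may still be present, so the caller likely has to trim and pad the result itself.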
ok! yeah, as long as it's robust I'm ok with this
data = data[: last_row_with_data + 1]
# With dimension reset, openpyxl no longer pads rows
max_width = max(len(data_row) for data_row in data)
With some xlsx files I get an error:
max() arg is an empty sequence
I am pretty sure this happens when the data object is empty.
I can't share the file but I can tell it starts with empty rows and empty columns.
Yep - thanks for catching this!
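To illustrate the failure mode reported in this thread, here is a minimal pure-Python sketch of trimming trailing empty rows and padding the rest, with a guard for the case where nothing is left after trimming. It is a sketch under assumptions, not the PR's final code: the function name `trim_and_pad` and the `empty` parameter are hypothetical, and empty cells are assumed to be represented as `""`.

```python
def trim_and_pad(rows, empty=""):
    """Drop trailing all-empty rows, then pad short rows to a common width.

    `rows` is a list of lists of cell values; `empty` marks an empty cell.
    """
    last_row_with_data = -1
    for i, row in enumerate(rows):
        if any(cell != empty for cell in row):
            last_row_with_data = i

    # Trim trailing empty rows (everything is dropped if no row had data).
    data = [list(row) for row in rows[: last_row_with_data + 1]]

    # Guard: max() over an empty sequence raises ValueError.
    if not data:
        return data

    max_width = max(len(row) for row in data)
    return [row + [empty] * (max_width - len(row)) for row in data]
```

For example, `trim_and_pad([["", ""], [""]])` returns `[]` instead of raising, which is the situation described above for sheets that contain only empty cells.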
…enpyxl_workbook
Conflicts: doc/source/whatsnew/v1.2.2.rst, pandas/tests/io/excel/test_openpyxl.py
Conflicts: doc/source/whatsnew/v1.2.2.rst, pandas/io/excel/_openpyxl.py, pandas/tests/io/excel/test_openpyxl.py
…enpyxl_nans
Conflicts: pandas/io/excel/_openpyxl.py, pandas/tests/io/excel/test_openpyxl.py
…enpyxl_nans
Conflicts: pandas/tests/io/excel/test_openpyxl.py
==================================== ERRORS ====================================
__________ ERROR collecting scripts/tests/test_validate_docstrings.py __________
ImportError while importing test module '/home/runner/work/pandas/pandas/scripts/tests/test_validate_docstrings.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/share/miniconda/envs/pandas-dev/lib/python3.8/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
scripts/tests/test_validate_docstrings.py:5: in <module>
import validate_docstrings
scripts/validate_docstrings.py:50: in <module>
from numpydoc.validate import validate, Docstring # isort:skip
E ImportError: cannot import name 'Docstring' from 'numpydoc.validate' (/usr/share/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpydoc/validate.py)
=========================== short test summary info ============================
ERROR scripts/tests/test_validate_docstrings.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.54s ===============================
hmm, is this a spurious issue on the CI / checks?
numpydoc failure is also happening on master, so unrelated
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request
jreback pushed a commit that referenced this pull request
…ows of nan (#39679)
Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>
This was referenced
Apr 26, 2021