CLN: consider deprecating convert_floats from read_excel · Issue #41127 · pandas-dev/pandas
As the docs explain, the binary spreadsheet formats (xls, xlsb) store all numbers as floats, so by default pandas converts floats to integers when no information is lost (1.0 --> 1). "You can pass `convert_float=False` to disable this behavior, which may give a slight performance improvement." I tested this on four file types with a spreadsheet of ~440,000 cells and recorded the best time out of 10 repetitions for each:
File type | convert_float=True | convert_float=False | Speedup |
---|---|---|---|
xls | 1.081 | 1.036 | 4.2% |
xlsb | 3.413 | 3.357 | 1.6% |
ods | 27.798 | 27.770 | 0.1% |
xlsx | 5.182 | 5.189 | -0.1% |
`convert_float` was probably written for the benefit of .xls files, but the benefit is minor. The .xlsx files even show a slight penalty, because openpyxl already converts floats to int where possible, so pandas has to convert them back to float when `convert_float=False`.
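For readers unfamiliar with the behaviour under discussion, here is a minimal sketch of the per-cell conversion. The helper `convert_cell` is a hypothetical name used for illustration only; it is not the code pandas uses in its engines, just the same idea in plain Python.

```python
def convert_cell(value, convert_float=True):
    """Return an int when a float cell holds a whole number, else the value unchanged.

    Hypothetical helper for illustration; not taken from pandas.
    """
    # float.is_integer() is False for NaN and infinities, so those pass through.
    if convert_float and isinstance(value, float) and value.is_integer():
        return int(value)
    return value


cells = [1.0, 2.5, "text", 3.0]
converted = [convert_cell(c) for c in cells]                      # whole floats become ints
untouched = [convert_cell(c, convert_float=False) for c in cells]  # floats stay floats
```

With the flag off, every value is returned as-is, which is why any saving can only come from skipping this one check per numeric cell.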
Since .xlsx files are now the most common spreadsheet format (citation: Google search), and `convert_float` only exists for performance, is it time to remove `convert_float`? The spreadsheet engines would keep the behaviour of `convert_float=True` and the argument would be deprecated. This change would simplify all four engines, and if anybody really needs their ints as floats, they can always specify a `dtype`. Note: this possible deprecation came up in #8212 (comment) before `dtype` was finalized in `read_excel`.
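As a rough sketch of the `dtype` alternative, the snippet below round-trips a small in-memory workbook and asks for floats explicitly. It assumes pandas with openpyxl installed; the column name `count` and the helper `read_counts_as_float` are made up for the example.

```python
import io

import pandas as pd


def read_counts_as_float() -> pd.DataFrame:
    """Write whole numbers to an in-memory .xlsx file and read them back as floats."""
    buffer = io.BytesIO()
    # "count" is a hypothetical column name holding whole numbers.
    pd.DataFrame({"count": [1, 2, 3]}).to_excel(buffer, index=False, engine="openpyxl")
    buffer.seek(0)
    # dtype forces the column to float regardless of what the engine returns.
    return pd.read_excel(buffer, dtype={"count": float}, engine="openpyxl")


df = read_counts_as_float()
print(df["count"].dtype)
```

Passing `dtype` per column keeps the choice explicit at the call site, which is the replacement being suggested for anyone who relied on `convert_float=False`.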
I can work on this if the community likes the idea.