PERF: fast inf checking in to_excel by chris-b1 · Pull Request #11352 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation5 Commits1 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Adds new functions to check for infinity rather than calling np.isposinf
and np.isneginf
, which were (surprising to me) a significant drag on to_excel
.
I also added xlsxwriter
to the asv build configuration. openpyxl
is still failing, I'm assuming something do with the specific version on conda?
before after ratio
[472e6e0e] [8002555d]
156.36ms 154.16ms 0.99 packers.packers_read_excel.time_packers_read_excel
failed failed n/a packers.packers_write_excel_openpyxl.time_packers_write_excel_openpyxl
469.69ms 357.49ms 0.76 packers.packers_write_excel_xlsxwriter.time_packers_write_excel_xlsxwriter
368.96ms 270.60ms 0.73 packers.packers_write_excel_xlwt.time_packers_write_excel_xlwt
I'd rather make a bigger change to how the xls writing is done, to make it much more performant. The idea is to copy what is done for csv writing.
You chunk write, then convert the columns by blocks with to_native_types()
(you would have to expand this to for example handle the inf
conversions, it already handles the nan
and float_format
for FloatBlock
). these already return object
arrays.
Then you simply iterate and yield the ExcelCells
, but no conversion is done (as it was all done before), so should be faster/simpler.
Yeah, I had actually started looking into that - this particular change was low hanging fruit, so I thought it might make sense to do first.
There probably is some performance to be picked up still, but an awful lot of the time is spent in the actual writing steps now.
In [1]: df = pd.DataFrame({'b': np.linspace(0, 1000, 100000), 'c': np.linspace(0, 100, 100000)})
In [2]: %prun df.to_excel('temp.xlsx')
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.707 0.707 5.212 5.212 excel.py:1410(write_cells)
300000 0.699 0.000 1.773 0.000 xmlwriter.py:129(_xml_number_element)
1842 0.612 0.000 0.612 0.000 {built-in method compress}
300001 0.568 0.000 0.685 0.000 format.py:1817(_format_regular_rows)
300000 0.543 0.000 1.092 0.000 worksheet.py:482(write_number)
300002 0.542 0.000 2.490 0.000 worksheet.py:5085(_write_cell)
1 0.512 0.512 3.912 3.912 worksheet.py:4911(_write_rows)
600004/300002 0.471 0.000 2.439 0.000 worksheet.py:51(cell_wrapper)
300002 0.430 0.000 2.175 0.000 worksheet.py:349(write)
300002 0.400 0.000 0.400 0.000 {pandas.json.dumps}
1500255 0.398 0.000 0.398 0.000 {isinstance}
502071 0.390 0.000 0.390 0.000 {method 'write' of 'file' objects}
500157 0.285 0.000 0.923 0.000 codecs.py:353(write)
300002 0.267 0.000 0.267 0.000 worksheet.py:3459(_check_dimensions)
11 0.267 0.024 0.267 0.024 {method 'close' of 'file' objects}
300002 0.265 0.000 0.388 0.000 format.py:1707(_format_value)
500157 0.252 0.000 0.252 0.000 {_codecs.utf_8_encode}
400143 0.238 0.000 0.365 0.000 xmlwriter.py:180(_escape_attributes)
300003 0.207 0.000 1.280 0.000 format.py:1931(get_formatted_cells)
500157 0.188 0.000 1.111 0.000 codecs.py:692(write)
300002 0.184 0.000 0.225 0.000 excel.py:494(_conv_value)
300002 0.157 0.000 0.302 0.000 utility.py:604(supported_datetime)
300002 0.136 0.000 0.136 0.000 utility.py:37(xl_rowcol_to_cell_fast)
100001 0.134 0.000 0.507 0.000 worksheet.py:5038(_write_row)
100001 0.132 0.000 0.354 0.000 xmlwriter.py:61(_xml_start_tag_unencoded)
400167 0.127 0.000 0.127 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
300002 0.116 0.000 0.116 0.000 format.py:1644(__init__)
1 0.116 0.116 0.136 0.136 worksheet.py:4989(_calculate_spans)
300000 0.111 0.000 0.189 0.000 <string>:8(__new__)
300004 0.077 0.000 0.077 0.000 {built-in method __new__ of type object at 0x000000001E2A0670}
100037 0.059 0.000 0.238 0.000 xmlwriter.py:70(_xml_end_tag)
300000 0.054 0.000 0.054 0.000 worksheet.py:4515(_isinf)
@chris-b1 agreed. Ok rebase this (prob a whatsnew conflict). The can look at other improvements if desired.
jreback added a commit that referenced this pull request
PERF: fast inf checking in to_excel