PERF: fast inf checking in to_excel by chris-b1 · Pull Request #11352 · pandas-dev/pandas (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Adds new functions to check for infinity rather than calling np.isposinf and np.isneginf, which were (surprising to me) a significant drag on to_excel.
I also added xlsxwriter to the asv build configuration. openpyxl is still failing, I'm assuming something do with the specific version on conda?
before after ratio
[472e6e0e] [8002555d]
156.36ms 154.16ms 0.99 packers.packers_read_excel.time_packers_read_excel
failed failed n/a packers.packers_write_excel_openpyxl.time_packers_write_excel_openpyxl
469.69ms 357.49ms 0.76 packers.packers_write_excel_xlsxwriter.time_packers_write_excel_xlsxwriter
368.96ms 270.60ms 0.73 packers.packers_write_excel_xlwt.time_packers_write_excel_xlwt
I'd rather make a bigger change to how the xls writing is done, to make it much more performant. The idea is to copy what is done for csv writing.
You chunk write, then convert the columns by blocks with to_native_types() (you would have to expand this to for example handle the inf conversions, it already handles the nan and float_format for FloatBlock). these already return object arrays.
Then you simply iterate and yield the ExcelCells, but no conversion is done (as it was all done before), so should be faster/simpler.
Yeah, I had actually started looking into that - this particular change was low hanging fruit, so I thought it might make sense to do first.
There probably is some performance to be picked up still, but an awful lot of the time is spent in the actual writing steps now.
In [1]: df = pd.DataFrame({'b': np.linspace(0, 1000, 100000), 'c': np.linspace(0, 100, 100000)})
In [2]: %prun df.to_excel('temp.xlsx')
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.707 0.707 5.212 5.212 excel.py:1410(write_cells)
300000 0.699 0.000 1.773 0.000 xmlwriter.py:129(_xml_number_element)
1842 0.612 0.000 0.612 0.000 {built-in method compress}
300001 0.568 0.000 0.685 0.000 format.py:1817(_format_regular_rows)
300000 0.543 0.000 1.092 0.000 worksheet.py:482(write_number)
300002 0.542 0.000 2.490 0.000 worksheet.py:5085(_write_cell)
1 0.512 0.512 3.912 3.912 worksheet.py:4911(_write_rows)
600004/300002 0.471 0.000 2.439 0.000 worksheet.py:51(cell_wrapper)
300002 0.430 0.000 2.175 0.000 worksheet.py:349(write)
300002 0.400 0.000 0.400 0.000 {pandas.json.dumps}
1500255 0.398 0.000 0.398 0.000 {isinstance}
502071 0.390 0.000 0.390 0.000 {method 'write' of 'file' objects}
500157 0.285 0.000 0.923 0.000 codecs.py:353(write)
300002 0.267 0.000 0.267 0.000 worksheet.py:3459(_check_dimensions)
11 0.267 0.024 0.267 0.024 {method 'close' of 'file' objects}
300002 0.265 0.000 0.388 0.000 format.py:1707(_format_value)
500157 0.252 0.000 0.252 0.000 {_codecs.utf_8_encode}
400143 0.238 0.000 0.365 0.000 xmlwriter.py:180(_escape_attributes)
300003 0.207 0.000 1.280 0.000 format.py:1931(get_formatted_cells)
500157 0.188 0.000 1.111 0.000 codecs.py:692(write)
300002 0.184 0.000 0.225 0.000 excel.py:494(_conv_value)
300002 0.157 0.000 0.302 0.000 utility.py:604(supported_datetime)
300002 0.136 0.000 0.136 0.000 utility.py:37(xl_rowcol_to_cell_fast)
100001 0.134 0.000 0.507 0.000 worksheet.py:5038(_write_row)
100001 0.132 0.000 0.354 0.000 xmlwriter.py:61(_xml_start_tag_unencoded)
400167 0.127 0.000 0.127 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
300002 0.116 0.000 0.116 0.000 format.py:1644(__init__)
1 0.116 0.116 0.136 0.136 worksheet.py:4989(_calculate_spans)
300000 0.111 0.000 0.189 0.000 <string>:8(__new__)
300004 0.077 0.000 0.077 0.000 {built-in method __new__ of type object at 0x000000001E2A0670}
100037 0.059 0.000 0.238 0.000 xmlwriter.py:70(_xml_end_tag)
300000 0.054 0.000 0.054 0.000 worksheet.py:4515(_isinf)
@chris-b1 agreed. Ok rebase this (prob a whatsnew conflict). The can look at other improvements if desired.
jreback added a commit that referenced this pull request
PERF: fast inf checking in to_excel