PERF: fast inf checking in to_excel by chris-b1 · Pull Request #11352 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Adds new functions to check for infinity rather than calling np.isposinf and np.isneginf, which were (surprising to me) a significant drag on to_excel.

I also added xlsxwriter to the asv build configuration. openpyxl is still failing, I'm assuming something do with the specific version on conda?

    before     after       ratio
  [472e6e0e] [8002555d]
   156.36ms   154.16ms      0.99  packers.packers_read_excel.time_packers_read_excel
     failed     failed       n/a  packers.packers_write_excel_openpyxl.time_packers_write_excel_openpyxl
   469.69ms   357.49ms      0.76  packers.packers_write_excel_xlsxwriter.time_packers_write_excel_xlsxwriter
   368.96ms   270.60ms      0.73  packers.packers_write_excel_xlwt.time_packers_write_excel_xlwt

I'd rather make a bigger change to how the xls writing is done, to make it much more performant. The idea is to copy what is done for csv writing.

You chunk write, then convert the columns by blocks with to_native_types() (you would have to expand this to for example handle the inf conversions, it already handles the nan and float_format for FloatBlock). these already return object arrays.

Then you simply iterate and yield the ExcelCells, but no conversion is done (as it was all done before), so should be faster/simpler.

Yeah, I had actually started looking into that - this particular change was low hanging fruit, so I thought it might make sense to do first.

There probably is some performance to be picked up still, but an awful lot of the time is spent in the actual writing steps now.

In [1]: df = pd.DataFrame({'b': np.linspace(0, 1000, 100000), 'c': np.linspace(0, 100, 100000)})

In [2]: %prun df.to_excel('temp.xlsx')


   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.707    0.707    5.212    5.212 excel.py:1410(write_cells)
   300000    0.699    0.000    1.773    0.000 xmlwriter.py:129(_xml_number_element)
     1842    0.612    0.000    0.612    0.000 {built-in method compress}
   300001    0.568    0.000    0.685    0.000 format.py:1817(_format_regular_rows)
   300000    0.543    0.000    1.092    0.000 worksheet.py:482(write_number)
   300002    0.542    0.000    2.490    0.000 worksheet.py:5085(_write_cell)
        1    0.512    0.512    3.912    3.912 worksheet.py:4911(_write_rows)
600004/300002    0.471    0.000    2.439    0.000 worksheet.py:51(cell_wrapper)
   300002    0.430    0.000    2.175    0.000 worksheet.py:349(write)
   300002    0.400    0.000    0.400    0.000 {pandas.json.dumps}
  1500255    0.398    0.000    0.398    0.000 {isinstance}
   502071    0.390    0.000    0.390    0.000 {method 'write' of 'file' objects}
   500157    0.285    0.000    0.923    0.000 codecs.py:353(write)
   300002    0.267    0.000    0.267    0.000 worksheet.py:3459(_check_dimensions)
       11    0.267    0.024    0.267    0.024 {method 'close' of 'file' objects}
   300002    0.265    0.000    0.388    0.000 format.py:1707(_format_value)
   500157    0.252    0.000    0.252    0.000 {_codecs.utf_8_encode}
   400143    0.238    0.000    0.365    0.000 xmlwriter.py:180(_escape_attributes)
   300003    0.207    0.000    1.280    0.000 format.py:1931(get_formatted_cells)
   500157    0.188    0.000    1.111    0.000 codecs.py:692(write)
   300002    0.184    0.000    0.225    0.000 excel.py:494(_conv_value)
   300002    0.157    0.000    0.302    0.000 utility.py:604(supported_datetime)
   300002    0.136    0.000    0.136    0.000 utility.py:37(xl_rowcol_to_cell_fast)
   100001    0.134    0.000    0.507    0.000 worksheet.py:5038(_write_row)
   100001    0.132    0.000    0.354    0.000 xmlwriter.py:61(_xml_start_tag_unencoded)
   400167    0.127    0.000    0.127    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   300002    0.116    0.000    0.116    0.000 format.py:1644(__init__)
        1    0.116    0.116    0.136    0.136 worksheet.py:4989(_calculate_spans)
   300000    0.111    0.000    0.189    0.000 <string>:8(__new__)
   300004    0.077    0.000    0.077    0.000 {built-in method __new__ of type object at 0x000000001E2A0670}
   100037    0.059    0.000    0.238    0.000 xmlwriter.py:70(_xml_end_tag)
   300000    0.054    0.000    0.054    0.000 worksheet.py:4515(_isinf)

@chris-b1 agreed. Ok rebase this (prob a whatsnew conflict). The can look at other improvements if desired.

jreback added a commit that referenced this pull request

Oct 17, 2015

PERF: fast inf checking in to_excel

PERF: fast inf checking in to_excel by chris-b1 · Pull Request #11352 · pandas-dev/pandas (original) (raw)

Conversation

Labels