performance regression in ewm.corr(pairwise=True)

Problem description

The deprecation of Panel in the 0.20.x releases introduced a severe performance regression in ewm.corr(pairwise=True) in a common case: calling it on a long time series (e.g. a DataFrame with 1 million rows and 6 columns). The cause is the last 3 lines of this section of core/window.py:

            # TODO: not the most efficient (perf-wise)
            # though not bad code-wise
            from pandas import Panel, MultiIndex, concat

            with warnings.catch_warnings(record=True):
                p = Panel.from_dict(results).swapaxes('items', 'major')
                if len(p.major_axis) > 0:
                    p.major_axis = arg1.columns[p.major_axis]
                if len(p.minor_axis) > 0:
                    p.minor_axis = arg2.columns[p.minor_axis]

            if len(p.items):
                result = concat(
                    [p.iloc[i].T for i in range(len(p.items))],
                    keys=p.items)

The result is converted from a Panel to a DataFrame by building one small frame per item and running concat over all of them, along an axis (the time axis) that is typically very long. Compared to the 0.19 releases, this is killing performance for me.
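For reference, the call pattern that goes through this code path can be sketched as below. This is only an illustration: the frame is scaled down from the 1 million rows mentioned above, and the column names, the random data and the `span=20` setting are my own assumptions, not values from my real workload.

```python
import numpy as np
import pandas as pd

# Illustrative sizes only: the slow case in this report is ~1 million rows.
n_rows, n_cols = 1000, 6

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((n_rows, n_cols)),
    index=pd.date_range("2000-01-01", periods=n_rows, freq="D"),
    columns=list("abcdef"),  # hypothetical column names
)

# Pairwise EWM correlation: one (n_cols x n_cols) matrix per row of df,
# returned as a single DataFrame with a two-level row index.
result = df.ewm(span=20).corr(pairwise=True)  # span=20 is an arbitrary choice
```

Timing this call with a larger `n_rows` on 0.19 versus 0.20.3 should show the difference described above.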

My solution was to replace the last 3 lines with:

                result = DataFrame(
                    p.values.reshape((p.shape[0], p.shape[1]*p.shape[2])),
                    index=p.items,
                    columns=MultiIndex.from_product((arg1.columns, arg2.columns))
                )
                result = result.stack(dropna=False)

This works for me but I'm no pandas internals expert, so perhaps this solution does not work in all cases.
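To show the idea independently of Panel, here is a minimal sketch that compares the two strategies on a plain NumPy 3-D array standing in for `p.values` (shape: time x arg1 columns x arg2 columns). The function names `pairwise_concat` and `pairwise_reshape` and the sample data are hypothetical, and the fast path here uses a transpose plus reshape rather than `stack`, so it is a variation on my patch and not the exact code above.

```python
import numpy as np
import pandas as pd


def pairwise_concat(values, times, cols1, cols2):
    """Slow path: one small transposed frame per time step, then one long concat."""
    frames = [
        pd.DataFrame(values[t], index=cols1, columns=cols2).T
        for t in range(len(times))
    ]
    return pd.concat(frames, keys=times)


def pairwise_reshape(values, times, cols1, cols2):
    """Fast path: a single transpose and reshape of the whole 3-D block."""
    n_t, n1, n2 = values.shape
    # Move the arg2 axis next to time so rows are ordered (time, arg2 column).
    flat = values.transpose(0, 2, 1).reshape(n_t * n2, n1)
    index = pd.MultiIndex.from_product([times, cols2])
    return pd.DataFrame(flat, index=index, columns=cols1)


rng = np.random.default_rng(0)
times = pd.date_range("2000-01-01", periods=5)
cols1 = pd.Index(["a", "b", "c"])
cols2 = pd.Index(["x", "y"])
values = rng.standard_normal((len(times), len(cols1), len(cols2)))

slow = pairwise_concat(values, times, cols1, cols2)
fast = pairwise_reshape(values, times, cols1, cols2)
```

Both functions produce rows indexed by (time, arg2 column) and columns labelled by arg1 columns, but the second never creates a per-row DataFrame, which is where the time goes when the time axis is long.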

I'd really appreciate getting a workaround in, though. Judging by the TODO comment, the developers were at least aware this was a performance issue when they added the code. I'm happy to rework the above if it has problems.

Output of `pd.show_versions()`


INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 26.1.1
Cython: 0.26
numpy: 1.13.3
scipy: 0.19.1
xarray: 0.8.2
IPython: 6.0.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
