PERF: Make DateOffset (partially) immutable, cache _params for eq speedup by jbrockmendel · Pull Request #17137 · pandas-dev/pandas (original) (raw)
There may be a problem. A lot of cross-design appears to have built up over the years including at least two approaches to caching.
It looks like the original author had caching in mind, but it got dropped here. There is a _cacheable
attribute that is False
everywhere except one unused mixin class, and a _should_cache
that is referenced over in core.indexes.datetimes
but is apparently always False
.
AFAICT the idea in tseries.frequencies.get_offset
is that the user calls e.g. get_offset(B)
, gets a copy of the cached <BusinessDay>
offset, and then can modify if desired. My approach of making DateOffset
immutable would break that.
The motivation is that __eq__
is currently very slow. It is slow because it re-evaluates self._params()
on every call. I'm advocating caching self._params()
. Doing this requires ensuring that cache remains valid, so freezing all the relevant attributes. But those attributes are blacklist-based instead of whitelist-based:
def _params(self):
all_paras = dict(list(vars(self).items()) + list(self.kwds.items()))
if 'holidays' in all_paras and not all_paras['holidays']:
all_paras.pop('holidays')
exclude = ['kwds', 'name', 'normalize', 'calendar']
attrs = [(k, v) for k, v in all_paras.items()
if (k not in exclude) and (k[0] != '_')]
attrs = sorted(set(attrs))
params = tuple([str(self.__class__)] + attrs)
return params
(There is a TODO note further down suggesting this should be changed:
# TODO: Combine this with DateOffset by defining a whitelisted set of
# attributes on each object rather than the existing behavior of iterating
# over internal ``__dict__``
)
Making _params
cacheable for the general case would require basically deprecating the self.kwds
attribute, which is I'm reticent to do as users may be using it in creative ways. My inclination is to have _params
defer to a _cached_params
iff len(self.kwds) == 0
. Then __setattr__
can refuse to change a handful of attributes, but it doesn't need to be an exhaustive list.
I'm still not sure if this approach would play nicely with the frequencies.get_offset
cache.