Custom-business-days offsets very slow · Issue #6584 · pandas-dev/pandas

Custom business day offsets are currently significantly slower (by roughly a factor of 4) than pd.offsets.BusinessDay(), even when no custom business days are specified:

import datetime as dt

import pandas as pd

date = dt.datetime(2011,1,1)
cday = pd.offsets.CustomBusinessDay()
%timeit date + pd.offsets.BusinessDay() #  6.59 µs
%timeit date + cday                     #  26.1 µs

Profiling pd.offsets.CustomBusinessDay.apply shows that only around 13% of the time is spent in np.busday_offset. The majority of time is spent casting the dates from datetime to datetime64 etc.
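A rough way to see this split outside pandas is to time the conversions and the numpy call separately. The sketch below is only an approximation under stated assumptions: it does not run the pandas code path or reproduce the 13% figure, and the helper names (`to_datetime64`, `offset_only`, `from_datetime64`) are invented for illustration.

```python
import datetime as dt
import timeit

import numpy as np

date = dt.datetime(2011, 1, 1)
date64 = np.datetime64(date.date())  # datetime64[D], converted once up front


def to_datetime64():
    # datetime.datetime -> datetime64[D]
    return np.datetime64(date.date())


def offset_only():
    # the numpy call itself, on an already-converted value
    return np.busday_offset(date64, 1, roll='backward')


def from_datetime64(value=offset_only()):
    # datetime64[D] -> datetime.datetime, restoring the original time of day
    return dt.datetime.combine(value.item(), date.time())


if __name__ == '__main__':
    number = 100000
    for func in (to_datetime64, offset_only, from_datetime64):
        seconds = timeit.timeit(func, number=number)
        print('%-16s %.2f us per call' % (func.__name__, seconds / number * 1e6))
```

The absolute numbers depend on the machine and the numpy version, so only the relative cost of the three steps is meaningful.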

I'm not very familiar with the code, but one idea would be to work with datetime objects by default and stick with them as far as possible. The method could then look something like this:

    def apply(self, other):
        if not isinstance(other, dt.datetime):
            other = other.astype(dt.datetime)

        if self.n <= 0:
            roll = 'forward'
        else:
            roll = 'backward'

        dt_str = other.strftime('%Y-%m-%d')
        result = np.busday_offset(dt_str,self.n,roll=roll,busdaycal=self.busdaycalendar)
        result = result.astype(dt.datetime)
        if not self.normalize:
            result = dt.datetime.combine(result,other.time())

        if self.offset:
            result = result + self.offset

        return result

While this might not be a perfect comparison because I left out some conversion code, the changes yield a sizable speedup.

%timeit date + cday      # 14.9 µs
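For experimenting with this idea outside pandas, a standalone version might look like the sketch below. The function name `add_business_days` and its parameters are hypothetical and not part of pandas. It mirrors the roll logic of the method above but always returns a `datetime.datetime`, which is an assumption about the desired behaviour when normalizing.

```python
import datetime as dt

import numpy as np


def add_business_days(other, n, busdaycal=None, normalize=False):
    """Shift a datetime.datetime by n business days using numpy only."""
    if busdaycal is None:
        busdaycal = np.busdaycalendar()  # default Mon-Fri week, no holidays

    # Same roll rule as in the method above
    roll = 'forward' if n <= 0 else 'backward'

    day = np.busday_offset(other.strftime('%Y-%m-%d'), n,
                           roll=roll, busdaycal=busdaycal)
    result = day.item()  # datetime64[D] -> datetime.date

    # Keep the original time of day unless normalizing to midnight
    time_of_day = dt.time() if normalize else other.time()
    return dt.datetime.combine(result, time_of_day)


saturday = dt.datetime(2011, 1, 1, 9, 30)
print(add_business_days(saturday, 1))   # Monday after the weekend
print(add_business_days(saturday, -1))  # Friday before the weekend
```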

Ultimately I would like to have a UsBday offset. The code looks like this:

import datetime as dt

import numpy as np
import pandas as pd

class UsBday(object):
    def __init__(self):
        self.gen_cal()

    def gen_cal(self):
        holidays = []
        for hlday in ['new_years','mlk_day',
                      'presidents_day','good_friday',
                      'memorial_day','independence_day',
                      'labor_day','thanksgiving','xmas_holiday']:
            hlday_func = getattr(self,hlday)
            tmp_holidays = [ hlday_func(year) for year in 
                             range(1950,dt.datetime.today().year+2) ]
            holidays.extend(tmp_holidays)
        self.cal = np.busdaycalendar(holidays=holidays)
        self.bday_offset = pd.offsets.CustomBusinessDay(holidays=holidays)

    def nxt_bday(self,date):
        # `date` rather than `dt`, so the datetime module alias is not shadowed
        nxt_bday = np.busday_offset(
            date.strftime('%Y-%m-%d'),offsets=0,
            roll='forward',busdaycal=self.cal)
        return nxt_bday

    # - holiday definitions - #
    @staticmethod
    def new_years(year):
        cand = dt.datetime(year,1,1)
        dt_str = cand.strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str,offsets=0,roll='forward')
        return res

    @staticmethod
    def mlk_day(year):
        # third monday in January
        dt_str = dt.datetime(year,1,1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, 2, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def presidents_day(year):
        # third monday February
        dt_str = dt.datetime(year,2,1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, 2, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def good_friday(year):
        from dateutil.easter import easter
        easter_sun = easter(year)
        gr_fr = easter_sun - 2 * pd.offsets.Day()
        return gr_fr.strftime('%Y-%m-%d')

    @staticmethod
    def memorial_day(year):
        # final Monday of May: roll forward to the first Monday on or
        # after June 1, then step back one Monday
        dt_str = dt.datetime(year,6,1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, -1, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def independence_day(year):
        # July 4th
        cand = dt.datetime(year,7,4).strftime('%Y-%m-%d')
        return cand

    @staticmethod
    def labor_day(year):
        # first Monday in September
        res = np.busday_offset(str(year) + '-09',offsets=0,
                               roll='forward',weekmask='Mon')
        return res

    @staticmethod
    def thanksgiving(year):
        # fourth Thursday in November: first Thursday plus three more
        res = np.busday_offset(str(year) + '-11',offsets=3,
                               roll='forward',weekmask='Thu')
        return res

    @staticmethod
    def xmas_holiday(year):
        cand = dt.datetime(year,12,25)
        while not np.is_busday(cand.strftime('%Y-%m-%d')):
            if cand.weekday() == 6:
                cand += pd.offsets.BDay()
            elif cand.weekday() == 5:
                cand -= pd.offsets.BDay()
            else:
                cand += pd.offsets.BDay()
        return cand.strftime('%Y-%m-%d')

us_bday = UsBday()
us_offset = us_bday.bday_offset
%timeit date + us_offset       # 26.1 µs

# Using numpy directly 
%timeit us_bday.nxt_bday(date) # 7.85 µs

My first guess when I noticed that custom business days are slower was that the large list of holidays passed to numpy was to blame. The timings at the end of the code block, however, show that adding a custom business day with a realistic holiday list barely changes performance. The main speed difference comes from the way pandas interfaces with numpy, not from numpy itself, so this is a pandas issue.

I know that CustomBusinessDay is still experimental. I hope this feedback helps improve it, because I think it is an important feature in pandas. It would also be nice to ship certain custom calendars, for example one for the US, directly with pandas.
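As a starting point for such a calendar, the floating US holidays can be generated with two small numpy helpers. This is only a sketch: `nth_weekday`, `last_weekday` and `us_floating_holidays` are hypothetical names, and the fixed-date holidays and Good Friday from the class above are left out for brevity.

```python
import numpy as np


def nth_weekday(year, month, n, weekmask):
    """n-th (1-based) occurrence of the weekday in `weekmask` within a month."""
    start = np.datetime64('%04d-%02d-01' % (year, month))
    # Roll forward to the first matching weekday, then step n - 1 further
    return np.busday_offset(start, n - 1, roll='forward', weekmask=weekmask)


def last_weekday(year, month, weekmask):
    """Last occurrence of the weekday in `weekmask` within a month."""
    if month == 12:
        year, month = year + 1, 0
    start = np.datetime64('%04d-%02d-01' % (year, month + 1))
    # First matching weekday of the following month, minus one occurrence
    return np.busday_offset(start, -1, roll='forward', weekmask=weekmask)


def us_floating_holidays(year):
    return np.array([
        nth_weekday(year, 1, 3, 'Mon'),   # Martin Luther King Jr. Day
        nth_weekday(year, 2, 3, 'Mon'),   # Presidents' Day
        last_weekday(year, 5, 'Mon'),     # Memorial Day
        nth_weekday(year, 9, 1, 'Mon'),   # Labor Day
        nth_weekday(year, 11, 4, 'Thu'),  # Thanksgiving
    ], dtype='datetime64[D]')


holidays = np.concatenate([us_floating_holidays(y) for y in range(2010, 2016)])
cal = np.busdaycalendar(holidays=holidays)

# The business day after the Wednesday before Thanksgiving 2014
print(np.busday_offset('2014-11-26', 1, busdaycal=cal))
```

The resulting `holidays` array can be passed to `np.busdaycalendar` as shown, or to `pd.offsets.CustomBusinessDay(holidays=...)` in the same way the class above passes its holiday list.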