Custom-business-days offsets very slow · Issue #6584 · pandas-dev/pandas
Custom business day offsets are currently significantly slower (by roughly a factor of 4) than pd.offsets.BusinessDay(), even when no custom holidays or weekmask are specified:
import datetime as dt
import pandas as pd
date = dt.datetime(2011,1,1)
cday = pd.offsets.CustomBusinessDay()
%timeit date + pd.offsets.BusinessDay() # 6.59 µs
%timeit date + cday # 26.1 µs
Profiling pd.offsets.CustomBusinessDay.apply shows that only around 13% of the time is spent in np.busday_offset. The majority of time is spent casting the dates from datetime to datetime64 etc.
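To get a rough feel for how much the conversions cost relative to the NumPy call itself, something like the following could be used. It is only a sketch outside of pandas: the helper names `offset_only` and `offset_with_round_trip` are made up for illustration, and the round trip shown here is a simplified stand-in for the conversions pandas performs, not the actual pandas code path.

```python
import datetime as dt
import timeit

import numpy as np

date = dt.datetime(2011, 1, 1)
date64 = np.datetime64("2011-01-01")  # already a day-resolution datetime64


def offset_only():
    # Pure NumPy call on a value that needs no conversion.
    return np.busday_offset(date64, 1, roll="backward")


def offset_with_round_trip():
    # Same call, plus datetime -> datetime64[D] -> datetime conversions.
    as_day = np.datetime64(date, "D")
    shifted = np.busday_offset(as_day, 1, roll="backward")
    # astype(object) yields a datetime.date for day-resolution values.
    return dt.datetime.combine(shifted.astype(object), date.time())


if __name__ == "__main__":
    runs = 20_000
    for fn in (offset_only, offset_with_round_trip):
        per_call = timeit.timeit(fn, number=runs) / runs
        print(f"{fn.__name__}: {per_call * 1e6:.2f} µs per call")
```

The absolute numbers depend on the machine and on the NumPy version, so only the ratio between the two timings is meaningful.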
I'm not so familiar with the code but one idea would be to work in datetime by default and try to stick with it as much as possible. The method could then look something like this:
def apply(self, other):
    # Work with datetime.datetime wherever possible
    if not isinstance(other, dt.datetime):
        other = other.astype(dt.datetime)
    if self.n <= 0:
        roll = 'forward'
    else:
        roll = 'backward'
    dt_str = other.strftime('%Y-%m-%d')
    result = np.busday_offset(dt_str, self.n, roll=roll,
                              busdaycal=self.busdaycalendar)
    result = result.astype(dt.datetime)
    if not self.normalize:
        # Re-attach the original time of day
        result = dt.datetime.combine(result, other.time())
    if self.offset:
        result = result + self.offset
    return result
While this might not be a perfect comparison because I left out some conversion code, the changes yield a sizable speedup.
%timeit date + cday # 14.9 µs
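For anyone who wants to try the idea outside of pandas, here is a standalone sketch of the same approach. The function name `add_business_days` and its parameters are hypothetical and not part of pandas. Instead of going through `strftime`, it builds the `datetime64` directly from the date, which is a small deviation from the method above.

```python
import datetime as dt

import numpy as np


def add_business_days(other, n, busdaycal=None, normalize=False):
    """Shift the datetime `other` by `n` business days (hypothetical helper)."""
    if busdaycal is None:
        busdaycal = np.busdaycalendar()  # default Mon-Fri week, no holidays
    # Same roll rule as in the method above: for n > 0 a non-business start
    # date is rolled backward first, so n=1 lands on the next business day.
    roll = "forward" if n <= 0 else "backward"
    day = np.datetime64(other.date(), "D")
    shifted = np.busday_offset(day, n, roll=roll, busdaycal=busdaycal)
    result = shifted.astype(object)  # datetime.date at day resolution
    # Keep the original time of day unless normalization is requested.
    time = dt.time() if normalize else other.time()
    return dt.datetime.combine(result, time)


if __name__ == "__main__":
    cal = np.busdaycalendar(holidays=["2011-01-03"])
    print(add_business_days(dt.datetime(2011, 1, 1, 9, 30), 1))
    print(add_business_days(dt.datetime(2011, 1, 1, 9, 30), 1, busdaycal=cal))
```

Timezone handling and the extra `offset` timedelta are deliberately left out here, so this is not a drop-in replacement for the pandas method.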
Ultimately I would like to have a UsBday offset. The code looks like this:
import datetime as dt
import numpy as np
import pandas as pd

class UsBday(object):

    def __init__(self):
        self.gen_cal()

    def gen_cal(self):
        # Collect all holidays from 1950 up to and including next year
        holidays = []
        for hlday in ['new_years', 'mlk_day',
                      'presidents_day', 'good_friday',
                      'memorial_day', 'independence_day',
                      'labor_day', 'thanksgiving', 'xmas_holiday']:
            hlday_func = getattr(self, hlday)
            tmp_holidays = [hlday_func(year) for year in
                            range(1950, dt.datetime.today().year + 2)]
            holidays.extend(tmp_holidays)
        self.cal = np.busdaycalendar(holidays=holidays)
        self.bday_offset = pd.offsets.CustomBusinessDay(holidays=holidays)

    def nxt_bday(self, date):
        # Roll forward to the next business day using numpy directly
        nxt_bday = np.busday_offset(
            date.strftime('%Y-%m-%d'), offsets=0,
            roll='forward', busdaycal=self.cal)
        return nxt_bday

    # - holiday definitions - #
    @staticmethod
    def new_years(year):
        # January 1st, rolled forward to the next weekday
        cand = dt.datetime(year, 1, 1)
        dt_str = cand.strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, offsets=0, roll='forward')
        return res

    @staticmethod
    def mlk_day(year):
        # third Monday in January
        dt_str = dt.datetime(year, 1, 1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, 2, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def presidents_day(year):
        # third Monday in February
        dt_str = dt.datetime(year, 2, 1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, 2, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def good_friday(year):
        from dateutil.easter import easter
        easter_sun = easter(year)
        gr_fr = easter_sun - 2 * pd.offsets.Day()
        return gr_fr.strftime('%Y-%m-%d')

    @staticmethod
    def memorial_day(year):
        # final Monday of May: first Monday on or after June 1st, minus one
        dt_str = dt.datetime(year, 6, 1).strftime('%Y-%m-%d')
        res = np.busday_offset(dt_str, -1, roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def independence_day(year):
        # July 4th
        cand = dt.datetime(year, 7, 4).strftime('%Y-%m-%d')
        return cand

    @staticmethod
    def labor_day(year):
        # first Monday in September
        res = np.busday_offset(str(year) + '-09', offsets=0,
                               roll='forward', weekmask='Mon')
        return res

    @staticmethod
    def thanksgiving(year):
        # fourth Thursday in November: first Thursday plus three
        res = np.busday_offset(str(year) + '-11', offsets=3,
                               roll='forward', weekmask='Thu')
        return res

    @staticmethod
    def xmas_holiday(year):
        # December 25th; Sunday moves to Monday, Saturday moves to Friday
        cand = dt.datetime(year, 12, 25)
        while not np.is_busday(cand.strftime('%Y-%m-%d')):
            if cand.weekday() == 6:
                cand += pd.offsets.BDay()
            elif cand.weekday() == 5:
                cand -= pd.offsets.BDay()
            else:
                cand += pd.offsets.BDay()
        return cand.strftime('%Y-%m-%d')

us_bday = UsBday()
us_offset = us_bday.bday_offset
%timeit date + us_offset # 26.1 µs
# Using numpy directly
%timeit us_bday.nxt_bday(date) # 7.85 µs
My first intuition when I noticed that custom business days are slower was that this is due to the large list of holidays passed to numpy. The timings at the end of the code block, however, show that adding a custom business day with realistic holidays does not alter the performance by much. The main speed difference results from interfacing with numpy and is therefore a Pandas issue.
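As a side note on the weekday-based holiday rules in the class above ("third Monday", "final Monday", "fourth Thursday"): they can be sanity-checked in isolation with two small helpers. The names `nth_weekday` and `last_weekday` are made up for this sketch. The helpers rely on np.busday_offset accepting a 'YYYY-MM' string, which it treats as the first day of that month.

```python
import numpy as np


def nth_weekday(year, month, n, weekday):
    """Return the n-th (1-based) `weekday` of a month, e.g. weekday='Mon'."""
    first_of_month = f"{year:04d}-{month:02d}"
    # Roll forward to the first matching weekday, then step n - 1 further.
    return np.busday_offset(first_of_month, n - 1,
                            roll="forward", weekmask=weekday)


def last_weekday(year, month, weekday):
    """Return the last `weekday` of a month."""
    nxt_year, nxt_month = (year + 1, 1) if month == 12 else (year, month + 1)
    first_of_next = f"{nxt_year:04d}-{nxt_month:02d}"
    # First matching weekday of the following month, minus one such weekday.
    return np.busday_offset(first_of_next, -1,
                            roll="forward", weekmask=weekday)


if __name__ == "__main__":
    print("MLK Day 2011:     ", nth_weekday(2011, 1, 3, "Mon"))
    print("Memorial Day 2011:", last_weekday(2011, 5, "Mon"))
    print("Labor Day 2011:   ", nth_weekday(2011, 9, 1, "Mon"))
    print("Thanksgiving 2011:", nth_weekday(2011, 11, 4, "Thu"))
```

The "last weekday" case goes through the first day of the following month because rolling forward from the first of the month itself and stepping back by one would land in the previous month.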
I know that CustomBusinessDay is still experimental. I hope this feedback helps improve it, because I think it is an important feature in Pandas. It would also be nice to ship certain custom calendars, for example for the US, directly with Pandas.