BUG/PERF: offsets.apply doesnt preserve nanosecond by sinhrks · Pull Request #7697 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation27 Commits1 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Main Fix is to preserve nanosecond info which can lost during offset.apply
, but it also includes:
- Support dateutil timezone
- Little performance improvement. Even though v0.14.1 should take longer than v0.14.0 because perf test in v0.14 doesn't perform timestamp conversion which was fixed in BUG: offsets.apply may return datetime #7502.
NOTE: This cachesTick.delta
because it was calculated 3 times repeatedly, but does it cause any side effect?
Before
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
timeseries_year_incr | 0.0164 | 0.0103 | 1.5846 |
timeseries_year_apply | 0.0153 | 0.0094 | 1.6356 |
timeseries_day_incr | 0.0187 | 0.0053 | 3.5075 |
timeseries_day_apply | 0.0164 | 0.0033 | 4.9048 |
Target [d0076db] : PERF: Improve index.min and max perf
Base [da0f7ae] : RLS: 0.14.0 final
After the fix
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
timeseries_year_incr | 0.0150 | 0.0087 | 1.7339 |
timeseries_year_apply | 0.0126 | 0.0073 | 1.7283 |
timeseries_day_incr | 0.0130 | 0.0053 | 2.4478 |
timeseries_day_apply | 0.0107 | 0.0033 | 3.2143 |
Target [64dd021] : BUG: offsets.apply doesnt preserve nanosecond
Base [da0f7ae] : RLS: 0.14.0 final
@@ -26,15 +27,17 @@ |
---|
# convert to/from datetime/timestamp to allow invalid Timestamp ranges to pass thru |
def as_timestamp(obj): |
if isinstance(obj, Timestamp): |
return obj |
if type(obj) == date: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be isinstance(obj, date)
? not sure if you need to include (np.datetime64,datetime,date)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because `datetimeis a subclass of
date``. Modified to use``isinstance``.
>>> dt = datetime.datetime(2011, 1, 1)
>>> isinstance(dt, datetime.date)
True
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, Timestamp
can accept datetime.date
. Thus this check seems not to be required at all.
@jreback Rebased. Is any other thing required?
- can you show an example of where this fails (in current master)
- can you point to where
Tick
is calculated several times (currently)?
Affected offset (reset nanosecond)
If the aply logic includes datetime
conversion, nanosecond
will be lost.
- CustomBusinessDay
- CustomBusinessMonthEnd
- CustomBusinessMonthBegin
- MonthBegin
- BusinessMonthBegin
- MonthEnd
- BusinessMonthEnd
- YearBegin
- BYearBegin
- YearEnd
- BYearEnd
- QuarterBegin
- BQuarterBegin
Affected offset (dateutil support)
tz.localize
raises AttributeError: 'tzfile' object has no attribute 'localize'
if tz
is dateutil
timezone.
- BQuarterBegin
- QuarterEnd
- BQuarterEnd
- LastWeekOfMonth
- FY5253Quarter
- FY5253
- WeekOfMonth
- Easter
not sure what you mean (about Tick being calced more than once). The 2nd time is simply int_64 addition. AFAICT. delta_to_nanoseconds
is necessary. What do you think this should change to?
I think that delta
and nanos
could be cache_readonly
. Once you create an offset they are not changed (I don't think). You can try that (but separate PR). I am not sure how to test that, maybe just trace the code and see.
@sinhrks otherwise this looks ok. It just changes a lot of code so trying to review.
OK, modified to normal property
.
seems, apply_wraps
is on every apply, except for in DateOffset
. maybe add a note there why this is (or is it right?)
value = result.value |
---|
result = Timestamp(value + nano) |
if tz is not None and result.tzinfo is None: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be result= tslib._localize_pydatetime(result, tz)
as well here?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is a flow for Timestamp
, no need to care for datetime
here.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, maybe add a note (or you can simply use the other routine). I found it confusing?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Modified to use tslib._localize_pydatetime(result, tz)
to avoid any confusion.
DateOffset
needs apply_wraps
. I missed because of misunderstanding that DateOffset
cannot be used by itself (#7375). Fixed and added tests.
jreback added a commit that referenced this pull request
BUG/PERF: offsets.apply doesnt preserve nanosecond
@sinhrks thanks for this...cleans up a large amount of code.....
2 participants