[READY] Improved performance of Period
's default formatter (period_format
) by smarie · Pull Request #51459 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation48 Commits53 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Period
's default formatter (period_format
) is now significantly (~twice) faster.
- This improves performance of
str(Period)
,repr(Period)
, andPeriod.strftime(fmt=None)
, - as well as
PeriodArray.strftime(fmt=None)
,PeriodIndex.strftime(fmt=None)
andPeriodIndex.format(fmt=None)
. - Finally,
to_csv
operations involvingPeriodArray
orPeriodIndex
with defaultdate_format
are also significantly accelerated.
ASVs:
> asv compare d82f9ddb c1bbfd6c
before after ratio
[d82f9ddb] [c1bbfd6c]
- 3.12±0.05μs 998±40ns 0.32 tslibs.period.PeriodUnaryMethods.time_repr('M')
- 3.31±0.05μs 1.58±0.1μs 0.48 tslibs.period.PeriodUnaryMethods.time_repr('min')
- 3.15±0.09μs 1.05±0.07μs 0.33 tslibs.period.PeriodUnaryMethods.time_str('M')
- 3.20±0.05μs 1.75±0.1μs 0.55 tslibs.period.PeriodUnaryMethods.time_str('min')
- 2.90±0.1μs 850±40ns 0.29 tslibs.period.PeriodUnaryMethods.time_strftime_default('M')
- 3.01±0.01μs 1.48±0.1μs 0.49 tslibs.period.PeriodUnaryMethods.time_strftime_default('min')
- 4.16±0.05ms 2.37±0.08ms 0.57 strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'D')
- 4.35±0.06ms 2.63±0.1ms 0.60 strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'H')
- 40.3±0.6ms 21.8±0.2ms 0.54 strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'D')
- 43.5±0.2ms 26.3±1ms 0.60 strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'H')
4.58±0.08ms 2.88±0.07ms ~0.63 strftime.PeriodStrftime.time_frame_period_formatting_index_default(1000, 'D')
4.73±0.2ms 3.02±0.05ms ~0.64 strftime.PeriodStrftime.time_frame_period_formatting_index_default(1000, 'H')
- 42.8±0.3ms 24.0±0.5ms 0.56 strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'D')
- 43.2±0.3ms 27.3±0.3ms 0.63 strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'H')
- 4.16±0.05ms 2.41±0.3ms 0.58 strftime.PeriodStrftime.time_frame_period_to_str(1000, 'D')
- 4.28±0.03ms 2.71±0.1ms 0.63 strftime.PeriodStrftime.time_frame_period_to_str(1000, 'H')
- 41.6±0.7ms 22.7±0.2ms 0.55 strftime.PeriodStrftime.time_frame_period_to_str(10000, 'D')
- 42.9±0.6ms 26.0±0.5ms 0.61 strftime.PeriodStrftime.time_frame_period_to_str(10000, 'H')
EDIT: additional for to_csv
- 5.94±0.05ms 3.94±0.3ms 0.66 io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'D')
- 6.39±0.1ms 4.37±0.04ms 0.68 io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'H')
- 51.8±0.2ms 30.2±0.3ms 0.58 io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'D')
- 59.8±20ms 35.0±0.2ms 0.59 io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'H')
- 7.09±0.2ms 3.94±0.2ms 0.56 io.csv.ToCSVPeriod.time_frame_period_no_format(1000, 'D')
- 6.72±0.1ms 4.30±0.03ms 0.64 io.csv.ToCSVPeriod.time_frame_period_no_format(1000, 'H')
- 54.0±2ms 30.5±0.3ms 0.56 io.csv.ToCSVPeriod.time_frame_period_no_format(10000, 'D')
- 54.4±1ms 35.3±0.5ms 0.65 io.csv.ToCSVPeriod.time_frame_period_no_format(10000, 'H')
9.94±0.3ms 9.70±0.1ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'D')
9.69±0.07ms 9.84±0.7ms 1.02 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'H')
85.4±0.9ms 84.9±0.9ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'D')
89.3±2ms 87.3±3ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'H')
- 6.32±0.3ms 3.78±0.03ms 0.60 io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(1000, 'D')
- 6.30±0.2ms 4.60±0.4ms 0.73 io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(1000, 'H')
50.9±2ms 33.9±8ms ~0.66 io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(10000, 'D')
52.6±1ms 35.6±6ms ~0.68 io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(10000, 'H')
- closes one part of PERF: strftime is slow #44764
- All code checks passed.
- Added an entry in the latest
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.
Sylvain MARIE added 3 commits
…Added corresponding ASVs
…ure/44764_perf_issue_new_period
It seems that the failing tests are not related to this PR's contents, do you confirm ?
Sylvain MARIE added 3 commits
I added a few asvs for the to_csv operations and updated the what's new. This is now ready for review @MarcoGorelli
Note that I do not understand why ToCSVPeriod.time_frame_period_formatting
is accelerated too, since it has a custom date_format. Maybe there is some default period formatting operation happening as an intermediate step ? To investigate... in another PR maybe
I found the reason and added a docstring on that bench function. The date_format
argument is not used by to_csv
when the PeriodArray
is not the index of the dataframe, so the formatting uses the default format, and therefore it is impacted by the acceleration.
…t was improved: the date_format parameter is not taken into account today.
Note: the failing test seeems to be related to a network issue, so completely unrelated.
FAILED pandas/tests/io/parser/common/test_file_buffer_url.py::test_file_descriptor_leak[c_high] - AssertionError: ([], [pconn(fd=14, family=<AddressFamily.AF_INET6: 10>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='::1', port=35970), raddr=addr(ip='::1', port=3306), status='ESTABLISHED')])
@mroeschke since you have reviewed my similar PRs for strftime in the past, do not hesitate to jump in too.
Note that once this PR will be merged, #51298 will be finally small enough to be revewed
smarie changed the title
Improved performance of [READY] Improved performance of Period
's default formatter (period_format
)Period
's default formatter (period_format
)
…ure/44764_perf_issue_new_period
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one, the changes themselves look good to me! Just got a question
Regarding date_format
only taking effect for Period
if it's in the index - that looks like a bug to me, well done for noticing! I don't see any issue open about this, do you want to open one? (else I can, no worries)
Comment on lines 78 to 86
def time_frame_period_formatting_custom(self, obs, fq): |
---|
self.data["p"].dt.strftime(date_format="%Y-%m-%d --- %H:%M:%S") |
def time_frame_period_formatting_iso8601_strftime_Z(self, obs, fq): |
self.data["p"].dt.strftime(date_format="%Y-%m-%dT%H:%M:%SZ") |
def time_frame_period_formatting_iso8601_strftime_offset(self, obs, fq): |
"""Not optimized yet as %z is not supported by `convert_strftime_format`""" |
self.data["p"].dt.strftime(date_format="%Y-%m-%dT%H:%M:%S%z") |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do these not show up in the asv results you posted?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, it seems that I forgot them in the copy paste. I'll try to rerun asv partially to get them
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: Actually I had removed all asvs that were expected not to show improvements (custom date formats). Here are all ASVs for PeriodStrftime
:
15.4±0.2ms 17.1±2ms ~1.11 strftime.PeriodStrftime.time_frame_period_formatting_custom(1000, 'D')
15.0±0.7ms 14.1±0.2ms 0.94 strftime.PeriodStrftime.time_frame_period_formatting_custom(1000, 'H')
142±6ms 153±5ms 1.08 strftime.PeriodStrftime.time_frame_period_formatting_custom(10000, 'D')
143±5ms 157±7ms 1.10 strftime.PeriodStrftime.time_frame_period_formatting_custom(10000, 'H')
- 4.37±0.07ms 2.24±0.06ms 0.51 strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'D')
- 4.72±0.06ms 2.90±0.1ms 0.61 strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'H')
- 42.8±1ms 24.1±0.7ms 0.56 strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'D')
- 48.4±0.5ms 24.4±0.2ms 0.50 strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'D')
- 50.9±3ms 27.1±0.7ms 0.53 strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'H')
7.52±0.7ms 7.22±0.3ms 0.96 strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(1000, 'D')
7.12±0.1ms 7.00±0.3ms 0.98 strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(1000, 'H')
75.2±4ms 68.1±1ms ~0.91 strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(10000, 'D')
71.9±4ms 77.3±8ms 1.08 strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(10000, 'H')
13.5±0.5ms 14.0±0.4ms 1.03 strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(1000, 'D')
15.9±3ms 14.4±1ms ~0.91 strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(1000, 'H')
135±3ms 148±7ms 1.10 strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(10000, 'D')
140±3ms 136±2ms 0.97 strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(10000, 'H')
Thanks @MarcoGorelli !
Regarding date_format only taking effect for Period if it's in the index - that looks like a bug to me, well done for noticing! I don't see any issue open about this, do you want to open one? (else I can, no worries)
Done #51621
Sylvain MARIE added 2 commits
…ure/44764_perf_issue_new_period
Conflicts:
doc/source/whatsnew/v1.5.4.rst
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks @smarie !
Leaving open for a bit in case others have comments
yup taking a look now (was off last week)
return f"{dts.year}" |
---|
elif freq_group == FR_QTR and (is_fmt_none or fmt == "%FQ%q"): |
# get quarter and modify dts.year to be the fiscal year (?) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the changes themselves look good, I'm just confused by this (?)
. are you asking for confirmation?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes @MarcoGorelli , "fiscal year" is the wording to confirm in this comment, I wasn't sure that this was the right definition.
get_yq
docstring is not very explicit about this, apart from stating Sets dts.year in-place.
Maybe this docstring could be improved according to the actual answer.
cdef int get_yq(int64_t ordinal, int freq, npy_datetimestruct* dts): |
---|
Sylvain MARIE added 4 commits
…ure/44764_perf_issue_new_period
Conflicts:
doc/source/whatsnew/v2.1.0.rst
…iod' into feature/44764_perf_issue_new_period
…ure/44764_perf_issue_new_period
ok should be good - I'll give it a final look and merge this week
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had another look, and to be honest I'm slightly scared about merging something this user-facing
What I'm most concerned about is that not all these reprs seems to be explicitly tested for. Some of them are, e.g.
def test_millisecond_repr(self): |
---|
p = Period("2000-01-01 12:15:02.123") |
assert repr(p) == "Period('2000-01-01 12:15:02.123', 'L')" |
def test_microsecond_repr(self): |
p = Period("2000-01-01 12:15:02.123567") |
assert repr(p) == "Period('2000-01-01 12:15:02.123567', 'U')" |
but then FR_SEC
isn't (as far as I can tell).
Reckon we should add such unit tests for each period group (perhaps in a separate PR, which we could get in very quickly)? That would give me much more confidence about merging this
Really sorry this is taking so long
Reckon we should add such unit tests for each period group (perhaps in a separate PR, which we could get in very quickly)? That would give me much more confidence about merging this
I did not review the available tests extensively so maybe this is indeed needed. I'll have a look one of these days
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Yes this is still active and even ready for merge, but it would require that another PR pushes a few more tests first so that @MarcoGorelli is more confident about the non-regressive impact :)
…ure/44764_perf_issue_new_period
Conflicts:
doc/source/whatsnew/v2.1.0.rst
Now that #53003 is merged this should be ok - all new tests seem to pass.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool, let's get this in, thanks @smarie !
Great news @MarcoGorelli thanks !
I'll now finally move back to the main PR of this thread: #51298
Stay tuned ;)
Thanks again !
Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request
…_format`) (pandas-dev#51459)
Improved performance of default period formatting (
period_format
). Added corresponding ASVsImproved ASV for period frames and datetimes
What's new
Update asv_bench/benchmarks/strftime.py
Fixed whats new backticks
Completed whatsnew
Added ASVs for to_csv for period
Aligned the namings
Completed Whats new
Added a docstring explaining why the ASV bench with custom date format was improved: the date_format parameter is not taken into account today.
Moved whatsnew to 2.0.0
Moved whatsnew to 2.1
Improved docstring as per code review
Renamed asv params as per code review
Fixed ASV comment as per code review
ASV: renamed parameters as per code review
Improved
period_format
: now the performance is the same when no format is provided and if an explicitfmt
is provided matching the actual default format for thisfreq
Code review: Improved strftime ASV: set_index is now in the setup
Removed useless main
Removed wrong code
Improved ASVs for period formatting: now there is a "default explicit" everywhere
Update pandas/_libs/tslibs/period.pyx
Update pandas/_libs/tslibs/period.pyx
Update pandas/_libs/tslibs/period.pyx
Minor refactoring to avoid retesting for none several time
Fixed issue: bool does not exist, using bint
Added missing quarter variable as cdef
Fixed asv bug
Code review: fixed docstring
Update doc/source/whatsnew/v2.1.0.rst
fixup
Co-authored-by: Sylvain MARIE sylvain.marie@se.com Co-authored-by: Marco Edward Gorelli 33491632+MarcoGorelli@users.noreply.github.com Co-authored-by: MarcoGorelli <>
smarie deleted the feature/44764_perf_issue_new_period branch
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request
…_format`) (pandas-dev#51459)
Improved performance of default period formatting (
period_format
). Added corresponding ASVsImproved ASV for period frames and datetimes
What's new
Update asv_bench/benchmarks/strftime.py
Fixed whats new backticks
Completed whatsnew
Added ASVs for to_csv for period
Aligned the namings
Completed Whats new
Added a docstring explaining why the ASV bench with custom date format was improved: the date_format parameter is not taken into account today.
Moved whatsnew to 2.0.0
Moved whatsnew to 2.1
Improved docstring as per code review
Renamed asv params as per code review
Fixed ASV comment as per code review
ASV: renamed parameters as per code review
Improved
period_format
: now the performance is the same when no format is provided and if an explicitfmt
is provided matching the actual default format for thisfreq
Code review: Improved strftime ASV: set_index is now in the setup
Removed useless main
Removed wrong code
Improved ASVs for period formatting: now there is a "default explicit" everywhere
Update pandas/_libs/tslibs/period.pyx
Update pandas/_libs/tslibs/period.pyx
Update pandas/_libs/tslibs/period.pyx
Minor refactoring to avoid retesting for none several time
Fixed issue: bool does not exist, using bint
Added missing quarter variable as cdef
Fixed asv bug
Code review: fixed docstring
Update doc/source/whatsnew/v2.1.0.rst
fixup
Co-authored-by: Sylvain MARIE sylvain.marie@se.com Co-authored-by: Marco Edward Gorelli 33491632+MarcoGorelli@users.noreply.github.com Co-authored-by: MarcoGorelli <>