[READY] Improved performance of Period's default formatter (period_format) by smarie · Pull Request #51459 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation48 Commits53 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

smarie

Period's default formatter (period_format) is now significantly (~twice) faster.

ASVs:

> asv compare d82f9ddb c1bbfd6c
       before           after         ratio
     [d82f9ddb]       [c1bbfd6c]
-     3.12±0.05μs         998±40ns     0.32  tslibs.period.PeriodUnaryMethods.time_repr('M')
-     3.31±0.05μs       1.58±0.1μs     0.48  tslibs.period.PeriodUnaryMethods.time_repr('min')
-     3.15±0.09μs      1.05±0.07μs     0.33  tslibs.period.PeriodUnaryMethods.time_str('M')
-     3.20±0.05μs       1.75±0.1μs     0.55  tslibs.period.PeriodUnaryMethods.time_str('min')
-      2.90±0.1μs         850±40ns     0.29  tslibs.period.PeriodUnaryMethods.time_strftime_default('M')
-     3.01±0.01μs       1.48±0.1μs     0.49  tslibs.period.PeriodUnaryMethods.time_strftime_default('min')
-     4.16±0.05ms      2.37±0.08ms     0.57  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'D')
-     4.35±0.06ms       2.63±0.1ms     0.60  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'H')
-      40.3±0.6ms       21.8±0.2ms     0.54  strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'D')
-      43.5±0.2ms         26.3±1ms     0.60  strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'H')
      4.58±0.08ms      2.88±0.07ms    ~0.63  strftime.PeriodStrftime.time_frame_period_formatting_index_default(1000, 'D')
       4.73±0.2ms      3.02±0.05ms    ~0.64  strftime.PeriodStrftime.time_frame_period_formatting_index_default(1000, 'H')
-      42.8±0.3ms       24.0±0.5ms     0.56  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'D')
-      43.2±0.3ms       27.3±0.3ms     0.63  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'H')
-     4.16±0.05ms       2.41±0.3ms     0.58  strftime.PeriodStrftime.time_frame_period_to_str(1000, 'D')
-     4.28±0.03ms       2.71±0.1ms     0.63  strftime.PeriodStrftime.time_frame_period_to_str(1000, 'H')
-      41.6±0.7ms       22.7±0.2ms     0.55  strftime.PeriodStrftime.time_frame_period_to_str(10000, 'D')
-      42.9±0.6ms       26.0±0.5ms     0.61  strftime.PeriodStrftime.time_frame_period_to_str(10000, 'H')

EDIT: additional for to_csv

-     5.94±0.05ms       3.94±0.3ms     0.66  io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'D') 
-      6.39±0.1ms      4.37±0.04ms     0.68  io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'H') 
-      51.8±0.2ms       30.2±0.3ms     0.58  io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'D')
-       59.8±20ms       35.0±0.2ms     0.59  io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'H')
-      7.09±0.2ms       3.94±0.2ms     0.56  io.csv.ToCSVPeriod.time_frame_period_no_format(1000, 'D')  
-      6.72±0.1ms      4.30±0.03ms     0.64  io.csv.ToCSVPeriod.time_frame_period_no_format(1000, 'H')  
-        54.0±2ms       30.5±0.3ms     0.56  io.csv.ToCSVPeriod.time_frame_period_no_format(10000, 'D') 
-        54.4±1ms       35.3±0.5ms     0.65  io.csv.ToCSVPeriod.time_frame_period_no_format(10000, 'H')
       9.94±0.3ms       9.70±0.1ms     0.98  io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'D')
      9.69±0.07ms       9.84±0.7ms     1.02  io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'H')
       85.4±0.9ms       84.9±0.9ms     0.99  io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'D')
         89.3±2ms         87.3±3ms     0.98  io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'H')
-      6.32±0.3ms      3.78±0.03ms     0.60  io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(1000, 'D')
-      6.30±0.2ms       4.60±0.4ms     0.73  io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(1000, 'H')
         50.9±2ms         33.9±8ms    ~0.66  io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(10000, 'D')
         52.6±1ms         35.6±6ms    ~0.68  io.csv.ToCSVPeriodIndex.time_frame_period_no_format_index(10000, 'H')

Sylvain MARIE added 3 commits

February 17, 2023 09:15

…Added corresponding ASVs

smarie

@smarie

…ure/44764_perf_issue_new_period

@smarie

It seems that the failing tests are not related to this PR's contents, do you confirm ?

Sylvain MARIE added 3 commits

February 18, 2023 15:55

@smarie

I added a few asvs for the to_csv operations and updated the what's new. This is now ready for review @MarcoGorelli

Note that I do not understand why ToCSVPeriod.time_frame_period_formatting is accelerated too, since it has a custom date_format. Maybe there is some default period formatting operation happening as an intermediate step ? To investigate... in another PR maybe

I found the reason and added a docstring on that bench function. The date_format argument is not used by to_csv when the PeriodArray is not the index of the dataframe, so the formatting uses the default format, and therefore it is impacted by the acceleration.

…t was improved: the date_format parameter is not taken into account today.

@smarie

Note: the failing test seeems to be related to a network issue, so completely unrelated.

FAILED pandas/tests/io/parser/common/test_file_buffer_url.py::test_file_descriptor_leak[c_high] - AssertionError: ([], [pconn(fd=14, family=<AddressFamily.AF_INET6: 10>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='::1', port=35970), raddr=addr(ip='::1', port=3306), status='ESTABLISHED')])

@mroeschke since you have reviewed my similar PRs for strftime in the past, do not hesitate to jump in too.

Note that once this PR will be merged, #51298 will be finally small enough to be revewed

@smarie smarie changed the titleImproved performance of Period's default formatter (period_format) [READY] Improved performance of Period's default formatter (period_format)

Feb 21, 2023

…ure/44764_perf_issue_new_period

MarcoGorelli

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one, the changes themselves look good to me! Just got a question

Regarding date_format only taking effect for Period if it's in the index - that looks like a bug to me, well done for noticing! I don't see any issue open about this, do you want to open one? (else I can, no worries)

Comment on lines 78 to 86

def time_frame_period_formatting_custom(self, obs, fq):
self.data["p"].dt.strftime(date_format="%Y-%m-%d --- %H:%M:%S")
def time_frame_period_formatting_iso8601_strftime_Z(self, obs, fq):
self.data["p"].dt.strftime(date_format="%Y-%m-%dT%H:%M:%SZ")
def time_frame_period_formatting_iso8601_strftime_offset(self, obs, fq):
"""Not optimized yet as %z is not supported by `convert_strftime_format`"""
self.data["p"].dt.strftime(date_format="%Y-%m-%dT%H:%M:%S%z")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do these not show up in the asv results you posted?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, it seems that I forgot them in the copy paste. I'll try to rerun asv partially to get them

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: Actually I had removed all asvs that were expected not to show improvements (custom date formats). Here are all ASVs for PeriodStrftime:

       15.4±0.2ms         17.1±2ms    ~1.11  strftime.PeriodStrftime.time_frame_period_formatting_custom(1000, 'D')
       15.0±0.7ms       14.1±0.2ms     0.94  strftime.PeriodStrftime.time_frame_period_formatting_custom(1000, 'H')
          142±6ms          153±5ms     1.08  strftime.PeriodStrftime.time_frame_period_formatting_custom(10000, 'D')
          143±5ms          157±7ms     1.10  strftime.PeriodStrftime.time_frame_period_formatting_custom(10000, 'H')
-     4.37±0.07ms      2.24±0.06ms     0.51  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'D')
-     4.72±0.06ms       2.90±0.1ms     0.61  strftime.PeriodStrftime.time_frame_period_formatting_default(1000, 'H')
-        42.8±1ms       24.1±0.7ms     0.56  strftime.PeriodStrftime.time_frame_period_formatting_default(10000, 'D')
-      48.4±0.5ms       24.4±0.2ms     0.50  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'D')
-        50.9±3ms       27.1±0.7ms     0.53  strftime.PeriodStrftime.time_frame_period_formatting_index_default(10000, 'H')
       7.52±0.7ms       7.22±0.3ms     0.96  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(1000, 'D')
       7.12±0.1ms       7.00±0.3ms     0.98  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(1000, 'H')
         75.2±4ms         68.1±1ms    ~0.91  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(10000, 'D')
         71.9±4ms         77.3±8ms     1.08  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_Z(10000, 'H')
       13.5±0.5ms       14.0±0.4ms     1.03  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(1000, 'D')
         15.9±3ms         14.4±1ms    ~0.91  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(1000, 'H')
          135±3ms          148±7ms     1.10  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(10000, 'D')
          140±3ms          136±2ms     0.97  strftime.PeriodStrftime.time_frame_period_formatting_iso8601_strftime_offset(10000, 'H')

@smarie

Thanks @MarcoGorelli !

Regarding date_format only taking effect for Period if it's in the index - that looks like a bug to me, well done for noticing! I don't see any issue open about this, do you want to open one? (else I can, no worries)

Done #51621

Sylvain MARIE added 2 commits

February 28, 2023 13:43

…ure/44764_perf_issue_new_period

Conflicts:

doc/source/whatsnew/v1.5.4.rst

MarcoGorelli

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @smarie !

Leaving open for a bit in case others have comments

@smarie

@smarie

@MarcoGorelli

yup taking a look now (was off last week)

MarcoGorelli

return f"{dts.year}"
elif freq_group == FR_QTR and (is_fmt_none or fmt == "%FQ%q"):
# get quarter and modify dts.year to be the fiscal year (?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the changes themselves look good, I'm just confused by this (?). are you asking for confirmation?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes @MarcoGorelli , "fiscal year" is the wording to confirm in this comment, I wasn't sure that this was the right definition.

get_yq docstring is not very explicit about this, apart from stating Sets dts.year in-place.
Maybe this docstring could be improved according to the actual answer.

cdef int get_yq(int64_t ordinal, int freq, npy_datetimestruct* dts):

Sylvain MARIE added 4 commits

March 14, 2023 22:24

…ure/44764_perf_issue_new_period

Conflicts:

doc/source/whatsnew/v2.1.0.rst

…iod' into feature/44764_perf_issue_new_period

…ure/44764_perf_issue_new_period

@MarcoGorelli

ok should be good - I'll give it a final look and merge this week

@smarie

MarcoGorelli

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had another look, and to be honest I'm slightly scared about merging something this user-facing

What I'm most concerned about is that not all these reprs seems to be explicitly tested for. Some of them are, e.g.

def test_millisecond_repr(self):
p = Period("2000-01-01 12:15:02.123")
assert repr(p) == "Period('2000-01-01 12:15:02.123', 'L')"
def test_microsecond_repr(self):
p = Period("2000-01-01 12:15:02.123567")
assert repr(p) == "Period('2000-01-01 12:15:02.123567', 'U')"

but then FR_SEC isn't (as far as I can tell).

Reckon we should add such unit tests for each period group (perhaps in a separate PR, which we could get in very quickly)? That would give me much more confidence about merging this

Really sorry this is taking so long

@smarie

Reckon we should add such unit tests for each period group (perhaps in a separate PR, which we could get in very quickly)? That would give me much more confidence about merging this

I did not review the available tests extensively so maybe this is indeed needed. I'll have a look one of these days

@github-actions

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@smarie

Yes this is still active and even ready for merge, but it would require that another PR pushes a few more tests first so that @MarcoGorelli is more confident about the non-regressive impact :)

…ure/44764_perf_issue_new_period

Conflicts:

doc/source/whatsnew/v2.1.0.rst

@smarie

Now that #53003 is merged this should be ok - all new tests seem to pass.

MarcoGorelli

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, let's get this in, thanks @smarie !

MarcoGorelli

@MarcoGorelli

@MarcoGorelli

@smarie

Great news @MarcoGorelli thanks !
I'll now finally move back to the main PR of this thread: #51298
Stay tuned ;)

Thanks again !

Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request

May 19, 2023

@smarie @MarcoGorelli

…_format`) (pandas-dev#51459)


Co-authored-by: Sylvain MARIE sylvain.marie@se.com Co-authored-by: Marco Edward Gorelli 33491632+MarcoGorelli@users.noreply.github.com Co-authored-by: MarcoGorelli <>

@smarie smarie deleted the feature/44764_perf_issue_new_period branch

May 22, 2023 16:17

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request

Jul 8, 2023

@smarie @MarcoGorelli

…_format`) (pandas-dev#51459)


Co-authored-by: Sylvain MARIE sylvain.marie@se.com Co-authored-by: Marco Edward Gorelli 33491632+MarcoGorelli@users.noreply.github.com Co-authored-by: MarcoGorelli <>