[READY] perf improvements for strftime by smarie · Pull Request #51298 · pandas-dev/pandas

Thanks @WillAyd .

> I'd suggest getting alignment on the discussion and planning portion of a large, multi-file, API-changing initiative before diving in, as that makes the process more tenable for all parties involved

Agreed. At first (as always? :)) this seemed like a straightforward, simple change. But as I discovered over the past 2 years of implementing it (PR reviews started with @jbrockmendel and @mroeschke in 2022), many intermediate issues were hiding underneath, and smaller steps could be taken first. I therefore first solved #46361, #46759, #47570, #46405, #53003, #51459.
In the meantime, Arrow became more popular and more integrated into pandas.

Unfortunately, even now this PR is still large and touches a lot of files. Surprisingly, this is not because the proposed engine is custom: the engine itself is just a single small file (pandas/_libs/tslibs/strftime.py). It is because pandas' implementation of datetime and period formatting is still quite layered, with a few design inconsistencies. It is nonetheless much better than a couple of months ago, as the team (@mroeschke, I believe) simplified away many redundant alternate formatting functions.

Here is a global picture of the files impacted by this PR:

[image: map of the files impacted by this PR]

In addition, 8 test files, 3 asv benchmark files, and a couple init/api/meson files are updated.

As you can hopefully see if you use this picture to navigate the PR contents, adding the pyarrow engine would not change much of the complexity here - hence the PR would still be hard to review.

Based on this map, could we define a target implementation strategy together? I am ready to break this PR into smaller pieces if you believe that would make it easier to review.
Alternatively, if you think that pandas' timestamp and period objects will soon be replaced with their Arrow equivalents (and therefore all the above structures will be removed from pandas), then indeed maybe there is no need to merge anything apart from the ASV benchmarks, some of the tests, and maybe the CSVFormatter fix :)

Note that the "soon" word is important in the above sentence. Indeed I like pragmatism: if we can bring significant performance improvements to users now, it is probably better than waiting years expecting a future "big refactoring" (replacement of all pandas internals with pyarrow's).

Thanks again for the time you dedicate to this.