ENH: ExtensionEngine by jbrockmendel · Pull Request #45514 · pandas-dev/pandas

(given that I asked for the status of this work, a small ping is always handy)

Repeating my question from the original PR (#43930):

Can you give a bit more high-level context on how you approached it?
For example, for the NullableEngine (here only ExtensionEngine for now), you are currently not using any hash table. Did you look at that and decide it is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?
The general ExtensionEngine seems to work with an actual ExtensionArray. An alternative could be to have it work on an ndarray that the EA could provide? A potential disadvantage of that approach is that such an ndarray always needs to be materialized (which hurts if this EA -> ndarray conversion is costly), while the current way doesn't need that (but also cannot make use of the existing optimized engines). So for the general ExtensionEngine, this is maybe a good approach. But are you also planning to do that for the NullableEngine?
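To make the trade-off concrete, here is a rough pure-Python sketch of the two designs. All names here (`ToyExtensionArray`, `ArrayBackedEngine`, `MaterializingEngine`, `to_plain`) are hypothetical and are not pandas internals or the code in this PR; the sketch also assumes unique values, and a plain list stands in for the ndarray.

```python
from functools import cached_property


class ToyExtensionArray:
    """Hypothetical stand-in for an ExtensionArray; not a pandas class."""

    def __init__(self, values):
        self._values = list(values)
        self.conversions = 0  # counts how often the costly conversion runs

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        return self._values[i]

    def to_plain(self):
        # Stands in for a potentially costly EA -> ndarray conversion.
        self.conversions += 1
        return list(self._values)


class ArrayBackedEngine:
    """Works on the EA directly: no conversion, but only a linear scan."""

    def __init__(self, arr):
        self.arr = arr

    def get_loc(self, key):
        for i in range(len(self.arr)):
            if self.arr[i] == key:
                return i
        raise KeyError(key)


class MaterializingEngine:
    """Converts once on first use, then answers lookups from a hash table."""

    def __init__(self, arr):
        self.arr = arr

    @cached_property
    def mapping(self):
        # The conversion cost is paid here, once, and only if a lookup happens.
        plain = self.arr.to_plain()
        return {value: pos for pos, value in enumerate(plain)}

    def get_loc(self, key):
        return self.mapping[key]
```

The first engine never pays for the conversion but cannot reuse a hash-based lookup; the second pays for it once and gets constant-time lookups afterwards.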

This might also require some performance checking to investigate which approach is preferable. For example, I did a quick test comparing the performance of the ExtensionEngine (doesn't use a hash table) vs the ObjectEngine (uses a hash table, and was the fallback before this PR) on one use case (get_indexer, which makes use of the hash table if possible):

In [1]: idx_ea = pd.Index(np.arange(1_000_000), dtype="Int64")

In [2]: idx_object = idx_ea.astype(object)

In [4]: indexer = np.arange(500, 1000)

In [5]: %timeit idx_ea.get_indexer(indexer)
19 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit idx_object.get_indexer(indexer)
122 µs ± 4.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So this is a case where the custom ExtensionEngine of this PR actually caused a big slowdown compared to the object engine fallback that we used before for Index[EA].
(Now, I didn't profile this specifically; there might be a bottleneck in some EA method that gets called here, so there might also be other ways to improve the performance here instead of using a hash table.)

(And specifically for the masked arrays that I used in this example, this might also be solvable in a specialized NullableEngine.)
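As a rough illustration of what a hash-based lookup for masked data could look like, here is a pure-Python sketch. It is an assumption-laden toy, not the engine in this PR and not a claim about where the 19 ms above is spent: the function names are hypothetical, `values` and `mask` are parallel sequences mirroring the data/mask layout of masked arrays, `None` stands in for a missing value, and the non-missing values are assumed unique.

```python
def build_table(values, mask):
    """Map each non-missing value to its position; track the first NA separately."""
    table = {}
    na_pos = -1
    for i, (value, is_na) in enumerate(zip(values, mask)):
        if is_na:
            if na_pos == -1:
                na_pos = i
        else:
            table[value] = i
    return table, na_pos


def get_indexer_hashed(table, na_pos, targets):
    """One dict lookup per target; -1 marks targets that are not found."""
    return [na_pos if t is None else table.get(t, -1) for t in targets]


def get_indexer_scan(values, mask, targets):
    """Baseline without a hash table: one linear scan per target."""
    out = []
    for t in targets:
        pos = -1
        for i, (value, is_na) in enumerate(zip(values, mask)):
            if is_na:
                if t is None:
                    pos = i
                    break
            elif t is not None and value == t:
                pos = i
                break
        out.append(pos)
    return out


# The 0 at position 1 is only a placeholder under the mask, so looking up 0
# must not match it.
values = [5, 0, 7, 9]
mask = [False, True, False, False]
table, na_pos = build_table(values, mask)
hashed = get_indexer_hashed(table, na_pos, [7, None, 0, 9])
scanned = get_indexer_scan(values, mask, [7, None, 0, 9])
```

Both functions return the same positions; the difference is that the hashed version does constant work per target after a one-time table build, while the scan does work proportional to the index length for every target.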