ENH: allow storing ExtensionArrays in Index by jbrockmendel · Pull Request #43930 · pandas-dev/pandas

The general ExtensionEngine seems to work with an actual ExtensionArray. An alternative could be to have it work on an ndarray that the EA could provide? A potential disadvantage of that approach is that such an ndarray always needs to be materialized (which matters if the EA -> ndarray conversion is costly), while the current way doesn't need that (but also cannot make use of the existing optimized engines).

I chose to keep the EA intact instead of casting to ndarray because most cases (for get_loc, the main concern) can use EA methods (searchsorted, __eq__) directly.
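To make the point about EA methods concrete, here is a minimal sketch of how a get_loc-style lookup could be built from searchsorted and __eq__ alone, without casting to ndarray. This is an illustration under assumptions, not the code in this PR: `get_loc_sketch` and its `is_sorted` flag are hypothetical names, and the caller is assumed to know whether the values are sorted and free of NA.

```python
import numpy as np
import pandas as pd


def get_loc_sketch(values, key, is_sorted):
    """Hypothetical helper: locate `key` using only ExtensionArray methods."""
    if is_sorted:
        # Sorted values: two binary searches bound the run of matches.
        left = int(values.searchsorted(key, side="left"))
        right = int(values.searchsorted(key, side="right"))
        if left == right:
            raise KeyError(key)
        return left if right - left == 1 else slice(left, right)

    # Unsorted values: elementwise __eq__, treating NA comparisons as False.
    eq = values == key
    if hasattr(eq, "to_numpy"):
        mask = eq.to_numpy(dtype=bool, na_value=False)
    else:
        mask = np.asarray(eq, dtype=bool)

    locs = np.flatnonzero(mask)
    if len(locs) == 0:
        raise KeyError(key)
    # One match gives an integer position, several give a boolean mask.
    return int(locs[0]) if len(locs) == 1 else mask


sorted_arr = pd.array([1, 2, 2, 5], dtype="Int64")
unsorted_arr = pd.array([3, 1, pd.NA, 1], dtype="Int64")
```

The return convention (integer, slice, or boolean mask) mirrors what Index.get_loc documents; everything else here is a simplification.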

I suspect many cases will be able to use NDArrayBackedExtensionArray, in which case using one of the non-EA engines would be nice. I haven't implemented a way of doing that.

For example, for the NullableEngine, you are currently not using any hash table. Did you look at that / decide that is not possible or desirable? Or is that a potential future improvement, and you are focusing first on getting it working with a base implementation?

Right, first trying to get everything working, then will look at optimizations. (Also for NullableEngine.get_loc at least I have a different optimization in mind I want to try first).
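For readers following the hash-table question, the idea of a mask-aware lookup table can be sketched in pure Python. This is only an illustration of the concept: `build_lookup` and `lookup` are hypothetical names, a plain dict stands in for the Cython hash tables pandas uses, and it is not the optimization mentioned above.

```python
from collections import defaultdict


def build_lookup(data, mask):
    """Map each valid value to its positions; masked positions are kept apart."""
    table = defaultdict(list)
    na_positions = []
    for pos, (value, is_na) in enumerate(zip(data, mask)):
        if is_na:
            # The value stored under a masked slot is a placeholder; never hash it.
            na_positions.append(pos)
        else:
            table[value].append(pos)
    return dict(table), na_positions


def lookup(table, na_positions, key, key_is_na=False):
    """Return a single position, or a list of positions for duplicates."""
    positions = na_positions if key_is_na else table.get(key, [])
    if not positions:
        raise KeyError(key)
    return positions[0] if len(positions) == 1 else positions


# data/mask pair in the style of a masked array: position 1 is missing.
table, na_positions = build_lookup([10, 0, 20, 10], [False, True, False, False])
```

Keeping the masked positions out of the table is what stops the placeholder stored under a missing slot (0 here) from matching a real key.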

For reviewability: I suppose that in theory some of the changes in the Index class are mostly there to allow storing an EA in the Index, somewhat independent of the engine changes, and thus could be done separately?

Yep, a bunch of my recent PRs have been exactly that. More coming up, e.g. eq_NA_compat fixes problems with Index[object] containing pd.NA (though the function needs to be re-written), so I'll break that off before long. Also the float16 check in FloatingArray and the isna check in testing.pyx.

The nullable dtypes could use an object dtype array for the engine (if that's not buggy with NA), and that could then also work for starting the Index implementation and tests.

ATM the NullableEngine isn't a pain point. The remaining test failures are mostly in setops (xref #44000) and value_counts ordering.