ENH: Basis for a StringDtype using Arrow by xhochy · Pull Request #35259 · pandas-dev/pandas

@xhochy Over the last few days, I've begun getting acquainted with fletcher and the string array work in Arrow. At this point I'm not sure where the best place to coordinate would be.

I think it would also be beneficial to contribute to fletcher as part of this exercise. Since the different string methods are tracked as separate issues in fletcher, we could also coordinate there on specific methods.

I would like to keep fletcher separate but try to minimise its codebase over time. Initially, it was developed mainly to give input into Arrow development. For example, all the numba-based algorithms should vanish over time and be replaced by their C++ counterparts in Arrow; pandas should only use the Arrow ones.

The string-related and general purpose things from https://github.com/xhochy/fletcher/blob/master/fletcher/base.py are probably the bits that we need to copy&paste&polish in this PR here. That should already bring us to a working but not ultra-fast dtype. I hope that we need nothing from the algorithms/ folder anymore.

Even once everything here is implemented, there is still a place for fletcher. It will behave slightly differently from the pandas.ArrowStringDtype in that it will return an Arrow-backed Series for all its results. I think the dtype here should return the standard pandas nullable dtypes.
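To make "return the standard pandas nullable dtypes" concrete, here is a small sketch using the existing pandas `string` extension dtype as a stand-in (an assumption for illustration; the Arrow-backed dtype from this PR is not used here). The idea is that string methods with integer or boolean results hand back `Int64` and `boolean` rather than an Arrow-backed result.

```python
import pandas as pd

# Existing nullable string dtype, used here only as a stand-in for the
# Arrow-backed dtype discussed in this PR.
s = pd.Series(["a", "bb", None], dtype="string")

# Integer-valued result: comes back as the nullable Int64 dtype.
lengths = s.str.len()

# Boolean-valued result: comes back as the nullable boolean dtype.
has_a = s.str.contains("a")

print(lengths.dtype)
print(has_a.dtype)
```

An Arrow-backed Series, as fletcher would return, would instead keep these results in Arrow memory and convert only on request.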