ENH: NumericIndex for any numpy int/uint/float dtype by topper-123 · Pull Request #41153 · pandas-dev/pandas
Thanks for the comments. My views:
> I think the same issue arises when using NumIndex? (Also, in that case isinstance(idx, Int64Index) won't work anymore.) But I assume that the idea is that pd.Index(..., dtype="int16") would still return Int64Index for now, and you would need to use pd.NumIndex explicitly to get an index with int16 values?
Yes, that was my idea. That would keep this addition 100% backwards compatible. Users who want e.g. int16 would explicitly do pd.NumIndex(..., dtype="int16").
> That indeed gives backwards compatibility, but it also limits the usage of this improvement quite a bit. For example, doing a set_index with an int32 column would still convert the data to int64.
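
The set_index conversion mentioned here can be checked with a short script. This is only a sketch and not part of the PR: the column names are made up, and the printed index dtype depends on the installed pandas version (int64 on versions where integer indexes are always stored as Int64Index, as discussed in this thread).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an int32 column that we want to use as the index.
df = pd.DataFrame(
    {
        "key": np.array([1, 2, 3], dtype="int32"),
        "value": [10.0, 20.0, 30.0],
    }
)

indexed = df.set_index("key")

# The column keeps its int32 dtype; whether the index does depends on the
# pandas version, so we only print it here instead of assuming a result.
print("column dtype:", df["key"].dtype)
print("index dtype: ", indexed.index.dtype)
```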
I agree, but that is a trade-off between keeping compatibility and improving the architecture. Not easy, but I think this is a relatively large change, so my preference is to keep backward compat until pandas 2.0.
> There might also be a somewhat "in-between" option: already implement it in the Index class, but still have a very shallow Int64Index subclass to ensure such isinstance checks keep working.
> We could then in theory even use the Int64Index class to store all integer bit-sizes, so that we could already stop converting to int64, while not breaking code that expects the Int64Index class (although the index.dtype would still change, but I would assume fewer people strictly rely on that being "int64").
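
To make the "shallow subclass" idea concrete, here is a toy sketch in plain Python and numpy. It is not pandas code: ToyIndex and ToyInt64Index are hypothetical stand-ins, and the dispatch in `__new__` is just one assumed way to keep isinstance checks working while the stored dtype is no longer forced to int64.

```python
import numpy as np


class ToyIndex:
    """Toy stand-in for a base index that can hold any numpy dtype."""

    def __new__(cls, data, dtype=None):
        values = np.asarray(data, dtype=dtype)
        # When called on the base class with signed-integer data, hand back
        # the shallow subclass so old isinstance checks still succeed.
        if cls is ToyIndex and values.dtype.kind == "i":
            cls = ToyInt64Index
        obj = object.__new__(cls)
        obj.values = values
        return obj

    @property
    def dtype(self):
        return self.values.dtype


class ToyInt64Index(ToyIndex):
    """Shallow subclass: adds no behavior, exists only for isinstance checks."""


idx = ToyIndex([1, 2, 3], dtype="int16")
print(isinstance(idx, ToyInt64Index), idx.dtype)
```

In this sketch the class check passes while idx.dtype reports int16, which is exactly the case where code relying on the dtype being int64 would notice the change.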
This would be easy to implement. Most people wouldn't be affected, but there will always be someone who does checks like e.g. issubclass(idx.dtype.type, np.int64), which would break. Do we accept that?
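
A small numpy-only sketch of why such a check is fragile. The helper names are made up for illustration; np.issubdtype is shown as one width-agnostic alternative, not as something this PR prescribes.

```python
import numpy as np


def is_int64_strict(dtype):
    """The kind of check that only accepts exactly int64."""
    return issubclass(np.dtype(dtype).type, np.int64)


def is_any_integer(dtype):
    """A width-agnostic check that accepts any signed or unsigned integer."""
    return np.issubdtype(np.dtype(dtype), np.integer)


for name in ["int64", "int32", "uint8", "float64"]:
    print(name, is_int64_strict(name), is_any_integer(name))
```

The strict check passes for int64 only, so user code written that way would stop matching as soon as an index holds int32 or int16 values.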
So if we change how Int64Index or the base Index works, it would have to be a conscious decision to accept some breakage. My own view is that I prefer keeping backwards compat, as I mentioned above. I wouldn't mind accelerating the release of pandas 2.0, so we can accept breaking changes to the index classes, but I suspect others may not agree with that.
BTW, I like the suggestion from @jreback to do this stepwise. That would also make working on this more "relaxed" (the exact public API wouldn't have to be finished in this one PR).
For example, the NumIndex class doesn't actually need to be in the public namespace right away. Could we agree that this PR is internal only (not available outside of pd.core)? That would make it available for e.g. merging and groupbys internally, which would be a win already. The public API could come in a later PR, either as:
- new Int32Index etc. classes,
- more flexible existing classes (Int64Index can hold int32 dtypes etc.),
- a new public NumIndex class, or
- merge this functionality into the base Index?
No matter which choice is made, this PR will be a step in the right direction.
I could open an issue where we can further discuss the choice of public API.