
Thanks for the comments. My views:

> I think the same issue arises when using NumIndex? (also in that case isinstance(idx, Int64Index) won't work anymore). But I assume that the idea is that pd.Index(..., dtype="int16") would still return Int64Index for now, and you would need to use pd.NumIndex explicitly for getting an index with int16 values?

Yes, that was my idea. That would keep this addition 100% backwards compatible. Users who want, e.g., int16 would use the explicit pd.NumIndex(..., dtype="int16").

> That indeed gives backwards compatibility, but also limits quite a bit the usage of this improvement. For example, also doing a set_index with an int32 column would still convert the data to int64.

I agree, but that is a trade-off between keeping compatibility and improving the architecture. Not easy, but I think this is a relatively large change, so my preference is to keep backward compat until pandas 2.0.
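To make the set_index case above concrete, here is a minimal check, not part of the PR. The column names `key` and `val` are made up for illustration. The resulting index dtype depends on the installed pandas version: where integer indexes are always backed by Int64Index it should come out as int64, and where the index can hold other integer dtypes it should stay int32.

```python
import numpy as np
import pandas as pd

# Illustrative frame; "key" and "val" are hypothetical column names.
df = pd.DataFrame(
    {
        "key": np.array([1, 2, 3], dtype="int32"),
        "val": [10.0, 20.0, 30.0],
    }
)

indexed = df.set_index("key")

column_dtype = df["key"].dtype      # dtype of the original column
index_dtype = indexed.index.dtype   # dtype after moving the column into the index

# Whether the two match depends on the pandas version in use.
print(pd.__version__, column_dtype, index_dtype)
```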

> There might also be a somewhat "in-between" option: already implement it in the Index class, but still have a very shallow Int64Index subclass to ensure such isinstance checks keep working.
> We could then in theory even use the Int64Index class to store all integer bit-sizes, so that we could already stop converting to int64, while not breaking code that expects the Int64Index class (although the index.dtype would still change, but I would assume less people to strictly rely on that being "int64").

This would be easy to implement. Most people wouldn't be affected, but there will always be someone who does checks such as issubclass(idx.dtype.type, np.int64), which would break. Do we accept that?
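A rough sketch of the "in-between" option and of the check that would break. The `Index` and `Int64Index` classes below are toy stand-ins written only for this illustration, not the pandas classes; only the NumPy calls are real.

```python
import numpy as np


class Index:
    """Toy base index that keeps whatever integer dtype it is given."""

    def __init__(self, data, dtype=None):
        self._values = np.asarray(data, dtype=dtype)

    @property
    def dtype(self):
        return self._values.dtype


class Int64Index(Index):
    """Shallow subclass: adds no behaviour, exists so isinstance checks keep passing."""


idx = Int64Index([1, 2, 3], dtype="int32")

# Class-based checks are unaffected by the narrower dtype.
class_check = isinstance(idx, Int64Index)

# A strict dtype check against np.int64 no longer passes ...
strict_dtype_check = issubclass(idx.dtype.type, np.int64)

# ... while a check against the abstract integer type still does.
generic_dtype_check = issubclass(idx.dtype.type, np.integer)

print(class_check, strict_dtype_check, generic_dtype_check)  # True False True
```

Code that tests against np.integer rather than np.int64 would keep working under this option; only the strict int64 checks are at risk.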

So if we change how Int64Index or the base Index works, it would have to be a conscious decision to accept some breakage. My own view is that I prefer keeping backwards compat, as I mentioned above. I wouldn't mind accelerating the release of pandas 2.0, so we can accept breaking changes to the index classes, but I suspect others may not agree with that.

BTW, I like the suggestion from @jreback to do this stepwise. That would also make working on this more "relaxed" (the exact public API wouldn't have to be finished in this one PR).

For example, the NumIndex class doesn't actually need to be in the public namespace right away. Could we agree that this PR is internal only (not available outside of pd.core)? That would make it available for e.g. merging and groupbys internally, which would be a win already. The public API could come in a later PR, either as:

new Int32Index etc. classes,
more flexible existing classes (Int64Index can hold int32 dtypes etc.),
a new public NumIndex class, or
merging this functionality into the base Index.

No matter which choice is made, this PR will be a step in the right direction.

I could open an issue where we discuss the choice of public API further.