BUG/API: Indexes on empty frames/series should be RangeIndex by topper-123 · Pull Request #49637 · pandas-dev/pandas

I've looked into #49572, and the fix seems reasonably simple; see the included code. In short: currently, if the user hasn't supplied index or columns values, the index/columns is a RangeIndex when the data has length > 0, but an Index[object] when the length is 0. After this PR, a RangeIndex will be used in both cases. This will simplify type inference, testing, etc.
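To make the inconsistency concrete, here is a minimal sketch for inspecting the default index and columns. The helper name `describe_axis` is hypothetical and only for illustration. The output for the empty cases depends on the installed pandas version (Index[object] before this change, RangeIndex after it), so I've left it unannotated.

```python
import pandas as pd


def describe_axis(axis: pd.Index) -> str:
    """Return the class name and dtype of an index/columns object.

    Hypothetical helper, for illustration only.
    """
    return f"{type(axis).__name__}[{axis.dtype}]"


# No index/columns supplied, length > 0: both axes default to RangeIndex.
non_empty = pd.DataFrame([[1, 2], [3, 4]])
print("non-empty index:  ", describe_axis(non_empty.index))
print("non-empty columns:", describe_axis(non_empty.columns))

# No index/columns supplied, length 0: the result depends on the pandas
# version (Index[object] before this change, RangeIndex after it).
empty_frame = pd.DataFrame()
empty_series = pd.Series([], dtype="float64")
print("empty frame index:  ", describe_axis(empty_frame.index))
print("empty frame columns:", describe_axis(empty_frame.columns))
print("empty series index: ", describe_axis(empty_series.index))
```

With this PR applied, all five lines should report a RangeIndex.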

However, this fix requires changes to a lot of tests (>500), of which I've fixed about 100 (not included in this version of the PR), so it's quite a lot of work to get this fixed up. So far, fixing the tests seems to be only a matter of ensuring that various empty DataFrames/Series have a RangeIndex rather than an Index[object], so it's relatively simple but tedious work.

I didn't get much response to #49572 from the core devs, so before I spend more time on this rather big item, could you guys chime in on whether you agree that this is a good thing to do? IMO having the same index dtypes for Series/frames of length 0 and >0 will conceptually simplify pandas and reduce the number of surprises that users encounter.

As mentioned, I've included the code that fixes the issue in this PR. ATM this PR fails because I haven't fixed the tests, but I will get them fixed if I get a positive response. You can respond either here or in #49572; I'll of course read both.