PERF: Improve perf initializing DataFrame with a range by topper-123 · Pull Request #30171 · pandas-dev/pandas
It seems like some care is needed here with respect to dtypes, specifically when the range contains values that only fit in uint64, or values that can only be represented as Python integers.
For example, the following works on master:
In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
Out[2]:
                     0
0  9223372036854775808
1  9223372036854775809
2  9223372036854775810
3  9223372036854775811

In [3]: _.dtypes
Out[3]:
0    uint64
dtype: object

In [4]: pd.DataFrame(range(2**73, 2**73 + 4))
Out[4]:
                        0
0  9444732965739290427392
1  9444732965739290427393
2  9444732965739290427394
3  9444732965739290427395

In [5]: _.dtypes
Out[5]:
0    object
dtype: object
But both fail with the changes in this PR:
In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
OverflowError: Python int too large to convert to C long

In [3]: pd.DataFrame(range(2**73, 2**73 + 4))
OverflowError: Python int too large to convert to C long
Admittedly, this is a bit of a corner case. The issue also doesn't appear to be limited to this PR, since the Series equivalent of the above already fails on master.