PERF: Improve perf initalizing DataFrame with a range by topper-123 · Pull Request #30171 · pandas-dev/pandas

It seems like some care is needed here with respect to dtypes, specifically if the range contains values only supported by uint64, or values only supported by Python integers.

For example, the following works on master:

In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
Out[2]:
                     0
0  9223372036854775808
1  9223372036854775809
2  9223372036854775810
3  9223372036854775811

In [3]: _.dtypes
Out[3]:
0    uint64
dtype: object

In [4]: pd.DataFrame(range(2**73, 2**73 + 4))
Out[4]:
                        0
0  9444732965739290427392
1  9444732965739290427393
2  9444732965739290427394
3  9444732965739290427395

In [5]: _.dtypes
Out[5]:
0    object
dtype: object

But both fail with the changes in this PR:

In [2]: pd.DataFrame(range(2**63, 2**63 + 4))

OverflowError: Python int too large to convert to C long

In [3]: pd.DataFrame(range(2**73, 2**73 + 4))

OverflowError: Python int too large to convert to C long

Admittedly, this is a bit of a corner case. It also looks like the issue isn't limited to the PR, as the Series equivalent of the above fails on master.
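For what it's worth, here is a rough sketch of the kind of bounds check that could guard the fast path. `range_to_array` is a hypothetical helper written only for illustration; it is not pandas code and not what this PR does. It assumes the fast path is `np.arange` with an explicit int64 dtype, and that falling back to a list conversion is acceptable for the out-of-bounds cases.

```python
import numpy as np

# Bounds as plain Python ints, so comparisons never overflow.
INT64_MIN = int(np.iinfo(np.int64).min)
INT64_MAX = int(np.iinfo(np.int64).max)
UINT64_MAX = int(np.iinfo(np.uint64).max)


def range_to_array(rng: range) -> np.ndarray:
    """Hypothetical helper: convert a range to an ndarray without overflowing."""
    # Fast path: every element lies between start and stop, so if both
    # endpoints fit in int64, np.arange is safe.
    if INT64_MIN <= rng.start <= INT64_MAX and INT64_MIN <= rng.stop <= INT64_MAX:
        return np.arange(rng.start, rng.stop, rng.step, dtype=np.int64)

    # Slow path: materialize the values and pick the narrowest dtype that fits.
    values = list(rng)
    if values and min(values) >= 0 and max(values) <= UINT64_MAX:
        return np.array(values, dtype=np.uint64)
    return np.array(values, dtype=object)


print(range_to_array(range(3)).dtype)
print(range_to_array(range(2**63, 2**63 + 4)).dtype)
print(range_to_array(range(2**73, 2**73 + 4)).dtype)
```

The slow path gives up the performance gain, but only for ranges that the fast path cannot represent anyway, so the common case would stay fast.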