BUG: Nullable integer dtype leads to a fragmentation PerformanceWarning · Issue #44098 · pandas-dev/pandas

Reproducible Example

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 100)), dtype='Int32')
df.reset_index()

Issue Description

When using the new nullable integer dtype, pandas seems to generate one IntegerArray per column.
This inevitably leads to a fragmentation PerformanceWarning once the frame has more than 100 columns.

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 100)), dtype='Int32')
df.reset_index()  # Warns

>>> PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  df.reset_index()

Of course, this is due to a high number of fragments.
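The fragmentation can be confirmed by counting the blocks in the internal BlockManager. Note that `_mgr` is a private pandas attribute, so this is only a diagnostic, not stable API:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 100)), dtype="Int32")

# Each nullable "Int32" column lives in its own ExtensionBlock, so the
# internal (private) BlockManager holds one block per column.
print(df._mgr.nblocks)  # 100 blocks, one per column

# For comparison, the plain NumPy dtype consolidates into a single block.
dense = pd.DataFrame(np.random.randint(0, 100, size=(100, 100)), dtype="int32")
print(dense._mgr.nblocks)  # 1 block
```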

Expected Behavior

Maybe we could have a proper nullable integer matrix representation, i.e. a single 2-D block rather than one array per column.
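In the meantime, a possible workaround (my own suggestion, not a fix) is to silence the warning for the affected operation. As far as I can tell, the `frame.copy()` advice in the warning message does not help here, because extension-array blocks cannot be consolidated:

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 100)), dtype="Int32")

# Suppress only the fragmentation warning, scoped to this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    out = df.reset_index()
```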

Installed Versions

INSTALLED VERSIONS
------------------
commit : 945c9ed
python : 3.9.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18363
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.3.4
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 56.0.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 4.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.22
tables : None
tabulate : None
xarray : 0.19.0
xlrd : None
xlwt : None
numba : None