BUG: Integer Capacity Higher Than Necessary in Mappings from Pandas to SQLAlchemy Types · Issue #35076 · pandas-dev/pandas

Problem description

Currently, when mapping a pandas dtype to a SQLAlchemy type, the code at https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1066-L1069 reads:

    elif col_type == "integer":
        if col.dtype == "int32":
            return Integer
        else:
            return BigInteger

This means integers with a capacity of less than 32 bits are written to the database as if they were 64-bit. One way to end up with such smaller integers is pd.to_numeric with downcast='integer'.
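For illustration, a minimal sketch of how such a column arises (the data here is made up):

    import pandas as pd

    df = pd.DataFrame({"small": [1, 2, 3]})  # made-up data
    df["small"] = pd.to_numeric(df["small"], downcast="integer")
    print(df["small"].dtype)  # int8 -- yet to_sql would map it to BigInteger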

Since there is already a check for col_type == "integer", I think switching the col.dtype check as follows could be a possible fix:

    elif col_type == "integer":
        if col.dtype == "int64":
            return BigInteger
        else:
            return Integer
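As an aside, SQLAlchemy also provides a SmallInteger type, so the mapping could in principle be even tighter. A hypothetical helper to sketch the idea (the function name and dtype handling are my own, not pandas code):

    from sqlalchemy.types import BigInteger, Integer, SmallInteger

    def tightest_int_type(dtype_name):
        # Hypothetical sketch: pick the smallest SQLAlchemy integer type
        # that can hold the given pandas integer dtype.
        if dtype_name in ("int8", "int16", "uint8"):
            return SmallInteger  # fits in a signed 16-bit column
        if dtype_name in ("int32", "uint16"):
            return Integer       # fits in a signed 32-bit column
        return BigInteger        # int64, uint32 (note: uint64 can overflow)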

But I'm not sure how to get started with an official PR or if this is a sane thing to do.

For context (this part is not an issue for Pandas, just explaining my interest in this issue), I discovered this when using the df.to_sql method to persist datasets to a Postgres database. From there, I use Postgraphile to auto-generate a GraphQL endpoint. I found that the BigInteger type ends up resolving as a string, because JavaScript can't represent such large numbers safely. This would be fine if the source data warranted the high capacity, but for me it often doesn't.

Expected Output

Column types written to a new table in the database should more tightly match the corresponding pandas dtypes.

Workarounds

If using df.to_sql, one could set the dtype argument, but this becomes unwieldy when the DataFrame has many columns.
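For example (the connection string, table, and column names here are illustrative):

    import pandas as pd
    from sqlalchemy import create_engine
    from sqlalchemy.types import Integer

    engine = create_engine("postgresql://localhost/mydb")  # illustrative URL
    df = pd.DataFrame({"some_int_col": [1, 2, 3]})
    # Override the inferred type per column -- tedious with many columns.
    df.to_sql("my_table", engine, dtype={"some_int_col": Integer})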

For now, I use this helper function to downcast integer columns to their lowest capacity and then back up to 32-bit, so they are persisted to the database with the desired column type:

    import pandas as pd

    def cast_to_int32(df):
        # Downcast each integer column to its lowest possible representation.
        for col in df.select_dtypes(include=['int']).columns.values:
            df[col] = pd.to_numeric(df[col], downcast='integer')
        # Upcast 8/16-bit columns back to 32-bit, since that is what gets
        # persisted with the desired column type.
        for col in df.select_dtypes(include=['int8', 'int16']).columns.values:
            df[col] = df[col].astype('int32')
        return df
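Usage then looks something like this, reusing the illustrative engine and table name from above:

    df = cast_to_int32(df)
    df.to_sql("my_table", engine, if_exists="replace", index=False)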

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-39-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.17
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None