BUG: Integer Capacity Higher Than Necessary in Mappings from Pandas to SQLAlchemy Types · Issue #35076 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Problem description
Currently, the code that maps a Pandas data type to a SQLAlchemy type (https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L1066-L1069) reads:
elif col_type == "integer": if col.dtype == "int32": return Integer else: return BigInteger
This means integers with a capacity of less than 32 bits are written to the database as if they were 64-bit. One common way to end up with smaller integers is `pd.to_numeric` with `downcast='integer'`.
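To make this concrete, here is a minimal reproduction sketch; the in-memory SQLite engine and the `demo` table name are just illustrative choices:

```python
import pandas as pd
import sqlalchemy

df = pd.DataFrame({"small": [1, 2, 3]})
df["small"] = pd.to_numeric(df["small"], downcast="integer")
print(df["small"].dtype)  # int8 -- far below 32-bit capacity

# Write to an in-memory SQLite database and inspect the declared column type.
engine = sqlalchemy.create_engine("sqlite://")
df.to_sql("demo", engine, index=False)
print(sqlalchemy.inspect(engine).get_columns("demo"))  # "small" is declared BIGINT
```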
Since there is already a check for `col_type == "integer"`, I think switching the `col.dtype` check to this could be a possible fix:
elif col_type == "integer": if col.dtype == "int64": return BigInteger else: return Integer
But I'm not sure how to get started with an official PR or if this is a sane thing to do.
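To show what that branch would resolve to for each integer width, here is a small standalone sketch; `proposed_integer_type` is a hypothetical helper mirroring the branch above, not actual pandas code:

```python
import pandas as pd
from sqlalchemy.types import BigInteger, Integer

def proposed_integer_type(col):
    # Hypothetical helper mirroring the proposed branch above.
    return BigInteger if col.dtype == "int64" else Integer

for dtype in ["int8", "int16", "int32", "int64"]:
    col = pd.Series([1], dtype=dtype)
    print(dtype, "->", proposed_integer_type(col).__name__)
# int8 -> Integer, int16 -> Integer, int32 -> Integer, int64 -> BigInteger
```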
For context (this part is not an issue for Pandas, just explaining my interest in this issue), I discovered this when using the `df.to_sql` method to persist datasets to a Postgres database. From there, I use Postgraphile to auto-generate a GraphQL endpoint. I found that the `BigInteger` type ends up resolving as a string, because JavaScript can't safely represent such large numbers. This would be fine if the source data warranted the extra capacity, but for me it often doesn't.
Expected Output
The column types written to a new table in the database should more tightly match their data types in Pandas.
Workarounds
If using `df.to_sql`, one could set the `dtype` argument, but this becomes unwieldy when the DataFrame has many columns.
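For example (again with an illustrative in-memory SQLite engine):

```python
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite://")
df = pd.DataFrame({"small": pd.Series([1, 2, 3], dtype="int8")})

# Pin the SQL type per column; fine for one column, tedious for dozens.
df.to_sql("demo", engine, index=False,
          dtype={"small": sqlalchemy.types.Integer})
print(sqlalchemy.inspect(engine).get_columns("demo"))  # "small" is now INTEGER
```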
For now, I have this helper function to downcast integer columns to their smallest possible representation, then back up to 32-bit, so they persist to the database with the desired column type:
```python
import pandas as pd

def cast_to_int32(df):
    # Downcast each integer column to its smallest possible representation.
    for col in df.select_dtypes(include=['int']).columns.values:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    # Upcast back to 32-bit (since that's what gets persisted correctly).
    for col in df.select_dtypes(include=['int8', 'int16']).columns.values:
        df[col] = df[col].astype('int32')
    return df
```
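Usage would then look something like this (SQLite engine again just for illustration):

```python
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite://")
df = pd.DataFrame({"a": [1, 2, 3], "b": [100000, 200000, 300000]})

# Both columns end up as int32, so to_sql emits INTEGER instead of BIGINT.
cast_to_int32(df).to_sql("demo", engine, index=False)
```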
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-39-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.17
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None