BUG: unwanted type conversion when partial reassigning · Issue #50467 · pandas-dev/pandas

Pandas version checks

Reproducible Example

from typing import Union

import numpy as np
import pandas as pd


def get_sample_df():
    return pd.DataFrame(
        [["1", np.nan], ["2", np.nan], ["3", np.nan], ["4", np.nan]],
        dtype="object",  # Reproduce the dataframe I'm working with.
    )


def try_cast_int(s: str) -> Union[int, str]:
    # some complex preprocessing here
    # I removed those to make my code shorter
    try:
        return int(s)
    except ValueError:
        return s


df = get_sample_df()

# Due to business reasons, apply try_cast_int only to the top half of df.
df.iloc[:2, :] = df.iloc[:2, :].applymap(try_cast_int)

Issue Description

Hello pandas team!

I found a bug(?) today. I expected that, after the last line of the "Reproducible Example", df would look like the dataframe below.

0 1
0 1 np.NaN
1 2 np.NaN
2 3 np.NaN
3 4 np.NaN

But what I actually got is shown below. I have no idea why I got floats such as 1.0 and 2.0 instead of ints.

0 1
0 1.0 np.NaN
1 2.0 np.NaN
2 3 np.NaN
3 4 np.NaN

One interesting fact is that df.iloc[:2, :].applymap(try_cast_int), evaluated before the reassignment, returns a dataframe like the one below.

0 1
0 1 np.NaN
1 2 np.NaN

It seems that the integers in the first column are converted into float values during the partial reassignment.
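To make the observation above easier to inspect, here is a minimal sketch that looks at the converted top half on its own. It shows one plausible contributing factor that this report does not confirm: the converted frame holds an integer column next to an all-NaN column, so viewing it as a single NumPy array yields floats. The `elementwise` shim is an assumption added for compatibility with newer pandas releases, where `DataFrame.map` replaces `DataFrame.applymap`.

```python
import numpy as np
import pandas as pd

# Compatibility shim (an assumption, not part of the report): newer pandas
# releases provide DataFrame.map in place of DataFrame.applymap.
elementwise = getattr(pd.DataFrame, "map", None) or pd.DataFrame.applymap


def try_cast_int(s):
    # Same helper as in the report, without the omitted preprocessing.
    try:
        return int(s)
    except ValueError:
        return s


df = pd.DataFrame(
    [["1", np.nan], ["2", np.nan], ["3", np.nan], ["4", np.nan]],
    dtype="object",
)

# Convert only the top half, without assigning it back yet.
converted = elementwise(df.iloc[:2, :], try_cast_int)

# Per-column dtypes of the converted half.
print(converted.dtypes)

# The same data viewed as one 2D NumPy array with a single common dtype.
as_array = converted.to_numpy()
print(as_array.dtype)
print(type(as_array[0, 0]))
```

If the array dtype printed here is a float type, the integers are already floats as soon as the converted half is treated as a single block of values. Whether the assignment in a given pandas version takes that path is not established by this report.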

My questions are

Expected Behavior

After the last line of the "Reproducible Example", df should look like the dataframe below.

0 1
0 1 np.NaN
1 2 np.NaN
2 3 np.NaN
3 4 np.NaN

I guess that "apply try_cast_int to the top half of df" is the cause of this issue. Pandas is not designed for this kind of task, right?


I found that I can avoid this behavior and get what I want by following the steps below.

  1. Split the df into two before applying try_cast_int
  2. Apply try_cast_int to the top half of the df
  3. Do no processing on the bottom half of the df
  4. Concatenate them into one

example code

df = get_sample_df()

df_top = df.iloc[:2, :]
df_top = df_top.applymap(try_cast_int)

df_bottom = df.iloc[2:, :]

df_full = pd.concat([df_top, df_bottom], axis=0)
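As a quick check of the workaround, the sketch below rebuilds `df_full` the same way and lists the Python type of every value in the first column. The `elementwise` shim and the use of `numbers.Integral` are assumptions added for illustration; they are not part of the original report.

```python
import numbers

import numpy as np
import pandas as pd

# Compatibility shim (an assumption): newer pandas releases provide
# DataFrame.map in place of DataFrame.applymap.
elementwise = getattr(pd.DataFrame, "map", None) or pd.DataFrame.applymap


def try_cast_int(s):
    # Same helper as in the report, without the omitted preprocessing.
    try:
        return int(s)
    except ValueError:
        return s


df = pd.DataFrame(
    [["1", np.nan], ["2", np.nan], ["3", np.nan], ["4", np.nan]],
    dtype="object",
)

# Split, convert only the top half, then concatenate.
df_top = elementwise(df.iloc[:2, :], try_cast_int)
df_bottom = df.iloc[2:, :]
df_full = pd.concat([df_top, df_bottom], axis=0)

# Type name of each value in the first column (column label 0).
first_column_types = [type(v).__name__ for v in df_full[0]]
print(first_column_types)

# True when the converted rows hold integers rather than floats.
top_is_integral = all(
    isinstance(v, numbers.Integral) for v in df_full[0].iloc[:2]
)
print(top_is_integral)
```

Checking against `numbers.Integral` accepts both Python `int` and NumPy integer scalars, so the check does not depend on which of the two pandas stores.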

Best of luck.

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.133+
Version : #1 SMP Fri Aug 26 08:44:51 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.6
pytz : 2022.6
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : 0.29.32
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.6
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.9.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.9.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : None
fsspec : 2022.11.0
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.45
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.12.0
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.56.4