BUG: Collection of inconsistencies in .astype conversions · Issue #37626 · pandas-dev/pandas

I have a use case where (automatic) casting between the following pandas dtypes is necessary:

bool, boolean, int64, Int64, float64, object and string.

Note that boolean, Int64 and string are the new pandas 1.0 nullable dtypes.

The default approach would be series.astype(target_dtype), with target_dtype given as one of the dtype strings above. This works (barring issues with missing values), except for the inconsistencies below:

In [1]: import pandas as pd
   ...: import numpy as np

pd.NA to float:

Summary: Calling float(pd.NA) raises a TypeError (as does float(None)). While np.array([None, "1"]).astype("float") works (and is what gets called here), the same call with pd.NA in place of None fails.

In [2]: pd.Series(["1", "2", pd.NA], dtype="object").astype("float")
TypeError: float() argument must be a string or a number, not 'NAType'

In [3]: pd.Series(["1", "2", pd.NA], dtype="string").astype("float")
TypeError: float() argument must be a string or a number, not 'NAType'

In [4]: pd.Series(["1", "2", pd.NA], dtype="string").astype("category").astype("float")  # magic workaround
Out[4]:
0    1.0
1    2.0
2    NaN
dtype: float64

Edit: Fixed with #37974.
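For versions without that fix, an explicit alternative to the category round-trip is to map missing values to NaN by hand. This is only a minimal sketch: the helper name `to_float` is hypothetical and not part of pandas, and the element-wise loop trades speed for predictability.

```python
import numpy as np
import pandas as pd


def to_float(series: pd.Series) -> pd.Series:
    """Cast element by element, mapping None, NaN and pd.NA to NaN."""
    # pd.isna() recognises all three missing markers on scalars.
    values = [np.nan if pd.isna(v) else float(v) for v in series]
    return pd.Series(values, index=series.index, name=series.name, dtype="float64")


# Hypothetical usage with the object column from In [2].
converted = to_float(pd.Series(["1", "2", pd.NA], dtype="object"))
print(converted)
```

The same helper should accept a "string" column as well, since iterating it yields plain strings and the missing marker.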

object / string to Int64 (nullable)

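Summary: object columns cannot be cast to Int64 directly. Casting through an intermediate dtype (string, int or float) works in some cases, as the examples below show.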

In [5]: pd.Series(["1", "2", "3"]).astype("Int64")
TypeError: object cannot be converted to an IntegerDtype

In [5]: pd.Series(["1", "2", "3"], dtype="string").astype("Int64")
Out[5]:
0    1
1    2
2    3
dtype: Int64

In [6]: pd.Series(["1", "2", "3"]).astype("int").astype("Int64")
Out[6]:
0    1
1    2
2    3
dtype: Int64

In [7]: pd.Series(["1", "2", None]).astype("int").astype("Int64")  # int columns cannot contain missing values
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

In [8]: pd.Series(["1", "2", None]).astype("float").astype("Int64")  # magic workaround
Out[8]:
0       1
1       2
2    <NA>
dtype: Int64

In [9]: pd.Series(["1", "2", pd.NA], dtype="object").astype("float").astype("Int64")  # see pd.NA -> float
TypeError: float() argument must be a string or a number, not 'NAType'

In [10]: pd.Series(["1", "2", pd.NA], dtype="object").astype("string").astype("Int64")
Out[10]:
0       1
1       2
2    <NA>
dtype: Int64

Related: #25472 (comment)
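To avoid depending on which intermediate dtype happens to work, the conversion can also be spelled out element by element. The following is a sketch under assumptions: `to_nullable_int` is a hypothetical helper name, and it assumes every non-missing value is something `int()` accepts (so "1" works, while "1.5" would raise).

```python
import pandas as pd


def to_nullable_int(series: pd.Series) -> pd.Series:
    """Cast an object/string column to Int64, keeping missing values as pd.NA."""
    # Convert each present value with int(); pass missing values through as pd.NA.
    values = [pd.NA if pd.isna(v) else int(v) for v in series]
    return pd.Series(values, index=series.index, name=series.name, dtype="Int64")


# Hypothetical usage with the object column from In [9].
result = to_nullable_int(pd.Series(["1", "2", pd.NA], dtype="object"))
print(result)
```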

string / object to bool or boolean

Summary: Casting string or object columns to bool or boolean behaves strangely. I am not sure what the expected behaviour for string / object to bool / boolean should be. It would be nice to have consistent behaviour.

In [11]: pd.Series(["True", "False", "bogus"], dtype="string").astype("bool")  # everything is True
Out[11]:
0    True
1    True
2    True
dtype: bool

In [12]: pd.Series(["True", "False", "bogus"], dtype="object").astype("bool")
Out[12]:
0    True
1    True
2    True
dtype: bool

In [13]: pd.Series(["True", "False", "bogus"], dtype="string").astype("boolean")
TypeError: data type not understood

In [14]: pd.Series(["True", "False", "bogus"], dtype="object").astype("boolean")
TypeError: Need to pass bool-like values
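Until the expected behaviour is settled, one way to get a predictable result is to parse the strings with an explicit mapping instead of relying on astype. This is a sketch of one possible convention, not a proposal for what pandas should do: the name `parse_bool_strings`, the accepted spellings in `BOOL_STRINGS`, and the choice to turn unrecognised strings into pd.NA (rather than raising) are all assumptions.

```python
import pandas as pd

# Assumed spellings; extend or replace as the use case requires.
BOOL_STRINGS = {"True": True, "False": False}


def parse_bool_strings(series: pd.Series) -> pd.Series:
    """Map "True"/"False" to booleans; missing or unrecognised values become pd.NA."""
    values = [
        pd.NA if pd.isna(v) else BOOL_STRINGS.get(v, pd.NA)
        for v in series
    ]
    return pd.Series(values, index=series.index, name=series.name, dtype="boolean")


# Hypothetical usage with the column from In [14].
parsed = parse_bool_strings(pd.Series(["True", "False", "bogus"], dtype="object"))
print(parsed)
```

With this convention "bogus" ends up as a missing value instead of True, which at least makes the unparseable entries visible.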

int (non-nullable) to boolean

Summary: Casting from (non-nullable) int64 to (nullable) boolean raises.

In [15]: pd.Series([-1, 0, 1], dtype="int").astype("boolean")
TypeError: Need to pass bool-like values

In [16]: pd.Series([-1, 0, 1], dtype="Int64").astype("boolean")
Out[16]:
0     True
1    False
2     True
dtype: boolean

In [17]: pd.Series([-1, 0, 1], dtype="int").astype("bool").astype("boolean")
Out[17]:
0     True
1    False
2     True
dtype: boolean

Related: #37614

While separate issues exist for the first and last report, I thought it would be nice to have a collection of these in one place, which I did not find.
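A collection like this can also be regenerated mechanically for a given pandas version by trying every source/target pair. The sketch below is an illustration only: the `SAMPLES` values and the `cast_matrix` name are hypothetical, the samples contain no missing values, and "ok" only means that no exception was raised, not that the result is sensible (compare In [11]).

```python
import pandas as pd

# One small, valid sample per dtype discussed above (no missing values).
SAMPLES = {
    "bool": [True, False],
    "boolean": [True, False],
    "int64": [1, 0],
    "Int64": [1, 0],
    "float64": [1.0, 0.0],
    "object": ["1", "0"],
    "string": ["1", "0"],
}


def cast_matrix() -> pd.DataFrame:
    """Return a source-by-target table of "ok" or the raised exception's name."""
    table = {}
    for source, values in SAMPLES.items():
        series = pd.Series(values, dtype=source)
        row = {}
        for target in SAMPLES:
            try:
                series.astype(target)
                row[target] = "ok"
            except Exception as exc:  # record the failure type instead of stopping
                row[target] = type(exc).__name__
        table[source] = row
    # Rows are source dtypes, columns are target dtypes.
    return pd.DataFrame.from_dict(table, orient="index")


matrix = cast_matrix()
print(matrix)
```

The printed table depends on the installed pandas version, so it is best read alongside the output of pd.show_versions().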

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 15f843ab102d7a0cd7f1c7870dfec72d0e28d252
python           : 3.8.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-52-generic
Version          : #57~18.04.1-Ubuntu SMP Thu Oct 15 14:04:49 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.0.dev0+1059.g15f843ab1.dirty
numpy            : 1.18.5
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 49.6.0.post20201009
Cython           : 0.29.21
pytest           : 5.4.3
hypothesis       : 5.41.1
sphinx           : 3.1.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.7
lxml.etree       : 4.6.1
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.8.4
fastparquet      : 0.4.1
gcsfs            : 0.7.1
matplotlib       : 3.2.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : 2.0.0
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.5.3
sqlalchemy       : 1.3.20
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.16.1
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.51.2