BUG: Collection of inconsistencies in .astype conversions · Issue #37626 · pandas-dev/pandas
I have a use case where (automatic) casting between the following pandas dtypes is necessary: `bool`, `boolean`, `int64`, `Int64`, `float64`, `object` and `string`. Note that `boolean`, `Int64` and `string` are the new pandas 1.0 nullable dtypes.

The default approach for this would be `series.astype(target_dtype)`, with `target_dtype` given as one of the above dtype names as a string. This works (as long as missing values cause no problems), except for the inconsistencies below:
In [1]: import pandas as pd
   ...: import numpy as np
`pd.NA` to `float`:

Summary: Calling `float(pd.NA)` raises a `TypeError` (as does `float(None)`). While `np.array([None, "1"]).astype("float")` works (and gets called here), the same call with `pd.NA` fails.
In [2]: pd.Series(["1", "2", pd.NA], dtype="object").astype("float")
TypeError: float() argument must be a string or a number, not 'NAType'

In [3]: pd.Series(["1", "2", pd.NA], dtype="string").astype("float")
TypeError: float() argument must be a string or a number, not 'NAType'

In [4]: pd.Series(["1", "2", pd.NA], dtype="string").astype("category").astype("float")  # magic workaround
Out[4]:
0    1.0
1    2.0
2    NaN
dtype: float64
Edit: Fixed with #37974.
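On pandas versions that predate the fix, one way to sidestep the error is to swap every missing marker for `np.nan` before casting. The following is only a minimal sketch: `to_float` is a hypothetical helper name chosen for illustration, not a pandas function, and it assumes the non-missing values are strings that `float()` can parse.

```python
import numpy as np
import pandas as pd


def to_float(series: pd.Series) -> pd.Series:
    """Cast a Series of numeric strings to float64, mapping missing values to NaN.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Replace every missing marker (None, pd.NA, NaN) with np.nan, which
    # float() accepts, and leave all other values untouched.
    cleaned = [np.nan if pd.isna(value) else value for value in series]
    # Rebuild an object Series so the final cast goes through plain NumPy.
    return pd.Series(cleaned, index=series.index, dtype="object").astype("float64")


result = to_float(pd.Series(["1", "2", pd.NA], dtype="object"))
print(result)
```

The same helper should also accept a `string`-dtype Series, since iterating over it yields plain `str` values and `pd.NA`.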
`object` / `string` to `Int64` (nullable)

Summary: `object` columns cannot be cast to `Int64` directly. Casting

- `object` -> `string` -> `Int64` works.
- `object` -> `float` -> `Int64` works if the data does not contain `pd.NA` (see above).
- `object` -> `int64` -> `Int64` works if the data does not contain missing values.
In [5]: pd.Series(["1", "2", "3"]).astype("Int64")
TypeError: object cannot be converted to an IntegerDtype

In [5]: pd.Series(["1", "2", "3"], dtype="string").astype("Int64")
Out[5]:
0    1
1    2
2    3
dtype: Int64

In [6]: pd.Series(["1", "2", "3"]).astype("int").astype("Int64")
Out[6]:
0    1
1    2
2    3
dtype: Int64

In [7]: pd.Series(["1", "2", None]).astype("int").astype("Int64")  # int columns cannot contain missings
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

In [8]: pd.Series(["1", "2", None]).astype("float").astype("Int64")  # magic workaround
Out[8]:
0       1
1       2
2    <NA>
dtype: Int64

In [9]: pd.Series(["1", "2", pd.NA], dtype="object").astype("float").astype("Int64")  # see pd.NA -> float
TypeError: float() argument must be a string or a number, not 'NAType'

In [10]: pd.Series(["1", "2", pd.NA], dtype="object").astype("string").astype("Int64")
Out[10]:
0       1
1       2
2    <NA>
dtype: Int64
Related: #25472 (comment)
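Until the direct cast behaves consistently, the `object` -> `string` -> `Int64` chain reported as working above can be wrapped in a small function. This is a sketch under that assumption; `to_nullable_int` is a hypothetical name, and it expects every non-missing value to be an integer-like string.

```python
import pandas as pd


def to_nullable_int(series: pd.Series) -> pd.Series:
    """Cast integer-like strings to Int64 by routing through the string dtype.

    Hypothetical helper; it relies on the object -> string -> Int64 chain
    described in this report.
    """
    # The string dtype stores missing values as pd.NA, and the cast from
    # string to Int64 carries them over instead of calling int() on them.
    return series.astype("string").astype("Int64")


result = to_nullable_int(pd.Series(["1", "2", pd.NA], dtype="object"))
print(result)
```

Routing through `string` rather than `float` avoids the `pd.NA` problem described in the first section.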
`string` / `object` to `bool` or `boolean`

Summary: Casting `string` or `object` columns to `bool` or `boolean` behaves strangely. I am not sure what the expected behaviour for `string` / `object` to `bool` / `boolean` should be, but it would be nice to have consistent behaviour.

- `string` / `object` -> `bool` works if there are no missing values, but yields only `True`.
- `string` / `object` -> `boolean` raises.
In [11]: pd.Series(["True", "False", "bogus"], dtype="string").astype("bool")  # everything is True
Out[11]:
0    True
1    True
2    True
dtype: bool

In [12]: pd.Series(["True", "False", "bogus"], dtype="object").astype("bool")
Out[12]:
0    True
1    True
2    True
dtype: bool

In [13]: pd.Series(["True", "False", "bogus"], dtype="string").astype("boolean")
TypeError: data type not understood

In [14]: pd.Series(["True", "False", "bogus"], dtype="object").astype("boolean")
TypeError: Need to pass bool-like values
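The all-`True` result is consistent with Python's truthiness rules, where any non-empty string is truthy (`bool("False")` is `True`). If the intent is to parse the text, an explicit mapping is one option. The sketch below is not a proposal for what `astype` should do; `BOOL_STRINGS` and `parse_boolean` are hypothetical names, and the choice of which strings count as booleans is an assumption.

```python
import pandas as pd

# Assumed vocabulary: which strings count as True/False is a choice made
# for this sketch, not something pandas defines.
BOOL_STRINGS = {"True": True, "False": False}


def parse_boolean(series: pd.Series) -> pd.Series:
    """Map "True"/"False" strings to nullable booleans; anything else becomes <NA>."""
    # Series.map with a dict yields NaN for values that are not keys of the
    # mapping, so unknown strings and missing values both end up missing.
    mapped = series.astype("object").map(BOOL_STRINGS)
    return mapped.astype("boolean")


result = parse_boolean(pd.Series(["True", "False", "bogus"], dtype="string"))
print(result)
```

Turning unrecognised strings into missing values instead of raising is a design choice of this sketch; a stricter variant could check `mapped.isna()` against `series.isna()` and raise on any difference.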
`int` (non-nullable) to `boolean`

Summary: Casting from (non-nullable) `int64` to (nullable) `boolean` raises.

- `int64` -> `Int64` -> `boolean` works.
- `int64` -> `bool` -> `boolean` works as long as there are no missing values.
In [15]: pd.Series([-1, 0, 1], dtype="int").astype("boolean")
TypeError: Need to pass bool-like values

In [16]: pd.Series([-1, 0, 1], dtype="Int64").astype("boolean")
Out[16]:
0     True
1    False
2     True
dtype: boolean

In [17]: pd.Series([-1, 0, 1], dtype="int").astype("bool").astype("boolean")
Out[17]:
0     True
1    False
2     True
dtype: boolean
Related: #37614
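The `int64` -> `Int64` -> `boolean` chain reported as working above can be packaged as follows. This is a minimal sketch; `int_to_boolean` is a hypothetical helper name.

```python
import pandas as pd


def int_to_boolean(series: pd.Series) -> pd.Series:
    """Cast an integer Series to nullable boolean via Int64 (hypothetical helper)."""
    # Going through Int64 first uses the cast that already works; zero maps
    # to False and every other integer to True.
    return series.astype("Int64").astype("boolean")


result = int_to_boolean(pd.Series([-1, 0, 1], dtype="int64"))
print(result)
```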
While separate issues exist for the first and last report, I thought it would be useful to collect all of these in one place, and I could not find an existing collection.
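To see which of these casts succeed or fail on a given pandas version, a small probe along the following lines may help. It is only a sketch: `DTYPES`, `SAMPLES` and `probe_casts` are hypothetical names, the sample values are assumptions, and the results depend on both the pandas version and the sample data.

```python
import pandas as pd

DTYPES = ["bool", "boolean", "int64", "Int64", "float64", "object", "string"]

# One small sample per source dtype; the nullable and object samples include
# a missing value so that the probe exercises the problematic paths.
SAMPLES = {
    "bool": [True, False, True],
    "boolean": [True, False, pd.NA],
    "int64": [-1, 0, 1],
    "Int64": [1, 0, pd.NA],
    "float64": [1.0, 0.0, float("nan")],
    "object": ["1", "2", pd.NA],
    "string": ["1", "2", pd.NA],
}


def probe_casts(samples: dict) -> pd.DataFrame:
    """Try every source -> target cast and record "ok" or the exception name."""
    rows = {}
    for source, values in samples.items():
        series = pd.Series(values, dtype=source)
        row = {}
        for target in DTYPES:
            try:
                series.astype(target)
                row[target] = "ok"
            except Exception as exc:  # record any failure instead of stopping
                row[target] = type(exc).__name__
        rows[source] = row
    # Rows are source dtypes, columns are target dtypes.
    return pd.DataFrame.from_dict(rows, orient="index")


table = probe_casts(SAMPLES)
print(table)
```

Because a failing cast is recorded by exception name, running the probe on two pandas versions and comparing the tables shows which conversions changed.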
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 15f843ab102d7a0cd7f1c7870dfec72d0e28d252
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-52-generic
Version : #57~18.04.1-Ubuntu SMP Thu Oct 15 14:04:49 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.0.dev0+1059.g15f843ab1.dirty
numpy : 1.18.5
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 5.4.3
hypothesis : 5.41.1
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.4
fastparquet : 0.4.1
gcsfs : 0.7.1
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.3
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2