BUG: .str.startswith(..., na=False) consistency between categorical and string series (again) · Issue #36241 · pandas-dev/pandas

Code Sample, a copy-pastable example

    >>> import pandas as pd
    >>> import numpy as np
    >>> s = pd.Series(["a", "b", np.nan], index=["a", "b", np.nan], dtype="category")
    >>> df = pd.DataFrame().assign(
    ...     cat_contains=s.str.contains("a", na=False),
    ...     cat_startswith=s.str.startswith("a", na=False),
    ...     cat_endswith=s.str.endswith("a", na=False),
    ...     str_contains=s.astype("string").str.contains("a", na=False),
    ...     str_startswith=s.astype("string").str.startswith("a", na=False),
    ...     str_endswith=s.astype("string").str.endswith("a", na=False),
    ... )
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 3 entries, a to nan
    Data columns (total 6 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   cat_contains    3 non-null      bool
     1   cat_startswith  2 non-null      object
     2   cat_endswith    2 non-null      object
     3   str_contains    3 non-null      boolean
     4   str_startswith  3 non-null      boolean
     5   str_endswith    3 non-null      boolean
    dtypes: bool(1), boolean(3), object(2)
    memory usage: 93.0+ bytes

    >>> df
         cat_contains  cat_startswith  cat_endswith  str_contains  str_startswith  str_endswith
    a            True            True          True          True            True          True
    b           False           False         False         False           False         False
    NaN         False             NaN           NaN         False           False         False

Problem description

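`.str.startswith(..., na=False)` and `.str.endswith(..., na=False)` should return False for missing values when the calling Series has categorical dtype, just as they do for a string Series. Instead, the missing value stays NaN and the result has object dtype rather than a boolean dtype.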

Similar to #22158, but .str.contains works here.
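Until the categorical path behaves the same way, one possible workaround is to route the call through the nullable "string" dtype, which honoured `na=False` in the sample above. The following is only a sketch under that assumption; the variable name `mask` and the final cast to a plain NumPy `bool` are illustrative choices, not something pandas requires.

```python
import numpy as np
import pandas as pd

# Same categorical Series as in the code sample above.
s = pd.Series(["a", "b", np.nan], index=["a", "b", np.nan], dtype="category")

# Convert to the nullable "string" dtype first, so that na=False is applied
# to the missing entry (as observed for the str_* columns in the report).
mask = s.astype("string").str.startswith("a", na=False)

# Optional: the result has the nullable "boolean" dtype; since it no longer
# contains missing values, it can be cast to a plain NumPy bool array.
mask = mask.astype(bool)

print(mask.tolist())
```

The conversion copies the data, so for large categoricals it may be cheaper to apply the method to `s.cat.categories` and map the result back; that variant is not shown here.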

Expected Output

    >>> df
         cat_contains  cat_startswith  cat_endswith  str_contains  str_startswith  str_endswith
    a            True            True          True          True            True          True
    b           False           False         False         False           False         False
    NaN         False           False         False         False           False         False
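The expected output can also be stated as a small check: for each dtype and each method, the result of calling it with `na=False` should contain no missing values. The sketch below reports this per combination instead of asserting it, so it runs on versions with and without the bug. The helper name `honours_na_false` and the `report` dictionary are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd


def honours_na_false(series: pd.Series, method: str, pat: str) -> bool:
    """Return True if series.str.<method>(pat, na=False) has no missing values."""
    result = getattr(series.str, method)(pat, na=False)
    return not result.isna().any()


s = pd.Series(["a", "b", np.nan], index=["a", "b", np.nan], dtype="category")

# Check every (dtype, method) combination from the code sample.
report = {
    (dtype_name, method): honours_na_false(series, method, "a")
    for dtype_name, series in [("category", s), ("string", s.astype("string"))]
    for method in ("contains", "startswith", "endswith")
}

for (dtype_name, method), ok in report.items():
    print(f"{dtype_name:>8}.str.{method:<10} honours na=False: {ok}")
```

On the pandas version from this report, the `category` rows for `startswith` and `endswith` would be expected to print `False`, matching the NaN entries shown in the code sample.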

Output of `pd.show_versions()`

Using a conda environment created with `conda create -n pandas112 -c conda-forge pandas=1.1.2`.

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Korean_Korea.949

pandas : 1.1.2
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200814
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None