BUG: replace fails with IndexError when regex parameter is passed a dictionary with multiple items · Issue #39338 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
import numpy as np

data = df = pd.DataFrame({
    'a_str': ['A1', 'A2', 'A3'],
    'b_int': ['1,000', '200', '3'],
    'c_str': ['C1', 'C2', 'C3'],
    'd_date': ['2021-01-01', '', '2021-03-03'],
})
non_string_columns = ['b_int', 'd_date']

df[non_string_columns] = df[non_string_columns].replace(regex={'': np.nan, ',': ''})
Traceback (most recent call last):
  File "c:\Users\donder.vscode\extensions\ms-python.python-2021.1.502429796\pythonFiles\lib\python\debugpy_vendored\pydevd_pydevd_bundle\pydevd_vars.py", line 416, in evaluate_expression
    compiled = compile(_expression_to_evaluate(expression), '<string>', 'eval')
  File "<string>", line 1
    df[non_string_columns] = df[non_string_columns].replace(regex={'':np.nan,',':''})
                           ^
SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "E:\Python\Python38\lib\site-packages\pandas\core\frame.py", line 4521, in replace
    return super().replace(
  File "E:\Python\Python38\lib\site-packages\pandas\core\generic.py", line 6842, in replace
    return self.replace(
  File "E:\Python\Python38\lib\site-packages\pandas\core\frame.py", line 4521, in replace
    return super().replace(
  File "E:\Python\Python38\lib\site-packages\pandas\core\generic.py", line 6891, in replace
    new_data = self._mgr.replace_list(
  File "E:\Python\Python38\lib\site-packages\pandas\core\internals\managers.py", line 664, in replace_list
    bm = self.apply(
  File "E:\Python\Python38\lib\site-packages\pandas\core\internals\managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "E:\Python\Python38\lib\site-packages\pandas\core\internals\blocks.py", line 901, in _replace_list
    result = blk._replace_coerce(
  File "E:\Python\Python38\lib\site-packages\pandas\core\internals\blocks.py", line 1643, in _replace_coerce
    return self._replace_regex(
  File "E:\Python\Python38\lib\site-packages\pandas\core\internals\blocks.py", line 844, in _replace_regex
    replace_regex(new_values, rx, value, mask)
  File "E:\Python\Python38\lib\site-packages\pandas\core\array_algos\replace.py", line 133, in replace_regex
    values[mask] = f(values[mask])
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 2
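The final IndexError comes from NumPy boolean indexing rather than from the regex matching itself. A minimal sketch of that failure mode follows. The shapes are an assumption read off the error message (an array with one row indexed by a mask with two rows); they have not been traced through the pandas internals, and the variable names are purely illustrative.

```python
import numpy as np

# Assumed shapes, for illustration only: an array holding a single row of
# three values, and a boolean mask computed for two rows.
values = np.array([['1,000', '200', '3']], dtype=object)  # shape (1, 3)
mask = np.ones((2, 3), dtype=bool)                         # shape (2, 3)

error = None
try:
    # Same pattern as the last frame of the traceback: values[mask] = f(values[mask])
    values[mask] = values[mask]
except IndexError as exc:
    error = exc
    print(type(exc).__name__, exc)
```

The exact wording of the message may differ between NumPy versions, but the shape mismatch along the first axis is what triggers it.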
Problem description
In the DataFrame columns listed in non_string_columns, the code above should replace every empty string with np.nan and remove every comma. Instead, it raises the IndexError shown in the traceback.
If I split the call into two separate replace calls, one per pattern, the resulting DataFrame is correct:
df[non_string_columns] = df[non_string_columns].replace(regex={'': np.nan})
df[non_string_columns] = df[non_string_columns].replace(regex={',': ''})
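Another possible workaround avoids the regex= dictionary altogether and cleans each column with string methods. This is only a sketch, not a fix for the bug: clean_column is a hypothetical helper name, and it relies on Series.str.replace with regex=False and on Series.mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a_str': ['A1', 'A2', 'A3'],
    'b_int': ['1,000', '200', '3'],
    'c_str': ['C1', 'C2', 'C3'],
    'd_date': ['2021-01-01', '', '2021-03-03'],
})
non_string_columns = ['b_int', 'd_date']


def clean_column(col):
    """Hypothetical helper: drop commas, then turn empty strings into NaN."""
    # Literal (non-regex) removal of every comma.
    stripped = col.str.replace(',', '', regex=False)
    # Replace cells that are exactly the empty string with NaN.
    return stripped.mask(stripped == '', np.nan)


for name in non_string_columns:
    df[name] = clean_column(df[name])

print(df)
```

The values in b_int are still strings after this step; converting them to integers would be a separate call.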
Expected Output
I expected the dataframe to look like the following:
  a_str b_int c_str      d_date
0    A1  1000    C1  2021-01-01
1    A2   200    C2         NaN
2    A3     3    C3  2021-03-03
I get this error whether I am using Python 3.8.3 or 3.9.1. Also, the numpy version under Python 3.9.1 is 1.19.5.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 9d598a5
python : 3.8.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 58 Stepping 0, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252
pandas : 1.2.1
numpy : 1.18.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.3.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.15
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
None