BUG: maximum recursion error when replacing empty lists by minggli · Pull Request #22083 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation22 Commits22 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
- closes Pandas Series: maximum recursion error when replacing empty lists #21977
- closes Series/DataFrame.replace crashes without exception in some cases (concerning lists) #19266
- tests added / passed
- passes
git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
Internal mechanism in replace method is intertwined and can cause RecursionError (not very helpful). Discussed in #21977, there doesn't appear to be first class support for ExtensionArray yet and this issue is right now treated as error handling, having investigated treating it as new use case but not easily co-exist without large change.
The RecursionError is caused by Block.replace
and ObjectBlock.replace
calling each other under certain circumstances such as in the linked issues.
New approach is now to prevent re-casting of blocks to object blocks if it is already an ObjectBlock and instead raise the error caught as it is, because if it's an ObjectBlock already then ValueError and TypeError raised won't go away when re-casted to object and re-try, resulting in infinite loop and RecursionError.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the example as a test
@@ -1000,8 +999,8 @@ def putmask(self, mask, new, align=True, inplace=False, axis=0, |
---|
if not (mask.shape[-1] == len(new) or |
mask[mask].shape[-1] == len(new) or |
len(new) == 1): |
raise ValueError("cannot assign mismatch " |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to change? (or is this what is actually causing the issue); this is not exposed to the user, correct?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this ValueError caused Block.replace and ObjectBlock.replace to form infinite loop and ultimately raise RecursionError. It's because Block.replace catches exception include ValueError and if it fails, delegate replace to ObjectBlock. But ObjectBlock.replace also uses super(ObjectBlock, self), hence the loop. The exception handling includes TypeError and ValueError and they are both depended on (as appear to me) by other casting cases.
Open to suggestion, not sure a better way to handle, at the moment this change is needed to address the issue in question.
This can be exposed to user when:
import pandas as pd
s = pd.Series(range(10)) mask = s > 5
s[mask] = [5, 4, 3, 2, 1]
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know why we are catching ValueError at a higher level, maybe can get away with only catching TypeError and let ValueError buble up?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point and it appears below test (only this test) fails when we only catch TypeError:
_________________ TestDataFrameReplace.test_replace_datetimetz _________________
# coerce to object
result = df.copy()
result.iloc[1, 0] = np.nan
result = result.replace(
{'A': pd.NaT}, Timestamp('20130104', tz='US/Pacific'))pandas/tests/frame/test_replace.py:1060:
elif isinstance(other, (np.datetime64, datetime, date)): other = tslibs.Timestamp(other) tz = getattr(other, 'tz', None) # test we can have an equal time zone if tz is None or str(tz) != str(self.values.tz): raise ValueError("incompatible or non tz-aware value")
E ValueError: incompatible or non tz-aware value
pandas/core/internals/blocks.py:2898: ValueError
It fails because of above ValueError which is passed on to ObjectBlock.replace as part of error handling in Block.replace. so it appears it is expected to be handled after the block is casted as object and try again.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback made some changes now we don't touch above error but instead make sure we don't enter infinite loop.
added test cases too from two issues linked. 🚀
@jreback gentle nudge.
# GH 19266 and GH 21977 |
---|
# ValueError triggers try except block in Block.replace |
# causing RecursionError |
raise Exception("cannot assign mismatch length " |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this a bit more informative from a user POV?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. this change here is exposed to user so when one tries to assign an array of n-dimension to mask of m-dimension, when m != n, this error throws.
import pandas as pd s = pd.Series(range(10)) mask = s > 5
s[mask].shape (4,)
s[mask] = [5, 4, 3, 2, 1]
File "/Users/mingli/.virtualenvs/cust-value/lib/python3.5/site-packages/pandas/core/internals.py", line 1006, in putmask raise ValueError("cannot assign mismatch " ValueError: cannot assign mismatch length to masked array
This is also used internally by replace
and ValueError it raises is causing infinite loop as per linked issues. Whilst changing that to not ValueError will solve it, it also makes it less clear and potentially breaks users' codes if they expect to catch ValueError running above code snippet.
Do you think it's ok to use Exception here as it would solve the issues linked but users might have to change their code?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is reverted.
Hello @minggli! Thanks for updating the PR.
Cheers ! There are no PEP8 issues in this Pull Request. 🍻
Comment last updated on August 07, 2018 at 13:23 Hours UTC
@@ -603,6 +603,20 @@ def test_replace_list(self): |
---|
assert_frame_equal(res, expec) |
def test_replace_with_empty_list(self): |
# GH 21977 |
s = pd.Series([['a', 'b'], np.nan, [1]]) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to include an empty list in this test case (exists in the series
module one)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point!
@@ -667,7 +667,8 @@ Reshaping |
---|
- Bug in :meth:`DataFrame.replace` raises RecursionError when converting OutOfBounds ``datetime64[ns, tz]`` (:issue:`20380`) |
- :func:`pandas.core.groupby.GroupBy.rank` now raises a ``ValueError`` when an invalid value is passed for argument ``na_option`` (:issue:`22124`) |
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`) |
- |
- Bug in :meth:`DataFrame.replace` raises RecursionError when replace empty lists (:issue:`22083`) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor nits but put RecursionError
in double backticks and change replace
to replacing
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! changed.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments, ping on green.
convert=convert) |
---|
# GH 22083, TypeError or ValueError occurred within error handling |
# causes infinite loop. Cast and retry only if not objectblock. |
if self.dtype == 'object': |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use is_object_dtype
# causes infinite loop. Cast and retry only if not objectblock. |
---|
if self.dtype == 'object': |
raise |
else: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need this else here
minggli deleted the bugfix/replace_recursion branch
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request