BUG: maximum recursion error when replacing empty lists by minggli · Pull Request #22083 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation22 Commits22 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

minggli

Internal mechanism in replace method is intertwined and can cause RecursionError (not very helpful). Discussed in #21977, there doesn't appear to be first class support for ExtensionArray yet and this issue is right now treated as error handling, having investigated treating it as new use case but not easily co-exist without large change.

The RecursionError is caused by Block.replace and ObjectBlock.replace calling each other under certain circumstances such as in the linked issues.

New approach is now to prevent re-casting of blocks to object blocks if it is already an ObjectBlock and instead raise the error caught as it is, because if it's an ObjectBlock already then ValueError and TypeError raised won't go away when re-casted to object and re-try, resulting in infinite loop and RecursionError.

@minggli

@minggli

@minggli

@minggli

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the example as a test

@@ -1000,8 +999,8 @@ def putmask(self, mask, new, align=True, inplace=False, axis=0,
if not (mask.shape[-1] == len(new) or
mask[mask].shape[-1] == len(new) or
len(new) == 1):
raise ValueError("cannot assign mismatch "

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to change? (or is this what is actually causing the issue); this is not exposed to the user, correct?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this ValueError caused Block.replace and ObjectBlock.replace to form infinite loop and ultimately raise RecursionError. It's because Block.replace catches exception include ValueError and if it fails, delegate replace to ObjectBlock. But ObjectBlock.replace also uses super(ObjectBlock, self), hence the loop. The exception handling includes TypeError and ValueError and they are both depended on (as appear to me) by other casting cases.

Open to suggestion, not sure a better way to handle, at the moment this change is needed to address the issue in question.

This can be exposed to user when:

import pandas as pd

s = pd.Series(range(10)) mask = s > 5

s[mask] = [5, 4, 3, 2, 1]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know why we are catching ValueError at a higher level, maybe can get away with only catching TypeError and let ValueError buble up?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point and it appears below test (only this test) fails when we only catch TypeError:

_________________ TestDataFrameReplace.test_replace_datetimetz _________________
# coerce to object
result = df.copy()
result.iloc[1, 0] = np.nan
result = result.replace(
{'A': pd.NaT}, Timestamp('20130104', tz='US/Pacific'))

pandas/tests/frame/test_replace.py:1060:


   elif isinstance(other, (np.datetime64, datetime, date)):
       other = tslibs.Timestamp(other)
       tz = getattr(other, 'tz', None)

       # test we can have an equal time zone
       if tz is None or str(tz) != str(self.values.tz):
        raise ValueError("incompatible or non tz-aware value")

E ValueError: incompatible or non tz-aware value

pandas/core/internals/blocks.py:2898: ValueError

It fails because of above ValueError which is passed on to ObjectBlock.replace as part of error handling in Block.replace. so it appears it is expected to be handled after the block is casted as object and try again.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback made some changes now we don't touch above error but instead make sure we don't enter infinite loop.

@minggli

@minggli

@minggli

@codecov

@minggli

added test cases too from two issues linked. 🚀

@jreback gentle nudge.

jreback

# GH 19266 and GH 21977
# ValueError triggers try except block in Block.replace
# causing RecursionError
raise Exception("cannot assign mismatch length "

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a bit more informative from a user POV?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. this change here is exposed to user so when one tries to assign an array of n-dimension to mask of m-dimension, when m != n, this error throws.

import pandas as pd s = pd.Series(range(10)) mask = s > 5

s[mask].shape (4,)

s[mask] = [5, 4, 3, 2, 1]

File "/Users/mingli/.virtualenvs/cust-value/lib/python3.5/site-packages/pandas/core/internals.py", line 1006, in putmask raise ValueError("cannot assign mismatch " ValueError: cannot assign mismatch length to masked array

This is also used internally by replace and ValueError it raises is causing infinite loop as per linked issues. Whilst changing that to not ValueError will solve it, it also makes it less clear and potentially breaks users' codes if they expect to catch ValueError running above code snippet.

Do you think it's ok to use Exception here as it would solve the issues linked but users might have to change their code?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is reverted.

@minggli

@minggli

@pep8speaks

Hello @minggli! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 07, 2018 at 13:23 Hours UTC

@minggli

@minggli

WillAyd

@@ -603,6 +603,20 @@ def test_replace_list(self):
assert_frame_equal(res, expec)
def test_replace_with_empty_list(self):
# GH 21977
s = pd.Series([['a', 'b'], np.nan, [1]])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to include an empty list in this test case (exists in the series module one)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point!

@@ -667,7 +667,8 @@ Reshaping
- Bug in :meth:`DataFrame.replace` raises RecursionError when converting OutOfBounds ``datetime64[ns, tz]`` (:issue:`20380`)
- :func:`pandas.core.groupby.GroupBy.rank` now raises a ``ValueError`` when an invalid value is passed for argument ``na_option`` (:issue:`22124`)
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
-
- Bug in :meth:`DataFrame.replace` raises RecursionError when replace empty lists (:issue:`22083`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor nits but put RecursionError in double backticks and change replace to replacing

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! changed.

@minggli

@minggli

WillAyd

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, ping on green.

convert=convert)
# GH 22083, TypeError or ValueError occurred within error handling
# causes infinite loop. Cast and retry only if not objectblock.
if self.dtype == 'object':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use is_object_dtype

# causes infinite loop. Cast and retry only if not objectblock.
if self.dtype == 'object':
raise
else:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need this else here

@minggli

jreback

@jreback

@minggli minggli deleted the bugfix/replace_recursion branch

August 7, 2018 16:00

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request

Oct 1, 2018

@minggli