BUG: Add type promotion support for eval() expressions with many properties by xenoscopic · Pull Request #6205 · pandas-dev/pandas
Conversation
This commit modifies the call to numpy.result_type to get around the
NPY_MAXARGS limit, which at the moment is 32. Instead of passing a
generator of all types involved in an expression, the type promotion
is done on a pair-wise basis with a call to reduce.
This fixes bugs for code such as the following:
from numpy.random import randn
from pandas import DataFrame
d = DataFrame(randn(10, 2), columns=list('ab'))
# Evaluates fine
print(d.eval('*'.join(['a'] * 32)))
# Fails to evaluate due to NumPy argument limits
print(d.eval('*'.join(['a'] * 33)))
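The workaround can be illustrated directly on a long list of dtypes (a standalone sketch, not the PR's code): pairwise promotion only ever hands numpy.result_type two arguments at a time, so the NPY_MAXARGS limit never comes into play.

```python
from functools import reduce

import numpy as np

# a type list longer than NPY_MAXARGS (32); passing all of these to
# np.result_type in a single call is what used to fail
dtypes = [np.dtype('int32'), np.dtype('float32')] * 20

# reduce promotes two types at a time, so each individual call to
# np.result_type sees only two arguments
promoted = reduce(np.result_type, dtypes)
print(promoted)  # float64
```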
If someone could chime in to just double check my understanding of the behavior of numpy.result_type, that would be nice.
I think this is an important fix, otherwise it is essentially impossible to evaluate expressions whose tree representation have more than 32 leaves. Given that this limit seems imposed by NumPy, I assume it was not a conscious decision in the design of Pandas.
I wasn't aware of this limit. Thanks for catching it.
# use reduce instead of passing the full generator as *args to
# result_type, because for large expressions there are often more than
# NPY_MAXARGS (32) types
return reduce(np.result_type,
You should move reduce to pandas.compat and import it from there, since there's no builtin reduce in Python 3.
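A minimal shim of the kind suggested here (pandas.compat's actual layout may differ): Python 3 moved reduce out of the builtins into functools, so a compat module can re-export it for both versions.

```python
# On Python 2 the bare name resolves to the builtin; on Python 3 the
# lookup fails and we fall back to the functools version.
try:
    reduce = reduce                    # Python 2: reduce is a builtin
except NameError:
    from functools import reduce       # Python 3

# callers then do `from pandas.compat import reduce` (hypothetical path)
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10
```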
It looks like reduce is actually already in pandas.compat, I can go ahead and change to that. Do you have any qualms about just doing:
from pandas.compat import reduce
and overriding the built-in reduce for Python 2.x? It's actually the same as the built-in for 2.x, but I can rename it something like compat_reduce depending on your coding preferences.
No rush or anything on the review. I want to make sure that my understanding of the result_type function (particularly in the context of the eval() infrastructure) is correct so this doesn't start returning some weird types from eval.
I've gone ahead and added a reduce import from pandas.compat as well and squashed it into this commit.
@havoc-io pls add the above test (which should fail without this change, and pass with it)
Importing from compat is fine
Added a test which fails without the change and succeeds with it. The test will raise an exception under the old code. Do you prefer to catch and then manually trigger failure - or just let the test framework catch the exceptions and mark them as E?
# there is no way to get the value programmatically inside Python, so |
---|
# we just hardcode a value > 32 here |
df.eval('*'.join(('a') * 33)) |
no, you need to assert that this works, e.g. compare with a known value and use assert_frame_equal
by definition w/o the change this will fail but with the change will return the correct value (and thus the test will pass)
so construct the expected result and compare
Also, just use pd.eval here. There's no need to construct a DataFrame. Just something like
def check_many_exprs(parser, engine):
    a = 1
    expr = ' * '.join('a' * 33)
    expected = 1
    res = pd.eval(expr, parser=parser, engine=engine)
    tm.assert_equal(res, expected)

def test_many_exprs():
    for parser, engine in PARSERS_ENGINES:
        yield check_many_exprs, parser, engine
We generally frown upon yield-style nose tests, but they are useful to avoid repetition of test functions only differing in their arguments.
Does implementing the test in TestEvalNumexprPandas not implement it for all parser engines? If not, where are you suggesting this test code be put?
It doesn't automatically give pd.eval the parser and engine; it's a class attribute (as opposed to an instance attribute). You have to pass self.parser and self.engine when you call pd.eval. You can probably just copy and paste what I wrote above and put that into test_eval.py. I'm not sure if PARSERS_ENGINES is called that, it might be something else, but it's just itertools.product(PARSERS, ENGINES). PARSERS and ENGINES are tuples or lists (can't remember) of the supported parsers/engines.
I've updated the commit. The issue was marginally more extensive than I originally anticipated, so rather than scatter calls to reduce everywhere, I've added a pandas.computation.common._result_type_many utility function as a wrapper for calling numpy.result_type when passing an indefinite and potentially large number of arguments. It also slightly changes the behavior of my previous code to ensure that the call correctly succeeds for argument lists of length 1.

I've left in-place calls to numpy.result_type which have a small, definite number of arguments, since the wrapper function is not necessary there.
In addition, I've added the test recommended by @cpcloud (thanks!).
Another thing to check is the associativity and commutativity of result_type. If it isn't both, then using reduce isn't going to give the same result as result_type. Intuitively I would expect it to be both, akin to a sort of max operation on dtypes, but maybe there's some edge case that I'm not thinking of.
That was my intuition as well, and that does appear to be the case. The NumPy result_type call stack (since NumPy 1.6) effectively looks like:
- result_type (an alias for the C function array_result_type, which effectively just repackages the input array to be C-friendly), which calls
- PyArray_ResultType, which calls
- PyArray_PromoteTypes

The C code is a bit difficult to parse from a conceptual standpoint, but the documentation for the latter two functions gives a more concise explanation, which does indeed portray the functions as an effective "max" operation on dtypes. PyArray_PromoteTypes even explicitly states that it is symmetric and associative.
I don't think it would be possible to write an exhaustive test to confirm this, but I think the algorithm described in the documentation is quite clearly symmetric and associative. Aside from pinging the numpy developers, I'm not really sure how else to verify.
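While an exhaustive test isn't feasible, a spot check over a handful of dtypes is easy to write (this is a sanity check, not a proof):

```python
import itertools

import numpy as np

# a small sample of dtypes to probe the promotion rules
dtypes = [np.dtype(t) for t in ('int8', 'int32', 'float32', 'float64')]

# symmetry: result_type(a, b) == result_type(b, a)
for a, b in itertools.product(dtypes, repeat=2):
    assert np.result_type(a, b) == np.result_type(b, a)

# associativity: grouping must not matter, or reduce would be wrong
for a, b, c in itertools.product(dtypes, repeat=3):
    lhs = np.result_type(np.result_type(a, b), c)
    rhs = np.result_type(a, np.result_type(b, c))
    assert lhs == rhs
```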
try:
    first = next(arg_iter)
except StopIteration:
    raise ValueError('at least one array or dtype is required')
Just do

if not arrays_and_dtypes:
    raise ValueError("your message")

Creating an iterator and catching StopIteration is a bit overkill.
I understand why you originally did that, but you don't need to pass the initial argument to reduce; it'll use the first element of the sequence if you don't pass anything.
The problem is that if the argument sequence has a length of only 1, then reduce will never pass it to numpy.result_type, and consequently it won't come back as a dtype; it'll just be whatever the original argument is.
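The length-1 behavior is easy to demonstrate in isolation (a standalone sketch):

```python
from functools import reduce

import numpy as np

arr = np.zeros(3, dtype='int32')

# with a single-element sequence and no initializer, reduce never calls
# the function at all -- it just returns the lone element unchanged
assert reduce(np.result_type, [arr]) is arr  # still the array, not a dtype

# calling result_type directly yields the dtype we actually want
print(np.result_type(arr))  # int32
```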
Okay, done, have a peek at the revised version.
looks ok 2 me ... i'll merge when it passes
@havoc-io pls add a release notes (you will have to rebase as I just added the 0.14.0 section); add in bug fixes pls (you can ref this PR)
I've added an entry to the 0.14.0 bug fix log.
Looks like CI build passed - good to merge?
pls rebase on master; the release notes are in conflict
as soon as you do I can merge
Should be sorted now. Thanks.
jreback added a commit that referenced this pull request
BUG: Add type promotion support for eval() expressions with many properties
jreback added a commit that referenced this pull request
…kipping if not installed
@havoc-io FYI, I had to move this to a class (rather than the global checker), as it didn't properly disable numexpr when necessary (e.g. when it was not available).
it works, but pls confirm that it is still correctly checking
gr8! this didn't show up till the windows builds started failing...they don't have numexpr installed at all