ENH: make is_list_like handle non iterable numpy-like arrays correctly by znicholls · Pull Request #35127 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

closes ENH: Identify numpy-like zero-dimensional arrays as non-iterable #35131
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Hello @znicholls! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-13 08:43:11 UTC

znicholls changed the title~~Non iterable duck arrays~~ is_list_like correctly handles non iterable duck arrays

Jul 5, 2020

This was referenced

Jul 5, 2020

znicholls changed the title~~is_list_like correctly handles non iterable duck arrays~~ is_list_like correctly handles non iterable numpy-like arrays

Jul 5, 2020

znicholls changed the title~~is_list_like correctly handles non iterable numpy-like arrays~~ make is_list_like handle non iterable numpy-like arrays correctly

Jul 5, 2020

znicholls marked this pull request as ready for review

July 5, 2020 23:11

This is likely to have a non-trivial effect on performance. It might be time to implement a separate is_list_like (and possibly is_scalar) with less-strict/EA-customizable behavior (cc @jorisvandenbossche I know this has come up w/r/t geopandas)

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

I can check with the other pint-pandas developers.

Can I also ask which benchmarks I should look at to examine whether performance has taken a hit?

I'm not completely sure why, but reverting here for simplicity

yeah, sorry, getattr only checks for AttributeError but pint raises TypeError

yeah, sorry, getattr only checks for AttributeError but pint raises TypeError

I think it would have to be getattr(obj, "ndim", 1) != 0 i.e. you assume everything is iterable unless you have evidence otherwise.

I think the test failures are unrelated to our change?

# CI / Web and docs failure text (I think this is the relevant bit)
...
WARNING: autodoc: failed to import method 'sparse.from_spmatrix' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.to_coo' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.to_dense' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.from_coo' from module 'Series'; the following exception was raised:
No module named 'Series'
WARNING: autodoc: failed to import method 'sparse.to_coo' from module 'Series'; the following exception was raised:
No module named 'Series'
...

znicholls changed the title~~make is_list_like handle non iterable numpy-like arrays correctly~~ ENH: make is_list_like handle non iterable numpy-like arrays correctly

Jul 6, 2020

This is likely to have a non-trivial effect on performance.

To have an idea about this, I timed the following cases:

a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0]

In [2]: %timeit pd.api.types.is_list_like(a1)      
390 ns ± 1.56 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master  
553 ns ± 1.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # PR

In [3]: %timeit pd.api.types.is_list_like(a2) 
405 ns ± 0.972 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master
408 ns ± 0.745 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # PR

In [4]: %timeit pd.api.types.is_list_like(a3)  
395 ns ± 1.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master
411 ns ± 1.55 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)    PR

So for numpy arrays the slowdown is only very limited (I assume because in this case we needed to look up obj.ndim anyway, so then then getattr(obj, "ndim") seems to be similar). For actual lists, the slowdown is more significant, though (but I would think the array cases are most relevant for performance).

It might be time to implement a separate is_list_like (and possibly is_scalar) with less-strict/EA-customizable behavior

It's not only EAs, though, but eg also for object dtype (see #26333, #27911)

But indeed, a less strict is_list_like check depending on the dtype of the involved data might be worth exploring, although not that easy I think. The first thing would be to identify the cases where this less strict case is actually important. I suppose the problems in pint-pandas mainly occur in indexing + arithmetic/comparison operations ?

These were the tests which started passing when deleting the __iter__ method from the scalar class to make is_list_like return False:

Details

TestGroupby::test_groupby_apply_identity
TestMethods::test_fillna_copy_frame
TestMethods::test_fillna_copy_series
TestMissing::test_fillna_scalar
TestMissing::test_fillna_series
TestMissing::test_fillna_frame
TestReshaping::test_unstack[series-index1]
TestReshaping::test_unstack[series-index3]
TestReshaping::test_unstack[frame-index1]
TestReshaping::test_unstack[frame-index3]
TestSetitem::test_setitem_mask_broadcast[loc]
TestSetitem::test_setitem_sequence_broadcasts[True]
TestSetitem::test_setitem_integer_array[True-list]
TestSetitem::test_setitem_integer_array[True-integer-array]
TestSetitem::test_setitem_integer_array[True-numpy-array]
TestSetitem::test_setitem_slice[True]
TestSetitem::test_setitem_loc_iloc_slice

It looks like using numpy.iterable speeds up the list and array cases but not the numpy scalar one (which I think makes sense as you have to do some more thinking on that one). (This comment is pending tests passing...)

import numpy as np import pandas as pd

a1 = [1, 2, 3] a2 = np.array([1, 2, 3]) a3 = np.array([1, 2, 3])[0]

%timeit pd.api.types.is_list_like(a1) 289 ns ± 16.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR 447 ns ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a2) 302 ns ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR 466 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a3) 596 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR 440 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

Having dug a bit deeper, we're going to need to back to the drawing board on this. My impression is that in order to be handled by is_list_like, is_scalar and item_from_zerodim correctly, the class has to basically be a numpy array (or sub-class thereof). I'll go back to the pint-pandas devs and we can discuss how we go forward there, including identifying the cases where a less strict is_list_like, is_scalar and item_from_zerodim would solve our problems.

I'm happy to close this in the meantime. However it would be great if we could reopen #22582 to at least add pint-pandas to the docs (buggy as it is).

I would be nice to talk to geopandas devs to see if we can join forces. We might be facing the same problem.

@hgrecco you are alread talking to them ;) (well, at least, I am one of the geopandas devs)
See also my comment at #35131 (comment): although very much related, I think it's not exactly the same. As for GeoPandas, the scalar elements we put in a Series are actually iterable (in contrast to the (duck-typed) numpy scalars, which are iterable in theory (have the dunder method), but raise once you try to iterate).

It looks like using numpy.iterable speeds up the list and array cases

It seems that np.iterable is actually just a try/except of iter(value):

    try:
        iter(y)
    except TypeError:
        return False
    return True

(if we decide to use this, I would prefer doing the code above explicitly instead of using np.iterable, as I think most people are not familiar with that numpy method, and the code is simple).

But so basically it is a choosing between isinstance(value, Iterable) + special case check to exclude ndim==0 vs try/except of iter(value) (not fully sure if there are other consequences of the choice between both)

@jorisvandenbossche thanks for the clarification.

@znicholls Have you tried reinstating getattr(obj, "ndim", 1) != 0 and moving it first? Python's and/or are shortcircuit operators so we might want to put first those checks which are fast and likely to return false.

Reverting the changes as suggested by @hgrecco leads to a slowdown for lists, no change for numpy arrays and a speedup for numpy scalars (essentially the reverse of previously)

import numpy as np import pandas as pd

a1 = [1, 2, 3] a2 = np.array([1, 2, 3]) a3 = np.array([1, 2, 3])[0]

%timeit pd.api.types.is_list_like(a1) 640 ns ± 6.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR 447 ns ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a2) 414 ns ± 17.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR 466 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a3) 385 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) #PR 440 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

We still have the is_scalar and item_from_zerodim issues of course...

The backstory is #35131. Hopefully that explains why we are looking to make such a change? (@jorisvandenbossche just double checking no progress has been made elsewhere that I've missed?)

A concern about performance impacts was raised in this comment #35127 (comment). After some discussion, we arrived at #35127 (comment).

For completeness, we also did some checks about performance penalties, e.g. #35127 (comment), #35127 (comment) and #35127 (comment). None of our solutions was perfect.

I am not sure what your proposed implementation even looks like.

That makes sense, because I am also unsure. I've only got to step 1 of #35127 (comment). Based on the 'hot spots' identified in master...znicholls:handle-extension-arrays, I am looking for advice about whether we could use less strict (potentially slower) implementations of is_list_like in these places.

If we can use less strict implementations, my only proposed implementation is the changes made in this PR i.e. master...znicholls:non-iterable-duck-arrays. In short, the less strict implementation would check for objects which are numpy-style zero-dimensional arrays with not (hasattr(obj, "ndim") and obj.ndim == 0) rather than not (util.is_array(obj) and obj.ndim == 0) because the latter only works for numpy arrays and doesn't work for numpy-style arrays (such as Pint).

what is the status here

I would greatly appreciate your (or another pandas developer) thoughts on #35127 (comment). I'm not sure I can sum it up any more succinctly than what is in that comment.

If we can use less strict implementations, my only proposed implementation is the changes made in this PR i.e. master...znicholls:non-iterable-duck-arrays. In short, the less strict implementation would check for objects which are numpy-style zero-dimensional arrays with not (hasattr(obj, "ndim") and obj.ndim == 0) rather than not (util.is_array(obj) and obj.ndim == 0) because the latter only works for numpy arrays and doesn't work for numpy-style arrays (such as Pint).

i think this is reasonable. of course asv's will tell us whether its an acceptable perf hit. trying to balance additional complexity (seems minor in this case) with perf vs enhancements.

i think this is reasonable. of course asv's will tell us whether its an acceptable perf hit. trying to balance additional complexity (seems minor in this case) with perf vs enhancements.

Ok great thank you, I will start on that path then.

@jreback I have tried to start small with #39790, if you could take a look that would be great

Now pd.core.dtypes.inference.is_list_like correctly identifies numpy-like scalars as not being iterable

Co-authored-by: keewis keewis@users.noreply.github.com

I'm not completely sure why, but reverting here for simplicity

Thanks for the PR draft, but it appears this draft has stalled in development. Probably best if this change starts in a smaller PR with continuation of #39790 (let us know if you'd like to continue in that PR. Closing.

What exactly is it that is missing here? There seems to be quite a number of issues in many downstream projects, all stalling on #35131. I myself would love to see pint-pandas move forward and would be willing to picking up this PR.

If I understand correctly, the main concern with the proposed solution here is the slight performance regression. To me, that then asks for performance re-optimisation rather than introducing two different versions of is_list_like that do slightly different things - the latter is calling for all kinds of subtle bugs. The size of the diff in #39830 (2500 changed lines) with the solution presented here (150 changed lines) also speaks for this approach here.

As for the performance re-optimisation: It seems there is no regression on actual numpy arrays, only on real lists. Under the assumption that real lists are the only relevant non-array types, we might be able to get by with a simple short-cut test for actual lists. Would an iteration on this PR along these lines be appreciated?

@burnpanck like anything in pandas a proper well formed and tested PR would be considered

ENH: make is_list_like handle non iterable numpy-like arrays correctly by znicholls · Pull Request #35127 · pandas-dev/pandas (original) (raw)

Conversation

Comment last updated at 2021-02-13 08:43:11 UTC

Labels