Follow-up #20347: incorporate review about _get_series_list by h-vetinari · Pull Request #20923 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation22 Commits3 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Closes #20922

recap of #20347 :
Enabling alignment on top of the (v.0.22-)legal list of lists was relatively complex. The allowed argument types implemented in #20347 are as follows:

Type of (element of) "others"        | alone | within list-like
---------------------------------------------------------------------
pd.Series                            |  yes  | yes
pd.Index                             |  yes  | yes
np.ndarray (1-dim)                   |  yes  | yes
DataFrame                            |  yes  | no
np.ndarray (2-dim)                   |  yes  | no
iterators, dict-views, etc.          |  yes  | **
anything else w/ is_list_like = True |  yes  | **

** to be permitted, list-likes (say nxt) within a list-like others must pass (essentially)
all(not is_list_like(x) for x in nxt)
and not be a DF.

Open review points from #20347 from @jreback (and my responses) reproduced in review comments.

PS. Managed to have a brainfart and there's a bug in the RC docs - very sorry. Fixed in this PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# without copy, this could change "others"
# that was passed to str.cat
others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
return ([others[x] for x in others], warn)
elif isinstance(others, np.ndarray) and others.ndim == 2:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback :

this is wrong
i don’t think we can align a ndarray at all like this
let’s can ndarray a that are > 1 dim

My response:
The DF-constructor works as expected for a 2-dim ndarray, but I haven't checked if this is tested behaviour. (essentially, df == DataFrame(df.values, columns=df.columns, index=df.index))

I would suggest not to can 2-dim ndarrays, because they are necessary to avoid alignment on the deprecation path for join:

[...] To disable alignment (the behavior before v.0.23) and silence this warning, use .values on any Series/Index/DataFrame in others. [...]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for this, basically see what happens when I have a non-default index with the Series (e.g. 2,3, 4) or something and it gets aligned with the 0,1,2 of the ndarray-converted-to-DataFrame. It will 'work' but is wrong.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a general comment before all the cases for others because it seems this point wasn't clear:

# Generally speaking, all objects without an index inherit the index
# `idx` of the calling Series/Index - i.e. must have matching length.
# Objects with an index (i.e. Series/Index/DataFrame) keep their own
# index, *unless* ignore_index is set to True.

So, the case you mention does not happen, because an ndarray is always automatically aligned with the calling Series (of course, the lengths must match for this to work).

There are several tests with objects with different indexes, both with join is None (so with force-realign; successful but raising a warning), as well as with the different join-types.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i c, ok then

elif isinstance(others, np.ndarray) and others.ndim == 2:
others = DataFrame(others, index=idx)
return ([others[x] for x in others], False)
elif is_list_like(others):
others = list(others) # ensure iterators do not get read twice etc
if all(is_list_like(x) for x in others):
los = []
fu_wrn = False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback :

can u name this parameter just warn

Done

fu_wrn = fu_wrn or fwn
return (los, fu_wrn)
# test if there is a mix of list-like and non-list-like (e.g. str)
elif (any(is_list_like(x) for x in others)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback

you can make this simpler by just checking for all is not list like (eg strings)
anything else will fail thru to the TypeError

Done

while others:
nxt = others.pop(0) # list-like as per check above
# safety for iterators and other non-persistent list-likes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback :

this whole section needs some work it’s way too hard to read and follow

I tried to comment in detail what's happening - which part is unclear?

is_legal = ((no_deep and nxt.dtype == object)
or all((isinstance(x, compat.string_types)
or (not is_list_like(x) and isnull(x))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback :

isnull already checks for None
only 1d objects are valid here (or all scalars)
do this check up front

I simplified the check, but I can't put it any further upfront (unless I remove the fast track). Need to convert iterators first, and the checking for fast-track for Series/Index etc. also needs to be done before

@TomAugspurger from #20922

Some requests for simplification. Most of the complexity seems to be in the code inferring what kind of regime we're in (one or more alignable objects vs. one or more unalignable objects vs. mix). Being strict here (after deprecating) or requiring additional user input may be worthwhile from a maintenance perspective.

What do you mean by being strict. Could you maybe edit the table from above to reflect what you think should be allowed?

@TomAugspurger
Since I saw you're doing an RC2 due to #20924, would you consider merging this PR (or a separate one) for fixing the embarassing (for me) doc-error?

This PR will (almost certainly) be fully green in ~1h.

Nope, just doing build stuff right now.

# without copy, this could change "others"
# that was passed to str.cat
others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
return ([others[x] for x in others], warn)
elif isinstance(others, np.ndarray) and others.ndim == 2:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other than my comment above the cleanup looks good.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I think the point you mention is a misunderstanding, but I pushed another commit to make it clearer what's happening.

# without copy, this could change "others"
# that was passed to str.cat
others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
return ([others[x] for x in others], warn)
elif isinstance(others, np.ndarray) and others.ndim == 2:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a general comment before all the cases for others because it seems this point wasn't clear:

# Generally speaking, all objects without an index inherit the index
# `idx` of the calling Series/Index - i.e. must have matching length.
# Objects with an index (i.e. Series/Index/DataFrame) keep their own
# index, *unless* ignore_index is set to True.

So, the case you mention does not happen, because an ndarray is always automatically aligned with the calling Series (of course, the lengths must match for this to work).

There are several tests with objects with different indexes, both with join is None (so with force-realign; successful but raising a warning), as well as with the different join-types.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -311,7 +311,7 @@ All one-dimensional list-likes can be arbitrarily combined in a list-like contai

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another point of @TomAugspurger in #20347, responding to the line
All one-dimensional list-likes can be arbitrarily combined in a list-like container (including iterators, ``dict``-views, etc.):
(which is just above this line pointer):

@TomAugspurger

question: what happens with a dict? Do we use the keys, or does it transform to a series and get aligned?

Response 1:

I was talking about d.keys() or d.values(). For passing a dict directly, currently the keys would get read (as that's what x in d would return).

Response 2:

I could of course add another safety that maps d to d.values().
However, I tend to think that this could be left to the python-skills of the user as well -- a dictionary is not "list-like" from the POV of normal python usage (whereas its keys and values are, see d = dict(zip(keys, values))).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a dict is treated consistently here how we handle list-likes else where, we effectively call list on it

@@ -311,7 +311,7 @@ All one-dimensional list-likes can be arbitrarily combined in a list-like contai

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a dict is treated consistently here how we handle list-likes else where, we effectively call list on it

# without copy, this could change "others"
# that was passed to str.cat
others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
return ([others[x] for x in others], warn)
elif isinstance(others, np.ndarray) and others.ndim == 2:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i c, ok then

a dict is treated consistently here how we handle list-likes else where, we effectively call list on it

In that case, I think this PR is good to go.

@jreback @TomAugspurger @jorisvandenbossche
re: treatment of dicts:
FWIW, concat does a very similar kind of operation to str.cat, and does treat dicts differently (than calling list on them):

d = {'a' : pd.Series([0, 1]), 'b': pd.Series(['c', 'd'])}
pd.concat(d, axis=1)
#    a  b
# 0  0  c
# 1  1  d
pd.concat(d.values(), axis=1)
#    0  1
# 0  0  c
# 1  1  d
pd.concat(list(d), axis=1)
# TypeError

OTOH, I'm honestly fine to leave this PR as it is. Special treatment for dicts should be a separate issue, IMHO.

thanks @h-vetinari good followup!

as far as dicts go, certainly welcome clarification if needed (PR / issue all ok), or can just wait and see if that usecase is seen in real life

@jreback
Glad this PR is done, TBH. Thanks for all the reviews and feedback!