WIP/REF: BlockManager.setitem_blockwise by jbrockmendel · Pull Request #39302 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation7 Commits24 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jbrockmendel

Posting largely for discussion with @phofl.

Motivations:

  1. Try to end up with fewer code paths for setitem so that we can ensure consistent inplace/view/copy behavior (xref API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896)
  2. Simplify (ATM this does not simplify things)
  3. Perf: iterating over blocks instead of columns may save some copies (though doing the Index.intersection calls may go the other way) (havent done any profiling yet)

Non-trivial overlap with #39044.
Some of this would be simplified by #39163.
No logical overlap with #39266, but will require rebase.

@pep8speaks

@phofl

Is there a specific place where I can be of help?

@jbrockmendel

Mostly just want to keep you in the loop on what i have in mind since you're working on similar parts of the code.

@jbrockmendel

@phofl this fails the test added in #39280. Thoughts on an appropriate fix?

(a branch that only has the edits to DataFrame._setitem_array has the same failure)

@phofl

Not per se a fix, but I had similar difficulties in #39341

We have to account for duplicate column names in these checks. Maybe using get_indexer_for or something like that if there is a faster alternative. Simply comparing the lenght does not work with duplicates

@phofl

You could use

if self.columns.is_unique:
    len_key = len(key)
else:
    len_key = len(self.columns.get_indexer_non_unique(key))
if len(value) != len_key:
    raise ValueError("Columns must be same length as key")

as a helper functions for both len(value) != len(key) checks. Test will keep failing because the column is now cast to integer, but this is good since this is consistent with other setitem cases. If you like I could implement this in #39341. I am dispatching to the else block there right now, since this does the trick currently.

@jbrockmendel

Labels

Indexing

Related to indexing on series/frames, not to indexes themselves