API / CoW: DataFrame(, copy=False) constructor now gives lazy copy by jorisvandenbossche · Pull Request #50777 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation40 Commits23 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jorisvandenbossche

xref #48998 (bullet point about constructors)

When constructing a DataFrame from Series objects, and when specifying copy=False with CoW enabled, we need to follow the lazy copy with CoW approach to ensure proper copy/view behaviour with the original Series objects.

We could in theory change the default of copy=True (when data passed to the constructor is a dict) to copy=False when CoW is enabled, and this would retain the behaviour as-if the resulting DataFrame is a copy. But, I assume that the major reason for this default is to ensure the resulting DataFrame is consolidated and not consisting of all 1-column blocks. So for that reason I left the default as is for now.

@jorisvandenbossche

This was referenced

Jan 16, 2023

phofl

if hval is val:
hval = val.copy(deep=False)
homogenized.append(hval._values)
parents.append(hval)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this only relevant if reindex did not make a copy?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but in create_block_manager_from_column_arrays (in managers.py) I will only actually use this parent if it has a ref itself, and so if reindex did a copy here, this Series will not have any refs, and it will not be tracked as parent in the end

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True this would work, but the following does not work:

val = Series([1, 2, 3.0])
df = pd.DataFrame({"a": val}, dtype="int64", index=[0, 1, 2])


assert not np.shares_memory(get_array(df, "a"), val._values)

x = get_array(df, "a")
df.iloc[0, 0] = 100
assert np.shares_memory(get_array(df, "a"), x)

In this scenario astype makes a copy while reindex does not. Hence, hval has a ref to the object created from astype which is inherited by the constructed element that is our result. Consequentially, iloc triggers an unnecessary copy

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, if you have both a reindex and astype, that makes things a bit more complicated ..

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated for this by checking in each step if an actual copy was made, and then if any of the steps did make a copy, we don't add the Series to the parents (thus, also if the first astype step copied, and then the next reindex step did not copy, we wouldn't add the parent, and not trigger a CoW unnecessarily)

And added a test that covers your example above (that was failing before the fix)

phofl

@jorisvandenbossche

@phofl this should be ready now

(the current CI failures are unrelated; the arm one also fails on main)

@phofl

This is a bit annoying, simply checking for new_values is old_values is not sufficient (same problem I am having in astype).

val = Series([1, 2, 3])
df = pd.DataFrame({"a": val}, dtype="Int64", index=[0, 1, 2])

x = get_array(df, "a")
df.iloc[0, 0] = 100

The is comparison is False in this case but the data aren't actually copied, hence val and df is updated with the last statement. I guess we have to create a helper function for astype doing a more sophisticated check...

Edit: Can you add this as a test case as well?

phofl

@jorisvandenbossche

@jorisvandenbossche

@jorisvandenbossche

Added an xfailed test case for constructing with dtype="Int64", assuming that #50802 will fix this.

@jorisvandenbossche

phofl

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small comments

phofl

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments for the tests

)
# the shallow copy still shares memory
assert np.shares_memory(get_array(result, "a"), s1.values)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use get_array here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also below

@jorisvandenbossche

@jorisvandenbossche

phofl

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, lgtm. Good to merge on greenish

phofl pushed a commit to phofl/pandas that referenced this pull request

Feb 10, 2023

@jorisvandenbossche @phofl

2 participants

@jorisvandenbossche @phofl