API / CoW: DataFrame(, copy=False) constructor now gives lazy copy by jorisvandenbossche · Pull Request #50777 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation40 Commits23 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
xref #48998 (bullet point about constructors)
When constructing a DataFrame from Series objects, and when specifying copy=False
with CoW enabled, we need to follow the lazy copy with CoW approach to ensure proper copy/view behaviour with the original Series objects.
We could in theory change the default of copy=True
(when data passed to the constructor is a dict) to copy=False
when CoW is enabled, and this would retain the behaviour as-if the resulting DataFrame is a copy. But, I assume that the major reason for this default is to ensure the resulting DataFrame is consolidated and not consisting of all 1-column blocks. So for that reason I left the default as is for now.
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.
This was referenced
Jan 16, 2023
if hval is val: |
---|
hval = val.copy(deep=False) |
homogenized.append(hval._values) |
parents.append(hval) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this only relevant if reindex did not make a copy?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but in create_block_manager_from_column_arrays
(in managers.py) I will only actually use this parent if it has a ref itself, and so if reindex
did a copy here, this Series will not have any refs, and it will not be tracked as parent in the end
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True this would work, but the following does not work:
val = Series([1, 2, 3.0])
df = pd.DataFrame({"a": val}, dtype="int64", index=[0, 1, 2])
assert not np.shares_memory(get_array(df, "a"), val._values)
x = get_array(df, "a")
df.iloc[0, 0] = 100
assert np.shares_memory(get_array(df, "a"), x)
In this scenario astype makes a copy while reindex does not. Hence, hval has a ref to the object created from astype which is inherited by the constructed element that is our result. Consequentially, iloc triggers an unnecessary copy
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yes, if you have both a reindex and astype, that makes things a bit more complicated ..
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated for this by checking in each step if an actual copy was made, and then if any of the steps did make a copy, we don't add the Series to the parents (thus, also if the first astype step copied, and then the next reindex step did not copy, we wouldn't add the parent, and not trigger a CoW unnecessarily)
And added a test that covers your example above (that was failing before the fix)
@phofl this should be ready now
(the current CI failures are unrelated; the arm one also fails on main)
This is a bit annoying, simply checking for new_values is old_values is not sufficient (same problem I am having in astype).
val = Series([1, 2, 3])
df = pd.DataFrame({"a": val}, dtype="Int64", index=[0, 1, 2])
x = get_array(df, "a")
df.iloc[0, 0] = 100
The is comparison is False in this case but the data aren't actually copied, hence val and df is updated with the last statement. I guess we have to create a helper function for astype doing a more sophisticated check...
Edit: Can you add this as a test case as well?
Added an xfailed test case for constructing with dtype="Int64", assuming that #50802 will fix this.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of small comments
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments for the tests
) |
---|
# the shallow copy still shares memory |
assert np.shares_memory(get_array(result, "a"), s1.values) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use get_array here?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also below
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, lgtm. Good to merge on greenish
phofl pushed a commit to phofl/pandas that referenced this pull request
2 participants