API: Add DataFrame.assign method by TomAugspurger · Pull Request #9239 · pandas-dev/pandas
Closes #9229
signature: DataFrame.transform(**kwargs)
- the keyword is the name of the new column (existing columns are overwritten if there's a name conflict, as in dplyr)
- the value is either
  - called on self if it's callable. The callable should be a function of 1 argument, the DataFrame being called on.
  - inserted otherwise
In [7]: df.head()
Out[7]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
In [8]: (df.query('species == "virginica"')
           .transform(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
           .head())
Out[8]:
     sepal_length  sepal_width  petal_length  petal_width    species  sepal_ratio
100           6.3          3.3           6.0          2.5  virginica     1.909091
101           5.8          2.7           5.1          1.9  virginica     2.148148
102           7.1          3.0           5.9          2.1  virginica     2.366667
103           6.3          2.9           5.6          1.8  virginica     2.172414
104           6.5          3.0           5.8          2.2  virginica     2.166667
My question now is
- How strict should we be on the shape of the transformed DataFrame? Should we do any kind of checking on the index or columns?
This should do a shallow copy if possible.
we never shallow copy
much more trouble than it's worth
only indexes are shallow
I like the unintentional formatting on that :)
I misunderstood pandas's data model, presuming that you could do a shallow copy and then update a column without modifying the original (like a dict). I see now that I was mistaken :(.
I like transform as the name; no-one in favor of mutate? any other possibilities?
only minus against transform is the use in groupby is somewhat different.
I like mutate better because:
- we already have transform on groupby
- we might want mutate as grouped operation -- dplyr has it
If you intend to keep the function pure (as it is currently) then the term mutate might be misleading.
Neither transform nor mutate are very descriptive. It may make sense to seek out a better term, even if it means breaking from tradition.
FWIW, dplyr's mutate is also pure, despite the misleading name.
Augment? Enhance? I'll keep thinking.
On Jan 13, 2015, at 18:32, Matthew Rocklin notifications@github.com wrote:
le sigh
I think we should go a slightly different route here.
I would change the signature to df.update(*args, **kwargs). Then you can detect if it's the 'new' mode or the 'original' mode (you can simply look at the args/kwargs and figure this out unambiguously).
I would vote to effectively deprecate the original .update mode (of course, at the moment this would just show a FutureWarning). This function is really not necessary except in very rare circumstances, and is not well implemented internally.
df.update(A=df.B/df.C) looks like a winner to me. (same idea in that it IS pure, but has a more suggestive name than transform/mutate. Further we can add the same method to groupby).
That sounds pretty clean.
I wanted to say that df.add_column() is descriptive, but ugly name .. but I like update more!
Only, the combination with the existing use seems a bit confusing to users I think (but have to look in more detail)
+1 for update.
I still vote for allowing multiple variables at once :).
OK, another idea: df.assign?
Update is not very useful in pandas, but it is part of the standard mapping API, where it's known as a method that does an in-place operation.
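The in-place concern can be illustrated with a small sketch. It uses toy data (not from this thread) and the assign method as it was eventually merged; the point is only that both dict.update and the existing DataFrame.update mutate their caller and return None, while assign hands back a new frame.

```python
import pandas as pd

# dict.update mutates the mapping in place and returns None.
d = {"A": 1}
ret_dict = d.update({"B": 2})

# DataFrame.update likewise modifies the calling frame in place and returns None.
df = pd.DataFrame({"A": [1, 2, 3]})
ret_df = df.update(pd.DataFrame({"A": [10, 20, 30]}))

# DataFrame.assign returns a new frame and leaves the caller untouched.
df2 = pd.DataFrame({"A": [1, 2, 3]})
new = df2.assign(B=df2.A * 2)
```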
Thanks for the input. My favorite is df.assign, it conveys the meaning well. I could live with update though. I only worry about the cognitive clash with dict.update(), which is inplace.
I'll also allow multiple variables I think. I'm going to do all of the calculations before the assignment so that we don't run into issues with one calculation depending on another, and having the success or failure of the call dependent upon the dict ordering.
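A minimal sketch of the "compute everything first, then assign" approach described above. The helper name assign_two_phase is hypothetical and this is not the PR's code; inserting in sorted name order is an assumption made here so the result does not depend on keyword ordering.

```python
import pandas as pd


def assign_two_phase(df, **kwargs):
    """Hypothetical sketch: evaluate every value first, then insert."""
    # Phase 1: compute all values against the original frame only.
    computed = {
        name: value(df) if callable(value) else value
        for name, value in kwargs.items()
    }
    # Phase 2: insert into a copy, in sorted name order (an assumption),
    # so the column order does not depend on keyword ordering.
    result = df.copy()
    for name in sorted(computed):
        result[name] = computed[name]
    return result


df = pd.DataFrame({"a": [1, 2, 3]})
out = assign_two_phase(df, c=lambda x: x.a * 2, b=lambda x: x.a + 1)

# A callable that refers to a sibling new column fails regardless of
# keyword order, because every callable only sees the original frame.
depends_on_sibling_failed = False
try:
    assign_two_phase(df, b=lambda x: x.a + 1, c=lambda x: x.b * 2)
except AttributeError:
    depends_on_sibling_failed = True
```

With this design a call either always works or always fails, whatever order the keywords arrive in.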
More tests and docs coming soon.
Updated with some docs and handling multiple assigns. I wasn't sure about the best place for the docs so I threw it in basics.rst. I still need to build them to make sure everything looks good.
I am worried about people hitting subtle bugs with assigning multiple columns in one assign since the order won't be preserved. I've tried to document that.
I've got a few more things to clean up and then I'll ping for review.
TomAugspurger changed the title from "API: Add DataFrame.transform method" to "API: Add DataFrame.assign method"
Ok, ready for feedback.
Just a summary,
- I went with assign, but update could work
- keyword arguments only (potentially multiple)
- if the value is callable, it's called on self
- if the value is not callable, it's inserted
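The rules in this summary can be seen in a short example. The two-row frame and the source column are made-up toy data for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

out = df.assign(
    # Callable: called with the frame, result becomes the new column.
    sepal_ratio=lambda x: x.sepal_length / x.sepal_width,
    # Not callable: inserted as-is (a scalar is broadcast to every row).
    source="toy",
)
```

The original df keeps its two columns; out is a new frame with sepal_ratio and source added.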
Nice feature, and some points to be considered:
- inplace option like other functions (though it means we cannot create a column named inplace)
- Should Series also have assign to make a DataFrame, for consistency?
- Better to take care with partial string slicing if the DataFrame has a DatetimeIndex or PeriodIndex, which outputs unexpected results.
- I'm -1 on an inplace option. Overall, I think we're discouraging users from using inplace these days. It also kills what I think is the main use of assign: inside a chain of operations.
- Series could have an assign. I didn't include it yet since that would necessarily involve transforming a Series to a DataFrame, which we have the to_frame method for.
- For now I'm taking the approach that people need to be very careful when using assign. I'm not doing any checking of the results to ensure that your computation hasn't caused a reindexing that creates a bunch of NaNs.
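To make the reindexing caveat concrete, here is a small sketch with toy data: a Series whose index only partly overlaps the frame's index is aligned on insertion, and the rows without a match silently become NaN.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})                  # index 0, 1, 2
shifted = pd.Series([10, 20, 30], index=[1, 2, 3])   # index 1, 2, 3

# Alignment on the index: row 0 has no match, label 3 is dropped.
aligned = df.assign(b=shifted)

# Passing the raw values sidesteps alignment and inserts by position.
positional = df.assign(b=shifted.to_numpy())
```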
I really think this should be called .update. Adding another function is just confusing.
@jreback I'm all for deprecating .update, but I think it is clearer to give them a distinct name, given that the behavior is different and also different from the update method on dicts.
Another name possibly worth considering is .set.
@shoyer we already have way too many update/set/filter etc methods. I think some consolidation is in order. I don't think adding another new method is useful at all.
And you don't actually need to deprecate the current functionality of .update. (e.g. other is a dict-like).
@jreback I agree on the problem but not the solution (in this specific case). Taking the deprecation and eventual removal of the current meaning of update as a given, I would rather have an assign method than an update method, because the latter will always have the confusing association with dict.update (the issue is similar to the name mutate but worse). But this is mostly bike shedding.
@jreback To follow up on your edit, I am pretty opposed to functions that do very different things depending on how you call them, except as a transitional step. That is poor API design :).
@shoyer fair enough, and I DO like .set :)
Travis is running. Will merge in a few hours, assuming no objections.
you don't need the parens here
I call .head on the next line.
@TomAugspurger lgtm
- some minor doc comments
- pls open another issue (for 0.16.0 hopefully!) to add examples / docs for use of .assign with .groupby(), e.g. df.groupby('A').assign(B=lambda x: x.C+1).max(). I think you need the .assign to interact (and make a copy of the internal self.obj of the grouper, but might be a bit trickier).
@jreback I'll need to think about how this interacts with groupby. E.G. say we wanted the group-wise mean of the ratio:
In [7]: gr.apply(lambda x: x.assign(r=x.sepal_width / x.sepal_length).mean())
Out[7]:
            sepal_length  sepal_width  petal_length  petal_width         r
species
setosa             5.006        3.428         1.462        0.246  0.684248
versicolor         5.936        2.770         4.260        1.326  0.467680
virginica          6.588        2.974         5.552        2.026  0.453396
Obviously not really what we want; assign returns the entire frame. But assigning and then grouping works fine:
In [10]: df.assign(r=lambda x: x.sepal_length / x.sepal_width).groupby('species').r.mean()
Out[10]:
species
setosa        1.470188
versicolor    2.160402
virginica     2.230453
Name: r, dtype: float64
I suppose it'd be needed for operations where the assign depends on some group-wise computation. Like df.groupby('species').apply(r_ = lambda x: (x.sepal_length / x.sepal_width * len(x)))?
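A self-contained sketch of both cases on toy data (the four rows below are invented, not the iris values). Assign-then-group is shown first. For the group-dependent case, one possible approach, assumed here rather than proposed in this PR, is to compute the group-wise quantity with groupby(...).transform inside the callable.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 4.0, 6.0, 9.0],
    "sepal_width": [2.5, 2.0, 3.0, 3.0],
})

# Assign first, then group: the group-wise mean of the ratio.
mean_ratio = (
    df.assign(r=lambda x: x.sepal_length / x.sepal_width)
      .groupby("species")["r"]
      .mean()
)

# A column that depends on a group-wise quantity (here the group size),
# broadcast back to the original rows with transform.
scaled = df.assign(
    r_=lambda x: (x.sepal_length / x.sepal_width)
    * x.groupby("species")["sepal_length"].transform("count")
)
```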
Are these imports still needed? You don't seem to use a plotting example anymore.
Not needed now. Forgot to remove them.
I added two small doc remarks, for the rest, bombs away!
Creates a new method for DataFrame, based off dplyr's mutate. Closes pandas-dev#9229
TomAugspurger pushed a commit that referenced this pull request
API: Add DataFrame.assign method
Ok, thanks everyone. We can do follow-ups as needed.
Thanks. I'll follow up tonight.
On Mar 2, 2015, at 03:02, Joris Van den Bossche notifications@github.com wrote:
You can control this size in the image directive of rst (see http://docutils.sourceforge.net/docs/ref/rst/directives.html#image), the other option is just to include a smaller figure in the sources and include it as it is in the docs (this is what is done for the other images included in _static in the docs I think).
Yes, that didn't seem to work for some reason.
I opened up #9619 to fix this.