API: Add DataFrame.assign method by TomAugspurger · Pull Request #9239 · pandas-dev/pandas (original) (raw)

Closes #9229

signature: DataFrame.transform(**kwargs)

- the keyword is the name of the new column (existing columns are overwritten if there's a name conflict, as in dplyr)
- the value is either
  - called on self if it's callable. The callable should be a function of 1 argument, the DataFrame being called on.
  - inserted otherwise

In [7]: df.head()
Out[7]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

In [8]: (df.query('species == "virginica"')
           .transform(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
           .head())
Out[8]:
     sepal_length  sepal_width  petal_length  petal_width    species  sepal_ratio
100           6.3          3.3           6.0          2.5  virginica     1.909091
101           5.8          2.7           5.1          1.9  virginica     2.148148
102           7.1          3.0           5.9          2.1  virginica     2.366667
103           6.3          2.9           5.6          1.8  virginica     2.172414
104           6.5          3.0           5.8          2.2  virginica     2.166667
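
For readers without the iris data at hand, here is a minimal sketch of the semantics described above. It uses the name the method was eventually merged under (`assign`, per the PR title) and a small made-up frame; the column values and the `source` column are illustrative assumptions, not part of the PR.

```python
import pandas as pd

# Small made-up frame standing in for the iris data (values are illustrative).
df = pd.DataFrame({
    "sepal_length": [6.0, 5.0, 7.5],
    "sepal_width": [3.0, 2.5, 3.0],
    "species": ["virginica", "setosa", "virginica"],
})

result = (
    df.query('species == "virginica"')
      .assign(
          # Callable: called with the frame being assigned to.
          sepal_ratio=lambda x: x.sepal_length / x.sepal_width,
          # Non-callable: inserted as-is (a scalar is broadcast to every row).
          source="sketch",
      )
)

print(result)
print(df.columns.tolist())  # the original frame gains no new columns
```

The callable form is what makes the method usable mid-chain: the lambda receives the already-filtered frame, which has no name of its own.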

My question now is

How strict should we be on the shape of the transformed DataFrame? Should we do any kind of checking on the index or columns?

shoyer

This should do a shallow copy if possible.

we never shallow copy

much more trouble than it's worth
only indexes are shallow

I like the unintentional formatting on that :)

I misunderstood pandas's data model, presuming that you could do a shallow copy and then update a column without modifying the original (like a dict). I see now that I was mistaken :(.

I like transform as the name; no-one in favor of mutate? any other possibilities?

The only minus against transform is that its use in groupby is somewhat different.

I like mutate better because:

- we already have transform on groupby
- we might want mutate as a grouped operation -- dplyr has it

If you intend to keep the function pure (as it is currently) then the term mutate might be misleading.

Neither transform nor mutate are very descriptive. It may make sense to seek out a better term, even if it means breaking from tradition.

FWIW, dplyr's mutate is also pure, despite the misleading name.

Augment? Enhance? I'll keep thinking.

On Jan 13, 2015, at 18:32, Matthew Rocklin notifications@github.com wrote:

le sigh


I think we should go a slightly different route here.

I would change the signature of df.update(*args, **kwargs). Then you can detect whether it's the 'new' mode or the 'original' mode (you can simply look at the args/kwargs and figure this out unambiguously).

I would vote to effectively deprecate the original .update mode (of course ATM this would just show a FutureWarning). This function is really not necessary except in very rare circumstances, and is not well implemented internally.

df.update(A=df.B/df.C) looks like a winner to me. (same idea in that it IS pure, but has a more suggestive name than transform/mutate. Further we can add the same method to groupby).

That sounds pretty clean.

I wanted to say that df.add_column() is a descriptive but ugly name, but I like update more!
Only, combining it with the existing use of update seems a bit confusing for users, I think (but I have to look at it in more detail).

+1 for update.

I still vote for allowing multiple variables at once :).

OK, another idea: df.assign?

Update is not very useful in pandas, but it is part of the standard mapping API, where it's known as a method that does an in-place operation.

Thanks for the input. My favorite is df.assign, it conveys the meaning well. I could live with update though. I only worry about the cognitive clash with dict.update(), which is inplace.
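
To make the "cognitive clash" concrete, the sketch below contrasts the in-place behavior of `dict.update` and of the existing `DataFrame.update` with `assign`, which returns a new frame. The frames and values are made up for illustration.

```python
import pandas as pd

# dict.update mutates the dict and returns None.
d = {"a": 1}
ret = d.update(b=2)

# The existing DataFrame.update also modifies the frame in place.
df = pd.DataFrame({"A": [1.0, 2.0]})
df.update(pd.DataFrame({"A": [10.0, 20.0]}))

# assign returns a new frame and leaves the caller untouched.
new = df.assign(B=lambda x: x.A * 2)

print(ret, d)
print(df)
print(new)
```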

I'll also allow multiple variables I think. I'm going to do all of the calculations before the assignment so that we don't run into issues with one calculation depending on another, and having the success or failure of the call dependent upon the dict ordering.
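
The compute-everything-first approach described here can be sketched roughly as follows. `assign_two_phase` is a hypothetical helper name, and inserting the results in sorted key order is an assumption made here to keep the column order deterministic; neither is claimed to be the PR's actual code.

```python
import pandas as pd


def assign_two_phase(df, **kwargs):
    """Hypothetical sketch: evaluate all values first, then insert them."""
    data = df.copy()

    # Phase 1: compute every value against the frame as it was passed in,
    # so no new column can depend on another new column.
    results = {}
    for name, value in kwargs.items():
        results[name] = value(data) if callable(value) else value

    # Phase 2: insert in sorted key order (an assumption of this sketch),
    # so the outcome does not depend on keyword ordering.
    for name, value in sorted(results.items()):
        data[name] = value
    return data


df = pd.DataFrame({"a": [1, 2]})
out = assign_two_phase(df, c=lambda x: x.a + 1, b=10)
print(out.columns.tolist())

# A new column that refers to another new column always fails,
# regardless of the order the keywords are given in.
try:
    assign_two_phase(df, b=lambda x: x.a * 2, c=lambda x: x.b + 1)
    depends_failed = False
except AttributeError:
    depends_failed = True
print(depends_failed)
```

The design trade-off is that a dependent calculation fails consistently instead of succeeding or failing by accident of ordering.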

More tests and docs coming soon.

Updated with some docs and handling multiple assigns. I wasn't sure about the best place for the docs so I threw it in basics.rst. I still need to build them to make sure everything looks good.

I am worried about people hitting subtle bugs with assigning multiple columns in one assign since the order won't be preserved. I've tried to document that.

I've got a few more things to clean up and then I'll ping for review.

TomAugspurger changed the title ~~API: Add DataFrame.transform method~~ API: Add DataFrame.assign method

Jan 18, 2015

Ok, ready for feedback.

Just a summary,

- I went with assign, but update could work
- keyword arguments only (potentially multiple)
- if the value is callable, it's called on self
- if the value is not callable, it's inserted

Nice feature, and some points to be considered:

- An inplace option like other functions have (though as a result we could not create a column named inplace).
- Should Series also have assign, producing a DataFrame, for consistency?
- It would be better to handle partial string slicing when the DataFrame has a DatetimeIndex or PeriodIndex, which currently outputs unexpected results.

- I'm -1 on an inplace option. Overall, I think we're discouraging users from using inplace these days. It also kills what I think is the main use of assign: inside a chain of operations.
- Series could have an assign. I didn't include it yet since that would necessarily involve transforming a Series to a DataFrame, which we have the to_frame method for.
- For now I'm taking the approach that people need to be very careful when using assign. I'm not doing any checking of the results to ensure that your computation hasn't caused a reindexing that creates a bunch of NaNs.
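
The reindexing pitfall mentioned in the last point can be reproduced with a small made-up example: a Series whose index only partly overlaps the frame's index is aligned on insertion, so unmatched labels silently become NaN.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])

# A Series indexed 1..3: label 0 is missing, label 3 has no row in df.
other = pd.Series([10, 20, 30], index=[1, 2, 3])

out = df.assign(b=other)

# b is aligned to df's index: label 0 gets NaN and label 3 is dropped.
print(out)
```

Nothing raises here, which is why the caller has to check the index themselves.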

I really think this should be called .update. Adding another function is just confusing.

@jreback I'm all for deprecating .update, but I think it is clearer to give them a distinct name, given that the behavior is different and also different from the update method on dicts.

Another name possibly worth considering is .set.

@shoyer we already have way too many update/set/filter etc methods. I think some consolidation is in order. I don't think adding another new method is useful at all.

And you don't actually need to deprecate the current functionality of .update (e.g. where other is dict-like).

@jreback I agree on the problem but not the solution (in this specific case). Taking the deprecation and eventual removal of the current meaning of update as a given, I would rather have an assign method than an update method, because the latter will always have the confusing association with dict.update (the issue is similar to the name mutate, but worse). But this is mostly bikeshedding.

@jreback To follow up on your edit, I am pretty opposed to functions that do very different things depending on how you call them, except as a transitional step. That is poor API design :).

@shoyer fair enough, and I DO like .set :)

Travis is running. Will merge in a few hours, assuming no objections.

jreback

you don't need the parens here

I call .head on the next line.

@TomAugspurger lgtm

- some minor doc comments
- pls open another issue (for 0.16.0 hopefully!) to add examples / docs for use of .assign with .groupby(), e.g. df.groupby('A').assign(B=lambda x: x.C + 1).max(). I think you need the .assign to interact with the grouper (and make a copy of its internal self.obj), but that might be a bit trickier.

@jreback I'll need to think about how this interacts with groupby. E.G. say we wanted the group-wise mean of the ratio:

In [7]: gr.apply(lambda x: x.assign(r=x.sepal_width / x.sepal_length).mean())
Out[7]: 
            sepal_length  sepal_width  petal_length  petal_width         r
species                                                                   
setosa             5.006        3.428         1.462        0.246  0.684248
versicolor         5.936        2.770         4.260        1.326  0.467680
virginica          6.588        2.974         5.552        2.026  0.453396

Obviously not really what we want; assign returns the entire frame. But assigning and then grouping works fine:

In [10]: df.assign(r=lambda x: x.sepal_length / x.sepal_width).groupby('species').r.mean()
Out[10]:
species
setosa        1.470188
versicolor    2.160402
virginica     2.230453
Name: r, dtype: float64
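
The same assign-then-group pattern on a small made-up frame (the species labels and measurements below are illustrative, not the iris values):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "sepal_length": [4.0, 6.0, 9.0, 3.0],
    "sepal_width": [2.0, 2.0, 3.0, 3.0],
})

# Add the ratio column first, then take the group-wise mean of just that column.
means = (
    df.assign(r=lambda x: x.sepal_length / x.sepal_width)
      .groupby("species")
      .r.mean()
)
print(means)
```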

I suppose it'd be needed for operations where the assign depends on some group-wise computation. Like df.groupby('species').apply(r_ = lambda x: (x.sepal_length / x.sepal_width * len(x)))?
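
Without a groupby-level assign, one possible way to express a column that depends on a group-wise quantity is to compute that quantity with `groupby(...).transform` inside the callable. This is an alternative spelling offered as a sketch on made-up data, not the API discussed in the thread.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "a", "b"],
    "sepal_length": [4.0, 6.0, 9.0],
    "sepal_width": [2.0, 2.0, 3.0],
})

out = df.assign(
    # transform("count") broadcasts each group's row count back onto
    # the original index, so it lines up with the ratio row by row.
    r_=lambda x: x.sepal_length / x.sepal_width
    * x.groupby("species").sepal_length.transform("count")
)
print(out)
```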

jorisvandenbossche

Are these imports still needed? You don't seem to use a plotting example anymore.

Not needed now. Forgot to remove them.

I added two small doc remarks, for the rest, bombs away!

Creates a new method for DataFrame, based off dplyr's mutate. Closes pandas-dev#9229

TomAugspurger pushed a commit that referenced this pull request

Mar 1, 2015

API: Add DataFrame.assign method

Ok, thanks everyone. We can do follow-ups as needed.

Thanks. I'll follow up tonight.

On Mar 2, 2015, at 03:02, Joris Van den Bossche notifications@github.com wrote:

You can control this size in the image directive of rst (see http://docutils.sourceforge.net/docs/ref/rst/directives.html#image), the other option is just to include a smaller figure in the sources and include it as it is in the docs (this is what is done for the other images included in _static in the docs I think).


Yes, that didn't seem to work for some reason.

I opened up #9619 to fix this.
