API: Add DataFrame.assign method by TomAugspurger · Pull Request #9239 · pandas-dev/pandas (original) (raw)


@TomAugspurger

Closes #9229

signature: DataFrame.transform(**kwargs)

In [7]: df.head()
Out[7]:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

In [8]: (df.query('species == "virginica"')
           .transform(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
           .head())
Out[8]:
     sepal_length  sepal_width  petal_length  petal_width    species  sepal_ratio
100           6.3          3.3           6.0          2.5  virginica     1.909091
101           5.8          2.7           5.1          1.9  virginica     2.148148
102           7.1          3.0           5.9          2.1  virginica     2.366667
103           6.3          2.9           5.6          1.8  virginica     2.172414
104           6.5          3.0           5.8          2.2  virginica     2.166667
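For anyone who wants to try the pattern, here is a minimal sketch using the method's final name, `assign` (the name in the PR title). The small frame is hand-built from three of the virginica rows in the output and is only a stand-in for the full iris data.

```python
import pandas as pd

# Three rows copied from the virginica output; not the full dataset.
df = pd.DataFrame({
    "sepal_length": [6.3, 5.8, 7.1],
    "sepal_width": [3.3, 2.7, 3.0],
    "species": ["virginica"] * 3,
})

# A callable receives the frame being assigned to, so it can be used
# mid-chain on the result of query() without naming an intermediate.
result = (df.query('species == "virginica"')
            .assign(sepal_ratio=lambda x: x.sepal_length / x.sepal_width))
print(result)
```

The lambda is what makes the chain work: the filtered frame has no name at that point, so the column expression has to be deferred until `assign` passes the frame in.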

My question now is

shoyer


This should do a shallow copy if possible.


we never shallow copy

much more trouble than it's worth
only indexes are shallow


I like the unintentional formatting on that :)


I misunderstood pandas's data model, presuming that you could do a shallow copy and then update a column without modifying the original (like a dict). I see now that I was mistaken :(.

@jreback

I like transform as the name; no-one in favor of mutate? any other possibilities?

The only minus against transform is that its use in groupby is somewhat different.

@shoyer

I like mutate better because:

  1. we already have transform on groupby
  2. we might want mutate as grouped operation -- dplyr has it

@mrocklin

If you intend to keep the function pure (as it is currently) then the term mutate might be misleading.

Neither transform nor mutate are very descriptive. It may make sense to seek out a better term, even if it means breaking from tradition.

@shoyer

FWIW, dplyr's mutate is also pure, despite the misleading name.
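To make "pure" concrete, here is a small sketch using the method as it was eventually named, `assign`. The frame and column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"B": [1.0, 2.0], "C": [4.0, 8.0]})
out = df.assign(A=df.B / df.C)

# The original frame is left untouched; the new column exists only on `out`.
print(list(df.columns))
print(list(out.columns))
```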

@mrocklin

@TomAugspurger

Augment? Enhance? I'll keep thinking.

On Jan 13, 2015, at 18:32, Matthew Rocklin notifications@github.com wrote:

le sigh



@jreback

I think we should go a slightly different route here.

I would change the signature of df.update(*args, **kwargs). Then you can detect if it's the 'new' mode or the 'original' mode (you can simply look at the args/kwargs and figure this out unambiguously).

I would vote to effectively deprecate the original .update mode (of course ATM this would just show a FutureWarning). This function is really not necessary except in very rare circumstances, and is not well implemented internally.

df.update(A=df.B/df.C) looks like a winner to me. (same idea in that it IS pure, but has a more suggestive name than transform/mutate. Further we can add the same method to groupby).

@mrocklin

That sounds pretty clean.

@jorisvandenbossche

I wanted to say that df.add_column() is descriptive, but ugly name .. but I like update more!
Only, the combination with the existing use seems a bit confusing to users I think (but have to look in more detail)

@shoyer

+1 for update.

I still vote for allowing multiple variables at once :).

@shoyer

OK, another idea: df.assign?

Update is not very useful in pandas, but it is part of the standard mapping API, where it's known as a method that does an in-place operation.

@TomAugspurger

Thanks for the input. My favorite is df.assign, it conveys the meaning well. I could live with update though. I only worry about the cognitive clash with dict.update(), which is inplace.
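The clash is easy to see side by side. This is a small sketch with made-up data: `dict.update` and the existing `DataFrame.update` both modify their target and return `None`, while `assign` hands back a new frame.

```python
import pandas as pd

d = {"a": 1}
ret = d.update(b=2)           # mutates d in place, returns None
print(ret, d)

df = pd.DataFrame({"A": [1.0, 2.0]})
ret_df = df.update(pd.DataFrame({"A": [10.0, 20.0]}))  # also in place
print(ret_df, df.A.tolist())

new = df.assign(B=df.A * 2)   # returns a new frame; df gains no column B
print(list(df.columns), list(new.columns))
```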

I'll also allow multiple variables I think. I'm going to do all of the calculations before the assignment so that we don't run into issues with one calculation depending on another, and having the success or failure of the call dependent upon the dict ordering.
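A rough sketch of the "compute everything first, then assign" idea follows. `assign_sketch` is a hypothetical helper written for this illustration; it is not the code in this PR and may differ from what any released pandas version does.

```python
import pandas as pd

def assign_sketch(df, **kwargs):
    """Hypothetical helper illustrating compute-then-assign."""
    # Evaluate every value against the ORIGINAL frame before touching it,
    # so no result can depend on the order the keywords arrive in.
    computed = {name: (value(df) if callable(value) else value)
                for name, value in kwargs.items()}
    out = df.copy()
    for name, value in computed.items():
        out[name] = value
    return out

df = pd.DataFrame({"A": [1, 2, 3]})
out = assign_sketch(df, B=lambda x: x.A * 2, C=lambda x: x.A + 1)
print(out)

# A callable that refers to a sibling column fails no matter the order,
# because B does not exist on the frame the callables are evaluated on.
try:
    assign_sketch(df, B=lambda x: x.A * 2, C=lambda x: x.B + 1)
except AttributeError as exc:
    print("C cannot see B:", exc)
```

The trade-off is that a dependent column needs a second, chained call, but the outcome of a single call never hinges on keyword ordering.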

More tests and docs coming soon.

@TomAugspurger

Updated with some docs and handling multiple assigns. I wasn't sure about the best place for the docs so I threw it in basics.rst. I still need to build them to make sure everything looks good.

I am worried about people hitting subtle bugs with assigning multiple columns in one assign since the order won't be preserved. I've tried to document that.

I've got a few more things to clean up and then I'll ping for review.

@TomAugspurger changed the title from "API: Add DataFrame.transform method" to "API: Add DataFrame.assign method"

Jan 18, 2015

@TomAugspurger

Ok, ready for feedback.

Just a summary,

@sinhrks

Nice feature, and some points to be considered:

@TomAugspurger

@jreback

I really think this should be called .update. Adding another function is just confusing.

@shoyer

@jreback I'm all for deprecating .update, but I think it is clearer to give them a distinct name, given that the behavior is different and also different from the update method on dicts.

Another name possibly worth considering is .set.

@jreback

@shoyer we already have way too many update/set/filter etc methods. I think some consolidation is in order. I don't think adding another new method is useful at all.

And you don't actually need to deprecate the current functionality of .update (e.g. when other is dict-like).

@shoyer

@jreback I agree on the problem but not the solution (in this specific case). Taking the deprecation and eventual removal of the current meaning of update as a given, I would rather have an assign method than an update method, because the latter will always have the confusing association with dict.update (the issue is similar to the name mutate, but worse). But this is mostly bikeshedding.

@shoyer

@jreback To follow up on your edit, I am pretty opposed to functions that do very different things depending on how you call them, except as a transitional step. That is poor API design :).

@jreback

@shoyer fair enough, and I DO like .set :)

@TomAugspurger

Travis is running. Will merge in a few hours, assuming no objections.

jreback


you don't need the parens here


I call .head on the next line.
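For context on the parentheses: in Python, wrapping a method chain in parentheses is what allows it to continue onto the next line. A minimal sketch with an invented frame:

```python
import pandas as pd

df = pd.DataFrame({"A": range(10)})

# The outer parentheses let the chain continue on a new line;
# without them, the line starting with ".head()" would be a syntax error.
head = (df.assign(B=lambda x: x.A * 2)
          .head())
print(head)
```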

@jreback

@TomAugspurger lgtm

@TomAugspurger

@jreback I'll need to think about how this interacts with groupby. E.G. say we wanted the group-wise mean of the ratio:

In [7]: gr.apply(lambda x: x.assign(r=x.sepal_width / x.sepal_length).mean())
Out[7]: 
            sepal_length  sepal_width  petal_length  petal_width         r
species                                                                   
setosa             5.006        3.428         1.462        0.246  0.684248
versicolor         5.936        2.770         4.260        1.326  0.467680
virginica          6.588        2.974         5.552        2.026  0.453396

Obviously not really what we want; assign returns the entire frame. But assigning and then grouping works fine:

In [10]: df.assign(r=lambda x: x.sepal_length / x.sepal_width).groupby('species').r.mean()
Out[10]:
species
setosa        1.470188
versicolor    2.160402
virginica     2.230453
Name: r, dtype: float64

I suppose it'd be needed for operations where the assign depends on some group-wise computation. Like df.groupby('species').apply(r_ = lambda x: (x.sepal_length / x.sepal_width * len(x)))?
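One way to get that kind of result without a grouped `assign` is to broadcast the group-wise quantity back to the rows with `groupby(...).transform` inside a plain `assign`. This is a sketch on invented data, not something proposed in the thread.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal_length": [5.0, 4.0, 6.0],
    "sepal_width": [2.0, 2.0, 3.0],
})

# transform("size") returns each row's group size, aligned to the original
# index, so it plays the role of len(x) in the per-group lambda.
out = df.assign(
    r_=lambda x: x.sepal_length / x.sepal_width
                 * x.groupby("species").sepal_length.transform("size")
)
print(out)
```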

@TomAugspurger

jorisvandenbossche


Are these imports still needed? You don't seem to use a plotting example anymore.


Not needed now. Forgot to remove them.

@jorisvandenbossche

I added two small doc remarks, for the rest, bombs away!

Creates a new method for DataFrame, based off dplyr's mutate. Closes pandas-dev#9229

TomAugspurger pushed a commit that referenced this pull request

Mar 1, 2015

API: Add DataFrame.assign method

@TomAugspurger

Ok, thanks everyone. We can do follow-ups as needed.

@shoyer

@jreback

@jorisvandenbossche

@TomAugspurger

Thanks. I'll follow up tonight.

On Mar 2, 2015, at 03:02, Joris Van den Bossche notifications@github.com wrote:

You can control this size in the image directive of rst (see http://docutils.sourceforge.net/docs/ref/rst/directives.html#image), the other option is just to include a smaller figure in the sources and include it as it is in the docs (this is what is done for the other images included in _static in the docs I think).



@jreback

@TomAugspurger

@jreback

@shoyer

Yes, that didn't seem to work for some reason.

@TomAugspurger

I opened up #9619 to fix this.
