Add normalization to crosstab by nickeubank · Pull Request #12578 · pandas-dev/pandas (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Thanks for the PR:)
As pointed in #12569, I prefer adding new aggfunc to new option (normalize). For example, isn't agfunc="std" and normalize=True meaningless?
@sinhrks Yes, combining an aggfunc with normalize would be a problem, which is why I added an exception for that.
(Or rather, I added two exceptions -- one for passing aggfunc without values which I think should raise an exception since when that happens aggfunc is ignored silently, and one for passing normalize with values. Together, they ensure no one tries to combine normalize with values or an aggfunc)
Since I can't think of an actual function you could pass to aggfunc that would do row or column normalization (unless I'm mistaken?), I'm not in favor of it being treated like it's just a special aggfunc. Since it's doing something you can't do with an aggfunc, seems like it should stand alone.
@nickeubank, no you simply allow string args to aggfunc='normalize'or whatever, its less confusing (and you can of course always pass an actual function)
what @sinhrks suggests prob makes the most sense
pct, pct_row, pct_col; pretty clear what they mean (and the doc-string will reinforce these). these are convience aggregators.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in any event this is just: table/x.sum(1). we almost never actually want to use .apply
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded tag
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may better to describe normalization is performed for the entire level (not for sub-levels) in MultiIndex case.
I think it is nice if pct_xxx can normalize sum of values if values specified (and all of them are positive).
a = np.array([1, 1, 2, 2])
b = np.array([1, 2, 1, 2])
c = np.array([10, 20, 40, 80])
pd.crosstab(a, b, c, aggfunc=np.sum)
# col_0 1 2
# row_0
# 1 10 20
# 2 40 80
pd.crosstab(a, b, c, aggfunc='pct_cols')
# col_0 1 2
# row_0
# 1 0.2 0.2
# 2 0.8 0.8
I am slightly -1 on adding this functionality to aggfunc:
- First, having both the possibility of passing a func or a string (that not maps directly to an existing function) seems a bit unpythonic, and it is really about two different cases: how to aggregate multiple values, or whether or not to normalize the final result.
- Currently, you can already pass strings to
aggfuncto specify the aggregation function (like in groupby.agg), so the current implementation checking for those specific strings will break that functionality.
Eg something likepd.crosstab(index, columns, values, aggfunc='sum') value_countsalready has thenormalizekeyword, so there is a precedent for this (and I think it is an equivalent functionality, so would be more consistent to use that?)- as @sinhrks just mentions, you may also want to normalize the result of an aggregation (eg
pd.crosstab(index, columns, values, aggfunc='sum', normalize=True). That would not be possible with the current proposed API.
Thanks @jorvisvandenbosse -- I think that's well said. Perhaps we should solicit a few more opinions to see if we can move towards consensus on this?
@jorisvandenbossche soln seems reasonable.
you might want to enable normalize=True -> pct, normalize='index' -> pct_rows, normalize='columns' -> 'pct_cols'
That would be my preference
you might want to enable normalize=True -> pct, normalize='index' -> pct_rows, normalize='columns' -> 'pct_cols'
+1
My opinion is based on the understanding that crosstab is similar to pivot_table. Thus:
- Because
pivot_tableacceptsstr, I think it is OK to add otherstroptions toaggfunc. aggfunc="pct", "pct_cols", "pct_rows"outputs normalize counts if values is not specified, and normalize values if values is specified.- I don't distinguish aggregate/normalize because
aggfunccan be defined fornormalization. (meanis(category sum)/(category counts),pctis(category sum)/(total counts))
I don't have strong opposition against @jorisvandenbossche 's option. One concern is API gets complex more than required. It's less likely to normalize other values than count and sum.
Because pivot_table accepts str, I think it is OK to add other str options to aggfunc
That's what I said above as well. crosstab's aggfunc also accepts strings like groupby().agg() accepts strings. But there it are strings that match an existing function/method. This is not the case with 'pct_cols/rows'.
One concern is API gets complex more than required.
Indeed, in general I am very reluctant in adding new keyword arguments. But in this case, I think it makes it more complex to explain what aggfunc does in case of combining it. It seems cleaner to separate the two notions
It's less likely to normalize other values than count and sum.
That's indeed true.
Sounds like we're agreed aggfunc can take strings (I'll remove that exception!)
I think my main two thoughts are:
- There is no function that could do row-normalization. Thus, since it's functionally doing something different than any explicit
aggfunccould do, it should be it's own option, even though I agree with @sinhrks there are few situations when you'd want to normalize the output of a crosstab generated with a customaggfunc. Indeed, other programs like Stata that do something similar (accept aggregation functions for table) also split out normalization into its own syntax. - As @jorisvandenbossche has pointed out,
normalizeis already a keyword in a core function (value_counts), so I think it's pretty API consistent to have.
Regarding complexity, I think of crosstab as a tool that may be used to generate analysis outputs to potentially put in papers (that's my interest at least). Given that, I think making it as flexible and powerful is highly desirable, even at the cost of an extra key word.
so we are talking about doing these ops:
In [20]: r = pd.crosstab(df.a,df.b)
In [21]: r
Out[21]:
b 3 4
a
1 1 0
2 1 3
In [22]: r/r.sum()
Out[22]:
b 3 4
a
1 0.5 0.0
2 0.5 1.0
In [23]: r/r.sum().sum()
Out[23]:
b 3 4
a
1 0.2 0.0
2 0.2 0.6
In [24]: r.div(r.sum(1),axis='index')
Out[24]:
b 3 4
a
1 1.00 0.00
2 0.25 0.75
This kind of feels like a post-processing step rather than a result of a single operation.
You can indeed rather simple do this after crosstab (although the row-wise case is not that trivial anymore), but still I would be OK to add a convenience keyword for this.
yeah, just thinking if we have a kw (which is fine), then should explain what it is actually doing in the doc-string. As not completely obvious.
another option, is maybe to actually have a .normalize(...) function on frames.
pd.crosstab(df.a, df.b).normalize('columns')
Yes, it's post-processing. Indeed, that's how it's implemented.
The one complication with a stand-alone normalize is that it can't be easily designed to deal with the margin columns -- if you do crosstab(df.a, df.b, margins=True) then you have to do a special normalization procedure to deal with the fact that the bottom row is a sum of all the rows above. A method that just divides all values by the total sum, you end up off by a factor of 1/2. I really thing there's value in having this as an integrated method.
@nickeubank well one could argue the margins itself should be an external method as well! e.g. its kind of a 'footer' concept almost like a collection of DataFrame + footer Series + footer Series (rhs).
So that's really another abstraction and just being shoved into the current one.
@jreback Think addressed all your concerns. Only think missing is that I wasn't sure how to write some of the requested tests for dropna=False because I'm a little confused about expected behavior (see new comment on #10772).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something this, just do in a list, as the tests are more clear
for arg in ['index', True, 'columns']:
result = ....
tm.assert_frame.....
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this too
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you add a mini-example here (same one)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback. Thanks. I'm going to be traveling away from my computer for a
little over two weeks. I'm not sure what the timing is for 0.18.1. I'm
happy to make changes when I get back, but won't be able to do anything
till then.
On Mon, Apr 4, 2016 at 10:59 AM Jeff Reback notifications@github.com
wrote:
In doc/source/whatsnew/v0.18.1.txt
#12578 (comment):@@ -18,8 +18,7 @@ Highlights include:
New features- - +- ``pd.crosstab()`` has gained ``normalize`` argument for normalizing frequency tables (:issue:`12569`). Examples in updated docs :ref:`here <reshaping.crosstabulations>`.why don't you add a mini-example here (same one)
—
You are receiving this because you were mentioned.Reply to this email directly or view it on GitHub
https://github.com/pydata/pandas/pull/12578/files/39718b8f4f398d9cdf6c38583453c804675a5e6a#r58418709
@nickeubank when you are back, pls rebase. This looked pretty good.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very small detail, but I think there is one space too much at the beginning of this line
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never used, correct?