Add normalization to crosstab by nickeubank · Pull Request #12578 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

@nickeubank

@sinhrks

Thanks for the PR:)

As pointed in #12569, I prefer adding new aggfunc to new option (normalize). For example, isn't agfunc="std" and normalize=True meaningless?

@nickeubank

@sinhrks Yes, combining an aggfunc with normalize would be a problem, which is why I added an exception for that.

(Or rather, I added two exceptions -- one for passing aggfunc without values which I think should raise an exception since when that happens aggfunc is ignored silently, and one for passing normalize with values. Together, they ensure no one tries to combine normalize with values or an aggfunc)

Since I can't think of an actual function you could pass to aggfunc that would do row or column normalization (unless I'm mistaken?), I'm not in favor of it being treated like it's just a special aggfunc. Since it's doing something you can't do with an aggfunc, seems like it should stand alone.

@jreback

@nickeubank, no you simply allow string args to aggfunc='normalize'or whatever, its less confusing (and you can of course always pass an actual function)

what @sinhrks suggests prob makes the most sense

pct, pct_row, pct_col; pretty clear what they mean (and the doc-string will reinforce these). these are convience aggregators.

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in any event this is just: table/x.sum(1). we almost never actually want to use .apply

@nickeubank

@nickeubank

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded tag

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may better to describe normalization is performed for the entire level (not for sub-levels) in MultiIndex case.

@sinhrks

I think it is nice if pct_xxx can normalize sum of values if values specified (and all of them are positive).

a = np.array([1, 1, 2, 2])
b = np.array([1, 2, 1, 2])
c = np.array([10, 20, 40, 80])
pd.crosstab(a, b, c, aggfunc=np.sum)
# col_0   1   2
# row_0        
# 1      10  20
# 2      40  80

pd.crosstab(a, b, c, aggfunc='pct_cols')
# col_0   1    2
# row_0        
# 1     0.2  0.2
# 2     0.8  0.8

@jorisvandenbossche

I am slightly -1 on adding this functionality to aggfunc:

@nickeubank

Thanks @jorvisvandenbosse -- I think that's well said. Perhaps we should solicit a few more opinions to see if we can move towards consensus on this?

@jreback

@jorisvandenbossche soln seems reasonable.

you might want to enable normalize=True -> pct, normalize='index' -> pct_rows, normalize='columns' -> 'pct_cols'

@nickeubank

That would be my preference

@jorisvandenbossche

you might want to enable normalize=True -> pct, normalize='index' -> pct_rows, normalize='columns' -> 'pct_cols'

+1

@sinhrks

My opinion is based on the understanding that crosstab is similar to pivot_table. Thus:

I don't have strong opposition against @jorisvandenbossche 's option. One concern is API gets complex more than required. It's less likely to normalize other values than count and sum.

@jorisvandenbossche

Because pivot_table accepts str, I think it is OK to add other str options to aggfunc

That's what I said above as well. crosstab's aggfunc also accepts strings like groupby().agg() accepts strings. But there it are strings that match an existing function/method. This is not the case with 'pct_cols/rows'.

One concern is API gets complex more than required.

Indeed, in general I am very reluctant in adding new keyword arguments. But in this case, I think it makes it more complex to explain what aggfunc does in case of combining it. It seems cleaner to separate the two notions

It's less likely to normalize other values than count and sum.

That's indeed true.

@nickeubank

Sounds like we're agreed aggfunc can take strings (I'll remove that exception!)

I think my main two thoughts are:

Regarding complexity, I think of crosstab as a tool that may be used to generate analysis outputs to potentially put in papers (that's my interest at least). Given that, I think making it as flexible and powerful is highly desirable, even at the cost of an extra key word.

@jreback

so we are talking about doing these ops:

In [20]: r = pd.crosstab(df.a,df.b)

In [21]: r
Out[21]: 
b  3  4
a      
1  1  0
2  1  3

In [22]: r/r.sum()
Out[22]: 
b    3    4
a          
1  0.5  0.0
2  0.5  1.0

In [23]: r/r.sum().sum()
Out[23]: 
b    3    4
a          
1  0.2  0.0
2  0.2  0.6

In [24]: r.div(r.sum(1),axis='index')
Out[24]: 
b     3     4
a            
1  1.00  0.00
2  0.25  0.75

This kind of feels like a post-processing step rather than a result of a single operation.

@jorisvandenbossche

You can indeed rather simple do this after crosstab (although the row-wise case is not that trivial anymore), but still I would be OK to add a convenience keyword for this.

@jreback

yeah, just thinking if we have a kw (which is fine), then should explain what it is actually doing in the doc-string. As not completely obvious.

another option, is maybe to actually have a .normalize(...) function on frames.

pd.crosstab(df.a, df.b).normalize('columns')

@nickeubank

Yes, it's post-processing. Indeed, that's how it's implemented.

The one complication with a stand-alone normalize is that it can't be easily designed to deal with the margin columns -- if you do crosstab(df.a, df.b, margins=True) then you have to do a special normalization procedure to deal with the fact that the bottom row is a sum of all the rows above. A method that just divides all values by the total sum, you end up off by a factor of 1/2. I really thing there's value in having this as an integrated method.

@jreback

@nickeubank well one could argue the margins itself should be an external method as well! e.g. its kind of a 'footer' concept almost like a collection of DataFrame + footer Series + footer Series (rhs).

So that's really another abstraction and just being shoved into the current one.

@nickeubank

@jreback Think addressed all your concerns. Only think missing is that I wasn't sure how to write some of the requested tests for dropna=False because I'm a little confused about expected behavior (see new comment on #10772).

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something this, just do in a list, as the tests are more clear

for arg in ['index', True, 'columns']:
     result = ....
     tm.assert_frame.....

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this too

@jreback

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you add a mini-example here (same one)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback. Thanks. I'm going to be traveling away from my computer for a
little over two weeks. I'm not sure what the timing is for 0.18.1. I'm
happy to make changes when I get back, but won't be able to do anything
till then.
On Mon, Apr 4, 2016 at 10:59 AM Jeff Reback notifications@github.com
wrote:

In doc/source/whatsnew/v0.18.1.txt
#12578 (comment):

@@ -18,8 +18,7 @@ Highlights include:
New features


-
-
+- ``pd.crosstab()`` has gained ``normalize`` argument for normalizing frequency tables (:issue:`12569`). Examples in updated docs :ref:`here <reshaping.crosstabulations>`.

why don't you add a mini-example here (same one)


You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
https://github.com/pydata/pandas/pull/12578/files/39718b8f4f398d9cdf6c38583453c804675a5e6a#r58418709

@jreback

@nickeubank when you are back, pls rebase. This looked pretty good.

@nickeubank

@nickeubank

jorisvandenbossche

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very small detail, but I think there is one space too much at the beginning of this line

@nickeubank

sinhrks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never used, correct?

@nickeubank

@jreback

Labels