ENH: Add pipe method to GroupBy (fixes #10353) by ghl3 · Pull Request #10466 · pandas-dev/pandas

Okay, sure. I'll describe it here before coding it up to ensure we're all on the same page.

Two examples come to mind. The first computes the Gini impurity ("gini_impurity"), a measure commonly used in statistics and classification (many of the examples I have in mind relate to classification, since it involves discrete groups, and I find DataFrameGroupBy objects to be useful data structures for such problems). The second is a more general example of printing or plotting a summary of a grouped DataFrame:

import pandas as pd

df = pd.DataFrame({'class': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'weight': [1, 2, 3, 4, 3, 6, 7, 4]})

def gini_impurity(dfgb):
    # Total weight across all groups
    tot = dfgb.weight.sum().sum()

    # Sum of squared weight fractions, one term per group
    p2 = 0.0
    for _, group in dfgb:
        p2 += (float(group.weight.sum()) / tot) ** 2

    return 1.0 - p2

# Example A
df.groupby('class').pipe(gini_impurity)

def save_description_of_groups(dfgb, path):
    with open(path, 'w+') as f:
        for name, group in dfgb:
            f.write("{}:\n".format(name))
            f.write(group.describe().to_string())
            f.write("\n\n")

# Example B
df.groupby('class').pipe(save_description_of_groups, 'description.txt')

My initial reason for not using these was that the first is a bit domain-specific and the second complicates things by doing file I/O. Something simpler and more mathematical initially seemed more appropriate for general documentation, but wanting something better motivated and more "real-world" certainly makes sense to me.

Would cleaned-up versions of either of the above be better, or are they at least on the right track?
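
For concreteness, here is a rough sketch of what cleaned-up versions might look like. It assumes a pandas build in which GroupBy.pipe is available (the method this PR proposes) and groups on the 'class' column. The helper `write_group_descriptions` is a hypothetical variation on Example B that takes any writable file-like object instead of a path, so the sketch runs without touching disk.

```python
import io

import pandas as pd

df = pd.DataFrame({'class': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'weight': [1, 2, 3, 4, 3, 6, 7, 4]})


def gini_impurity(dfgb):
    """Return 1 - sum(p_i ** 2), where p_i is group i's share of total weight."""
    group_totals = dfgb['weight'].sum()               # one total per group
    proportions = group_totals / group_totals.sum()   # float shares summing to 1
    return 1.0 - (proportions ** 2).sum()


def write_group_descriptions(dfgb, buf):
    """Write describe() output for each group to a file-like object (hypothetical helper)."""
    for name, group in dfgb:
        buf.write("{}:\n".format(name))
        buf.write(group.describe().to_string())
        buf.write("\n\n")


grouped = df.groupby('class')

# Example A: pipe passes the GroupBy object as the first argument
impurity = grouped.pipe(gini_impurity)

# Example B: extra positional arguments are forwarded to the function
buf = io.StringIO()
grouped.pipe(write_group_descriptions, buf)
summary = buf.getvalue()
```

With the data above the group totals are 6, 13 and 11 out of 30, so `impurity` works out to 1 - (36 + 169 + 121) / 900, roughly 0.638. Calling `gini_impurity(grouped)` directly should give the same result; pipe only changes how the call reads in a method chain.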