pyarrow.TableGroupBy — Apache Arrow v20.0.0

class pyarrow.TableGroupBy(table, keys, use_threads=True)

Bases: object

A grouping of columns in a table on which to perform aggregations.

Parameters:

table : pyarrow.Table

Input table to execute the aggregation on.

keys : str or list[str]

Name or names of the columns to group by.

use_threads : bool, default True

Whether to use multithreading or not. When set to True (the default), no stable ordering of the output is guaranteed.

Examples

>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])

Grouping of columns:

>>> pa.TableGroupBy(t, "keys")
<pyarrow.lib.TableGroupBy object at ...>

Perform aggregations:

>>> pa.TableGroupBy(t, "keys").aggregate([("values", "sum")])
pyarrow.Table
keys: string
values_sum: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]

__init__(self, table, keys, use_threads=True)

Methods

aggregate(self, aggregations)

Perform an aggregation over the grouped columns of the table.

Parameters:

aggregations : list[tuple(str, str)] or list[tuple(str, str, FunctionOptions)]

List of tuples, one per aggregation. Each tuple consists of the aggregation column name, followed by the function name and, optionally, a function options object. Pass an empty list to get a single row for each group. The column name can be a string, an empty list, or a list of column names, for unary, nullary, and n-ary aggregation functions respectively.

For the list of function names and respective aggregation function options see Grouped Aggregations.

Returns:

Table

Results of the aggregation functions.

Examples

>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])

Sum the column “values” over the grouped column “keys”:

>>> t.group_by("keys").aggregate([("values", "sum")])
pyarrow.Table
keys: string
values_sum: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]

Count the rows over the grouped column “keys”:

>>> t.group_by("keys").aggregate([([], "count_all")])
pyarrow.Table
keys: string
count_all: int64
----
keys: [["a","b","c"]]
count_all: [[2,2,1]]

Do multiple aggregations:

>>> t.group_by("keys").aggregate([
...    ("values", "sum"),
...    ("keys", "count")
... ])
pyarrow.Table
keys: string
values_sum: int64
keys_count: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]
keys_count: [[2,2,1]]

Count the number of non-null values for column “values” over the grouped column “keys”:

>>> import pyarrow.compute as pc
>>> t.group_by(["keys"]).aggregate([
...    ("values", "count", pc.CountOptions(mode="only_valid"))
... ])
pyarrow.Table
keys: string
values_count: int64
----
keys: [["a","b","c"]]
values_count: [[2,2,1]]

Get a single row for each group in column “keys”:

>>> t.group_by("keys").aggregate([])
pyarrow.Table
keys: string
----
keys: [["a","b","c"]]