pyarrow.TableGroupBy — Apache Arrow v20.0.0
class pyarrow.TableGroupBy(table, keys, use_threads=True)
Bases: object
A grouping of columns in a table on which to perform aggregations.
Parameters:
table : pyarrow.Table
    Input table to execute the aggregation on.
keys : str or list[str]
    Name of the grouped columns.
use_threads : bool, default True
    Whether to use multithreading or not. When set to True (the default), no stable ordering of the output is guaranteed.
Examples
>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])
Grouping of columns:
>>> pa.TableGroupBy(t,"keys")
<pyarrow.lib.TableGroupBy object at ...>
Perform aggregations:
>>> pa.TableGroupBy(t,"keys").aggregate([("values", "sum")])
pyarrow.Table
keys: string
values_sum: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]
__init__(self, table, keys, use_threads=True)
Methods
aggregate(self, aggregations)
Perform an aggregation over the grouped columns of the table.
Parameters:
aggregations : list[tuple(str, str)] or list[tuple(str, str, FunctionOptions)]
    List of tuples, where each tuple is one aggregation specification consisting of the aggregation column name, followed by the function name and, optionally, an aggregation function option. Pass an empty list to get a single row for each group. The column name can be a string, an empty list, or a list of column names, for unary, nullary, and n-ary aggregation functions respectively.
    For the list of function names and their respective aggregation function options, see Grouped Aggregations.
Returns:
Table
    Results of the aggregation functions.
Examples
>>> import pyarrow as pa
>>> t = pa.table([
...       pa.array(["a", "a", "b", "b", "c"]),
...       pa.array([1, 2, 3, 4, 5]),
... ], names=["keys", "values"])
Sum the column “values” over the grouped column “keys”:
>>> t.group_by("keys").aggregate([("values", "sum")])
pyarrow.Table
keys: string
values_sum: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]
Count the rows over the grouped column “keys”:
>>> t.group_by("keys").aggregate([([], "count_all")])
pyarrow.Table
keys: string
count_all: int64
----
keys: [["a","b","c"]]
count_all: [[2,2,1]]
Do multiple aggregations:
>>> t.group_by("keys").aggregate([
...    ("values", "sum"),
...    ("keys", "count")
... ])
pyarrow.Table
keys: string
values_sum: int64
keys_count: int64
----
keys: [["a","b","c"]]
values_sum: [[3,7,5]]
keys_count: [[2,2,1]]
Count the number of non-null values for column “values” over the grouped column “keys”:
>>> import pyarrow.compute as pc
>>> t.group_by(["keys"]).aggregate([
...    ("values", "count", pc.CountOptions(mode="only_valid"))
... ])
pyarrow.Table
keys: string
values_count: int64
----
keys: [["a","b","c"]]
values_count: [[2,2,1]]
Get a single row for each group in column “keys”:
>>> t.group_by("keys").aggregate([])
pyarrow.Table
keys: string
----
keys: [["a","b","c"]]