groupsummary - Group summary computations - MATLAB (original) (raw)

Group summary computations

Syntax

Description

Table Data

[G](#mw%5F49a30f9a-957b-4a9f-8204-aab6afc47002) = groupsummary([T](#d126e702093),[groupvars](#mw%5Fcb7391b0-8e28-4f07-a6be-7fda47cf8c95)) returns the unique grouping variable combinations and the number of members in each group for table or timetable T. Groups are defined by rows in the variables in groupvars that have the same unique combination of values. Each row of the output table corresponds to one group. For example, G = groupsummary(T,"HealthStatus") returns a table with the count of each group in the variableHealthStatus.

You can use groupsummary functionality interactively by adding the Compute by Group task to a live script.

For more information, see Group Summary Computation.

example

[G](#mw%5F49a30f9a-957b-4a9f-8204-aab6afc47002) = groupsummary([T](#d126e702093),[groupvars](#mw%5Fcb7391b0-8e28-4f07-a6be-7fda47cf8c95),[method](#d126e702341)) applies the group-wise computations specified in method and appends the computation results to the output table as additional variables. For example, G = groupsummary(T,"Location","median") returns the median value of every nongrouping variable in T for each location, in addition to the number of members in each location group.

[G](#mw%5F49a30f9a-957b-4a9f-8204-aab6afc47002) = groupsummary([T](#d126e702093),[groupvars](#mw%5Fcb7391b0-8e28-4f07-a6be-7fda47cf8c95),[method](#d126e702341),[datavars](#d126e702538)) specifies the table variables to apply the method to.

example

[G](#mw%5F49a30f9a-957b-4a9f-8204-aab6afc47002) = groupsummary([T](#d126e702093),[groupvars](#mw%5Fcb7391b0-8e28-4f07-a6be-7fda47cf8c95),[groupbins](#mw%5Fdf41489c-3333-489c-9e4e-907e7d2e2f99)) specifies to bin rows in groupvars according to binning scheme groupbins prior to grouping. For example, G = groupsummary(T,"SaleDate","year") returns the group counts for all sales in T within each year according to the grouping variable SaleDate.

[G](#mw%5F49a30f9a-957b-4a9f-8204-aab6afc47002) = groupsummary([T](#d126e702093),[groupvars](#mw%5Fcb7391b0-8e28-4f07-a6be-7fda47cf8c95),[groupbins](#mw%5Fdf41489c-3333-489c-9e4e-907e7d2e2f99),[method](#d126e702341)) specifies how to bin the data in groupvars prior to grouping and applying the group-wise computations specified inmethod.

example

[G](#mw%5F49a30f9a-957b-4a9f-8204-aab6afc47002) = groupsummary([T](#d126e702093),[groupvars](#mw%5Fcb7391b0-8e28-4f07-a6be-7fda47cf8c95),[groupbins](#mw%5Fdf41489c-3333-489c-9e4e-907e7d2e2f99),[method](#d126e702341),[datavars](#d126e702538)) specifies the table variables to apply the method to.

[G](#mw%5F49a30f9a-957b-4a9f-8204-aab6afc47002) = groupsummary(___,[Name,Value](#namevaluepairarguments)) specifies additional grouping properties using one or more name-value arguments for any of the previous syntaxes. For example, G = groupsummary(T,"Category1","IncludeMissingGroups",false) excludes the group made from missing data of type categorical indicated by <undefined> inCategory1.

example

Array Data

[B](#mw%5F53d5bda6-93ee-4341-ba1f-aefeac635de3) = groupsummary([A](#mw%5F73097462-ba93-4219-a9ea-0394617c586f),[groupvars](#mw%5Fcb7391b0-8e28-4f07-a6be-7fda47cf8c95),[method](#d126e702341)) returns the concatenated results of applying the group-wise computations inmethod to unique groups in vector, matrix, or cell arrayA. Groups are defined by rows in the column vectors ingroupvars that have the same unique combination of values. Each row of the output array contains the computation results for one group.

You can use groupsummary functionality interactively by adding the Compute by Group task to a live script.

[B](#mw%5F53d5bda6-93ee-4341-ba1f-aefeac635de3) = groupsummary([A](#mw%5F73097462-ba93-4219-a9ea-0394617c586f),[groupvars](#mw%5Fcb7391b0-8e28-4f07-a6be-7fda47cf8c95),[groupbins](#mw%5Fdf41489c-3333-489c-9e4e-907e7d2e2f99),[method](#d126e702341)) specifies to bin rows in groupvars according to binning scheme groupbins prior to grouping.

example

[B](#mw%5F53d5bda6-93ee-4341-ba1f-aefeac635de3) = groupsummary(___,[Name,Value](#namevaluepairarguments)) specifies additional grouping properties using one or more name-value arguments for either of the previous syntaxes for an input array.

example

[[B](#mw%5F53d5bda6-93ee-4341-ba1f-aefeac635de3),[BG](#mw%5Fcd3e0679-e65b-4813-9826-6fe61964f107),[BC](#mw%5F3bbea317-e5a0-4f68-8bfc-d02299c0b7ac)] = groupsummary([A](#mw%5F73097462-ba93-4219-a9ea-0394617c586f),___) returns additional group information. BG is the unique grouping vector combinations corresponding to the rows in B.BC is the number of members in each group inBG.

example

Examples

collapse all

Compute summary statistics on table variables.

Create a table T that contains information about eight individuals.

HealthStatus = categorical(["Poor"; "Good"; "Fair"; "Fair"; "Poor"; "Excellent"; "Good"; "Excellent"]); Age = [38; 43; 38; 40; 49; 51; 52; 35]; Height = [71; 68; 64; 67; 64; 62; 65; 55]; Weight = [176; 153; 131; 133; 119; 120; 140; 129]; T = table(HealthStatus,Age,Height,Weight)

T=8×4 table HealthStatus Age Height Weight ____________ ___ ______ ______

 Poor           38       71       176  
 Good           43       68       153  
 Fair           38       64       131  
 Fair           40       67       133  
 Poor           49       64       119  
 Excellent      51       62       120  
 Good           52       65       140  
 Excellent      35       55       129  

Compute the counts of the health status groups by specifying HealthStatus as the grouping variable.

G = groupsummary(T,"HealthStatus")

G=4×2 table HealthStatus GroupCount ____________ __________

 Excellent          2     
 Fair               2     
 Good               2     
 Poor               2     

Compute the mean age, height, and weight of each health status group.

G = groupsummary(T,"HealthStatus","mean")

G=4×5 table HealthStatus GroupCount mean_Age mean_Height mean_Weight ____________ __________ ________ ___________ ___________

 Excellent          2             43         58.5           124.5   
 Fair               2             39         65.5             132   
 Good               2           47.5         66.5           146.5   
 Poor               2           43.5         67.5           147.5   

Still grouping by health status, compute only the median height.

G = groupsummary(T,"HealthStatus","median","Height")

G=4×3 table HealthStatus GroupCount median_Height ____________ __________ _____________

 Excellent          2             58.5     
 Fair               2             65.5     
 Good               2             66.5     
 Poor               2             67.5     

Group table data using two grouping variables.

Create a table T that contains information about eight individuals.

HealthStatus = categorical(["Poor"; "Good"; "Fair"; "Fair"; "Poor"; "Excellent"; "Good"; "Excellent"]); Smoker = logical([1; 0; 0; 1; 1; 0; 0; 1]); Weight = [176; 153; 131; 133; 119; 120; 140; 129]; T = table(HealthStatus,Smoker,Weight)

T=8×3 table HealthStatus Smoker Weight ____________ ______ ______

 Poor           true       176  
 Good           false      153  
 Fair           false      131  
 Fair           true       133  
 Poor           true       119  
 Excellent      false      120  
 Good           false      140  
 Excellent      true       129  

Compute the mean weight, grouped by health status and smoking status. By default, some combinations of health status and smoking status are not represented in the output because they are empty groups.

G = groupsummary(T,["HealthStatus","Smoker"],"mean","Weight")

G=6×4 table HealthStatus Smoker GroupCount mean_Weight ____________ ______ __________ ___________

 Excellent      false         1              120   
 Excellent      true          1              129   
 Fair           false         1              131   
 Fair           true          1              133   
 Good           false         2            146.5   
 Poor           true          2            147.5   

Set the value of IncludeEmptyGroups to true to see all group combinations, including the empty ones.

G = groupsummary(T,["HealthStatus","Smoker"],"mean","Weight","IncludeEmptyGroups",true)

G=8×4 table HealthStatus Smoker GroupCount mean_Weight ____________ ______ __________ ___________

 Excellent      false         1              120   
 Excellent      true          1              129   
 Fair           false         1              131   
 Fair           true          1              133   
 Good           false         2            146.5   
 Good           true          0              NaN   
 Poor           false         0              NaN   
 Poor           true          2            147.5   

Group data according to specified bins.

Create a timetable that contains sales information for days within a single month.

TimeStamps = datetime([2017 3 4; 2017 3 2; 2017 3 15; 2017 3 10; ... 2017 3 14; 2017 3 31; 2017 3 25; ... 2017 3 29; 2017 3 21; 2017 3 18]); Profit = [2032 3071 1185 2587 1998 2899 3112 909 2619 3085]'; ItemsSold = [14 13 8 5 10 16 8 6 7 11]'; TT = timetable(TimeStamps,Profit,ItemsSold)

TT=10×2 timetable TimeStamps Profit ItemsSold ___________ ______ _________

04-Mar-2017     2032        14    
02-Mar-2017     3071        13    
15-Mar-2017     1185         8    
10-Mar-2017     2587         5    
14-Mar-2017     1998        10    
31-Mar-2017     2899        16    
25-Mar-2017     3112         8    
29-Mar-2017      909         6    
21-Mar-2017     2619         7    
18-Mar-2017     3085        11    

Compute the mean and the mode of profit binned by the items sold, binning the groups into intervals of item numbers.

format shorte G = groupsummary(TT,"ItemsSold",[0 4 8 12 16],{"mean","mode"},"Profit")

G=3×4 table disc_ItemsSold GroupCount mean_Profit mode_Profit ______________ __________ ___________ ___________

   [4, 8)         3.0000e+00    2.0383e+03     9.0900e+02 
   [8, 12)        4.0000e+00    2.3450e+03     1.1850e+03 
   [12, 16]       3.0000e+00    2.6673e+03     2.0320e+03 

Compute the mean profit grouped by day of the week.

G = groupsummary(TT,"TimeStamps","dayname","mean","Profit")

G=5×3 table dayname_TimeStamps GroupCount mean_Profit __________________ __________ ___________

    Tuesday           2.0000e+00    2.3085e+03 
    Wednesday         2.0000e+00    1.0470e+03 
    Thursday          1.0000e+00    3.0710e+03 
    Friday            2.0000e+00    2.7430e+03 
    Saturday          3.0000e+00    2.7430e+03 

Create a vector of dates and a vector of corresponding profit values.

TimeStamps = datetime([2017 3 4; 2017 3 2; 2017 3 15; 2017 3 10; ... 2017 3 14; 2017 3 31; 2017 3 25; ... 2017 3 29; 2017 3 21; 2017 3 18]); Profit = [2032 3071 1185 2587 1998 2899 3112 909 2619 3085]';

Compute the mean profit for each day of the week. Display the mean, the group names, and the number of members in each group.

format shorte [meanDailyProfit,dayOfWeek,dailyCounts] = groupsummary(Profit,TimeStamps,"dayname","mean")

meanDailyProfit = 5×1

2.3085e+03 1.0470e+03 3.0710e+03 2.7430e+03 2.7430e+03

dayOfWeek = 5×1 categorical Tuesday Wednesday Thursday Friday Saturday

dailyCounts = 5×1

 2
 2
 1
 2
 3

Compute the mean weights for a set of people grouped by their health status and smoker status.

Store information about the individuals as three vectors of different types.

HealthStatus = categorical(["Poor"; "Good"; "Fair"; "Fair"; "Poor"; "Excellent"; "Good"; "Excellent"]); Smoker = logical([1; 0; 0; 1; 1; 0; 0; 1]); Weight = [176; 153; 131; 133; 119; 120; 140; 129];

Grouping by health status and smoker status, compute the mean weights.

B contains the mean for each group (NaN for empty groups). BG is a cell array containing two vectors that describe the groups as you look at their elements row-wise. For instance, the first row of BG{1} indicates that the patients in the first group have a health status Excellent, and the first row of BG{2} indicates that they are nonsmokers. Finally, BC contains the number of members in each group for the corresponding groups in BG.

[B,BG,BC] = groupsummary(Weight,{HealthStatus,Smoker},"mean","IncludeEmptyGroups",true); B

B = 8×1

120.0000 129.0000 131.0000 133.0000 146.5000 NaN NaN 147.5000

ans = 8×1 categorical Excellent Excellent Fair Fair Good Good Poor Poor

ans = 8×1 logical array

0 1 0 1 0 1 0 1

Load data containing patient information and create a table describing each patient's location, systolic and diastolic blood pressure, height, and weight.

load patients Location = categorical(Location); T = table(Location,Systolic,Diastolic,Height,Weight)

T=100×5 table Location Systolic Diastolic Height Weight _________________________ ________ _________ ______ ______

County General Hospital        124          93          71       176  
VA Hospital                    109          77          69       163  
St. Mary's Medical Center      125          83          64       131  
VA Hospital                    117          75          67       133  
County General Hospital        122          80          64       119  
St. Mary's Medical Center      121          70          68       142  
VA Hospital                    130          88          64       142  
VA Hospital                    115          82          68       180  
St. Mary's Medical Center      115          78          68       183  
County General Hospital        118          86          66       132  
County General Hospital        114          77          68       128  
St. Mary's Medical Center      115          68          66       137  
VA Hospital                    127          74          71       174  
VA Hospital                    130          95          72       202  
St. Mary's Medical Center      114          79          65       129  
VA Hospital                    130          92          71       181  
  ⋮

Grouping by location, compute the correlation between patient height and weight and the correlation between systolic and diastolic blood pressure. Use the xcov function as the method to compute the correlation. The first two input arguments to xcov describe the data to correlate, the third argument describes the lag size, and the fourth argument describes the type of normalization. For each group computation, the x and y arguments passed to xcov are specified pairwise by variable from the two cell elements ["Height","Systolic"] and ["Weight","Diastolic"].

G = groupsummary(T,"Location",@(x,y) xcov(x,y,0,"coeff"),{["Height","Systolic"],["Weight","Diastolic"]})

G=3×4 table Location GroupCount fun1_Height_Weight fun1_Systolic_Diastolic _________________________ __________ __________________ _______________________

County General Hospital          39             0.65483                  0.44187        
St. Mary's Medical Center        24             0.62047                  0.44466        
VA Hospital                      37             0.78438                  0.62256        

Alternatively, if your data is in vector or matrix form instead of in a table, you can provide the data to correlate as the first input argument of groupsummary.

[B,BG,BC] = groupsummary({[Height,Systolic],[Weight,Diastolic]},Location,@(x,y) xcov(x,y,0,"coeff"))

B = 3×2

0.6548    0.4419
0.6205    0.4447
0.7844    0.6226

BG = 3×1 categorical County General Hospital St. Mary's Medical Center VA Hospital

Input Arguments

collapse all

Input table, specified as a table or timetable.

Input array, specified as a column vector, group of column vectors stored as a matrix, or cell array of column vectors, character row vectors, or matrices.

When you specify a function handle for method that takes more than one input argument, the input array A must be a cell array of column vectors, character row vectors, or matrices. In each call to the function by group, the input arguments are the corresponding columns of each element in the cell array. For example:

Grouping variables or vectors, specified as one of these options:

Example: groupsummary(T,"Var3")

Computation method, specified as one of these values:

Method Description
"sum" Sum
"mean" Mean
"median" Median
"mode" Mode
"var" Variance
"std" Standard deviation
"min" Minimum
"max" Maximum
"range" Maximum minus minimum
"nummissing" Number of missing elements
"numunique" Number of distinct nonmissing elements
"nnz" Number of nonzero and non-NaN elements
"all" All computations previously listed

You also can specify method as a function handle that returns one output per group whose first dimension has length 1. For table input data, the function operates on each table variable separately.

When the input data is a table T and you specify a function handle for method that takes more than one input argument, you must specify datavars. Thedatavars argument must be a cell array whose elements indicate the table variables to use for each input into the method. In each call to the function by group, the input arguments are the corresponding table variables of the cell array elements. For example:

When the input data is a vector or matrix and you specify a function handle for method that takes more than one input argument, the input data A must be a cell array of vectors or matrices. In each call to the function, the input arguments are the corresponding columns of each element in the cell array. For example:

To specify multiple computations at a time, list the options in a cell array, such as {"mean","median"} or{myFun1,myFun2}.

NaN values in the input data are automatically omitted when using the method names described here, with the exception of"nummissing". To include NaN values, use a function handle for the method, such as@sum instead of "sum".

Data Types: char | string | cell | function_handle

Table variables to operate on, specified as one of the options in this table. datavars indicates which variables of the input table or timetable to apply the methods to. Other variables not specified bydatavars are not operated on and do not pass through to the output. When datavars is not specified,groupsummary operates on each nongrouping variable.

Indexing Scheme Values to Specify Examples
Variable name A string scalar or character vectorA string array or cell array of character vectorsA pattern object "A" or 'A' — A variable named A["A" "B"] or {'A','B'} — Two variables named A andB"Var"+digitsPattern(1) — Variables named"Var" followed by a single digit
Variable index An index number that refers to the location of a variable in the tableA vector of numbersA logical vector. Typically, this vector is the same length as the number of variables, but you can omit trailing0 (false) values. 3 — The third variable from the table[2 3] — The second and third variables from the table[false false true] — The third variable
Function handle A function handle that takes a table variable as input and returns a logical scalar @isnumeric — All the variables containing numeric values
Variable type A vartype subscript that selects variables of a specified type vartype("numeric") — All the variables containing numeric values

When the input data is a table T and you specify a function handle for method that takes more than one input argument, you must specify datavars. Thedatavars argument must be a cell array whose elements are any of the options in the table. The cell array elements indicate the table variables to use for each input into the method. In each call to the function by group, the input arguments are the corresponding table variables of the cell array elements. For example:

Example: groupsummary(T,groupvars,method,["Var1" "Var2" "Var4"])

Binning scheme for grouping variables or vectors, specified as one or more of the following binning methods. Grouping variables or vectors and binning scheme arguments must be the same size, or one of them can be scalar.

Example: G = groupsummary(T,"Var1",[-Inf 0 Inf])

Example: G = groupsummary(T,["Var1" "Var2"],{"none" "year"})

Name-Value Arguments

collapse all

Specify optional pairs of arguments asName1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: G = groupsummary(T,groupvars,groupbins,IncludedEdge="right")

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: G = groupsummary(T,groupvars,groupbins,"IncludedEdge","right")

Included bin edge for binning scheme, specified as either"left" or "right", indicating which end of the bin interval is inclusive.

You can specify IncludedEdge only if you also specify groupbins, and the value applies to all binning methods for all grouping variables or vectors.

Option to treat missing values as a group, specified as a numeric or logical 1 (true) or0 (false). IfIncludeMissingGroups is true, then groupsummary treats missing values, such asNaN, in a grouping variable or vector as a group. If a grouping variable or vector has no missing values, or ifIncludeMissingGroups is false, then groupsummary does not treat missing values as a group.

Option to include empty groups in the group summary operation, specified as a numeric or logical 0 (false) or 1 (true). If IncludeEmptyGroups is false, then groupsummary omits empty groups. If IncludeEmptyGroups istrue, then groupsummary includes empty groups.

An empty group occurs in these cases:

Output Arguments

collapse all

Output table for table or timetable input data, returned as a table.G contains the computed groups, the number of elements in each group, and if method is provided, the result of the specified computations.

Output array for array input data, returned as a vector or matrix.B contains the specified computations for each group. When multiple methods are specified, groupsummary horizontally concatenates the computations in the order that they were listed.

Groups for array input data, returned as a column vector or cell array of column vectors. For a single grouping vector, the output groups are sorted according to the order returned by the unique function with the"sorted" option.

For more than one input vector, BG is a cell array containing column vectors of equal length. Information for each group is contained in the elements of a row across all vectors inBG. Each group maps to the corresponding row of the output array B.

Group counts for array input data, returned as a column vector.BC contains the number of elements in each group. The length of BC is the same as the length of the group column vectors returned in BG.

More About

collapse all

This table illustrates group summary computations.

Sample Table T Syntax Example Resulting Table
groupsummary(T,"VarA")
groupsummary(T,"VarA","mean")
groupsummary(T,["VarA" "VarB"],{"none",[-Inf 0 Inf]},"min")
groupsummary(T,"VarA",["mean" "median" "mode"],"VarB")

Tips

Alternative Functionality

Live Editor Task

You can use groupsummary functionality interactively by adding the Compute by Group task to a live script.

Compute by Group task in the Live Editor

Extended Capabilities

expand all

Thegroupsummary function supports tall arrays with the following usage notes and limitations:

For more information, see Tall Arrays.

Usage notes and limitations:

Version History

Introduced in R2018a

expand all

The groupsummary function shows improved performance when you specify at least two of the "sum", "mean","min", or "max" computation methods, especially when the number of data points per group is small.

For example, this code computes the sum and maximum for a table with one data variable and one grouping variable. The code is about 6.6x faster than in the previous release.

function timingTest data = rand(1000,1); groups = categorical(randi(500,[1000 1])); for i = 1:1200 groupsummary(data,groups,["sum" "max"]); end end

The approximate execution times are:

R2024b: 2.11 s

R2025a: 0.32 s

The code was timed on a Windows® 11, AMD EPYC™ 74F3 24-Core Processor @ 3.19 GHz test system using thetimeit function.

Apply multiple binning methods to one grouping variable or vector by specifying a cell array of binning methods.

Compute the number of distinct nonmissing elements in each group of data. Specify the "numunique" or "all" computation method.

The "all" computation method now returns the number of unique values in addition the computation methods in the previous release.

Character arrays have no default definition of a standard missing value. Therefore, the nummissing method treats blank character array elements (' ') as nonmissing.

Generate C or C++ code for the groupsummary function. For usage notes and limitations, see C/C++ Code Generation.

The groupsummary function shows improved performance, especially when the data count in each group is small.

For example, this code performs group summary computations on a matrix with 500 groups with a count of 10 each. The code is about 2.70x faster than in the previous release.

function timingGroupsummary data = (1:5000)'; groups = repelem(1:length(data)/10,10)'; p = randperm(length(data)); data = data(p); groups = groups(p);

tic for k = 1:300 G = groupsummary(data,groups,"mean"); end toc end

The approximate execution times are:

R2021b: 2.65 s

R2022a: 0.98 s

The code was timed on a Windows 10, Intel® Xeon® CPU E5-1650 v4 @ 3.60 GHz test system by calling thetimingGroupsummary function.

The "nummissing" and "nnz" methods no longer error for input data types with no default definition of a standard missing value.

Code that relied on the errors that MATLAB threw for those inputs, such as code within a try/catch block, may no longer catch those errors.

See Also

Functions

Live Editor Tasks

Topics