findgroups
Find groups and return group numbers
Syntax
Description
To split data into groups and apply a function to the groups, use the findgroups and splitapply functions together. For more information about calculations on groups of data, see Calculations on Groups of Data.
G = findgroups(A) returns G, a vector of group numbers created from the grouping variable A. The output argument G contains integer values from 1 to N, indicating N distinct groups for the N unique values in A. For example, if A is ["b","a","a","b"], then findgroups returns G as [2 1 1 2]. In other words, the group numbers in G correspond to the sorted unique values in A.
To use G to split groups of data out of other variables, pass it as an input argument to the splitapply function.
The findgroups function treats missing strings, empty character vectors, and NaN, NaT, and undefined categorical values in A as missing values and returns NaN as the corresponding elements of G.
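As a minimal sketch of the behavior described above, the following calls show group numbering for a string array and for a numeric vector that contains a missing value. The names B and GB are illustrative and not part of the function's interface.

```matlab
% Group numbers follow the sorted unique values "a" and "b".
A = ["b","a","a","b"];
G = findgroups(A);        % G is [2 1 1 2]

% B is a hypothetical numeric grouping variable with one missing value.
B = [3 NaN 1 3];
GB = findgroups(B);       % GB is [2 NaN 1 2]; the NaN element belongs to no group
```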
G = findgroups(A1,...,AN) creates group numbers from A1,...,AN. The findgroups function defines groups as the unique combinations of values across A1,...,AN. For example, if A1 is ["a","a","b","b"] and A2 is [0 1 0 0], then findgroups(A1,A2) returns G as [1 2 3 3], because the combination "b" 0 occurs twice.
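The numbering of combinations depends on the order in which the grouping variables are passed, because the combinations are sorted by the first variable first. A short sketch using the same example values (G12 and G21 are illustrative names):

```matlab
A1 = ["a","a","b","b"];
A2 = [0 1 0 0];

% Combinations sorted by A1, then A2: ("a",0) ("a",1) ("b",0)
G12 = findgroups(A1,A2);   % [1 2 3 3]

% Combinations sorted by A2, then A1: (0,"a") (0,"b") (1,"a")
G21 = findgroups(A2,A1);   % [1 3 2 2]
```

Both calls find the same three groups; only the group numbers assigned to them differ.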
[G,ID] = findgroups(A) also returns the sorted unique values for each group in ID. For example, if A is ["b","a","a","b"], then findgroups returns G as [2 1 1 2] and ID as ["a","b"]. The arguments A and ID are the same data type, but need not be the same size.
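One way to see how G and ID relate is to index ID with G, which recovers the original values when A has no missing values. This is a sketch with column vectors; the name labels is illustrative.

```matlab
A = ["b";"a";"a";"b"];
[G,ID] = findgroups(A);    % G is [2;1;1;2], ID is ["a";"b"]

% Indexing the group labels by group number rebuilds the input.
% This only works when G contains no NaN values.
labels = ID(G);
tf = isequal(labels,A);    % true
```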
[G,ID1,...,IDN] = findgroups(A1,...,AN) also returns the sorted unique values for each group across ID1,...,IDN. The values across ID1,...,IDN define the groups. For example, if A1 is ["a","a","b","b"] and A2 is [0 1 0 0], then findgroups(A1,A2) returns G as [1 2 3 3], and ID1 and ID2 as ["a","a","b"] and [0 1 0].
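Because ID1,...,IDN each have one element per group, they line up with any per-group result. The sketch below counts the members of each group with accumarray and collects the result in a table; counts and summary are illustrative names.

```matlab
A1 = ["a";"a";"b";"b"];
A2 = [0;1;0;0];
[G,ID1,ID2] = findgroups(A1,A2);   % G is [1;2;3;3]

% Count how many elements fall into each group.
counts = accumarray(G,1);          % [1;1;2]

% One row per group: the identifying values and the group size.
summary = table(ID1,ID2,counts);
```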
G = findgroups(T) returns G, a vector of group numbers created from the variables in table T. The findgroups function treats all the variables in T as grouping variables.
[G,TID] = findgroups(T) also returns TID, a table that contains the unique values for each group. TID contains the unique combinations of values across the variables of T. The variables in T and TID have the same names, but the tables need not have the same number of rows.
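A small sketch of the table syntax, using a made-up four-row table (the variable names Site and Flag are hypothetical):

```matlab
Site = ["north";"south";"north";"south"];
Flag = [true;true;false;true];
T = table(Site,Flag);

% Every variable in T acts as a grouping variable.
[G,TID] = findgroups(T);
% G is [2;3;1;3]; TID has three rows, one per unique (Site,Flag) combination,
% with the same variable names as T.
```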
Examples
Use group numbers to split patient weight measurements into groups of weights for smokers and nonsmokers. Then calculate the mean weight for each group of patients.
Load patient data from the sample file patients.mat.
load patients
whos Smoker Weight
Name Size Bytes Class Attributes
Smoker 100x1 100 logical
Weight 100x1 800 double
Specify groups with findgroups. Each element of G is a group number that specifies which group a patient is in. Group 1 contains nonsmokers and group 2 contains smokers.
G = findgroups(Smoker)
G = 100×1
2
1
1
1
1
1
2
1
1
1
1
1
1
2
1
⋮
Display the weights of the patients.
Weight = 100×1
176
163
131
133
119
142
142
180
183
132
128
137
174
202
129
⋮
Split the Weight array into two groups of weights using G. Apply the mean function. The mean weight of the nonsmokers is a bit less than the mean weight of the smokers.
meanWeights = splitapply(@mean,Weight,G)
meanWeights = 2×1
149.9091
161.9412
Calculate mean weights for groups of patients. In this case, group patients by their statuses as smokers or nonsmokers, and by the hospitals where they were seen. There are three hospitals in the data set, so there are six groups of patients.
Load hospital locations, smoker status, and weights for patients from the sample file patients.mat.
load patients
whos Location Smoker Weight
Name Size Bytes Class Attributes
Location 100x1 15808 cell
Smoker 100x1 100 logical
Weight 100x1 800 double
Display the Location and Smoker arrays.
Location = 100×1 cell
{'County General Hospital'  }
{'VA Hospital'              }
{'St. Mary's Medical Center'}
{'VA Hospital'              }
{'County General Hospital'  }
{'St. Mary's Medical Center'}
{'VA Hospital'              }
{'VA Hospital'              }
{'St. Mary's Medical Center'}
{'County General Hospital'  }
{'County General Hospital'  }
{'St. Mary's Medical Center'}
{'VA Hospital'              }
{'VA Hospital'              }
{'St. Mary's Medical Center'}
{'VA Hospital'              }
{'St. Mary's Medical Center'}
{'VA Hospital'              }
{'County General Hospital'  }
{'County General Hospital'  }
{'VA Hospital'              }
{'VA Hospital'              }
{'VA Hospital'              }
{'County General Hospital'  }
{'County General Hospital'  }
{'VA Hospital'              }
{'VA Hospital'              }
{'County General Hospital'  }
{'County General Hospital'  }
{'County General Hospital'  }
⋮
Smoker = 100×1 logical array
1
0
0
0
0
0
1
0
0
0
0
0
0
1
0
⋮
Specify groups using locations and smoker status. G contains integers from one to six because there are six possible combinations of values from Smoker and Location.
G = findgroups(Location,Smoker)
G = 100×1
2
5
3
5
1
3
6
5
3
1
1
3
5
6
3
⋮
Calculate the mean weight for each group. There is less variation by location than by status as a smoker.
meanWeights = splitapply(@mean,Weight,G)
meanWeights = 6×1
150.1739
159.8125
146.8947
158.4000
152.0417
165.9231
Calculate the mean weights for groups of patients and display the results in a table. To associate the mean weights with group IDs, use the second output argument from findgroups.
Load patient weights and smoker statuses from the sample file patients.mat.
load patients
whos Smoker Weight
Name Size Bytes Class Attributes
Smoker 100x1 100 logical
Weight 100x1 800 double
Specify groups using findgroups. The values in the output argument ID are labels for the groups that findgroups finds in the grouping variable.
[G,ID] = findgroups(Smoker)
G = 100×1
2
1
1
1
1
1
2
1
1
1
1
1
1
2
1
⋮
ID = 2×1 logical array
0
1
Calculate the mean weights. Create a table that contains the mean weights.
meanWeight = splitapply(@mean,Weight,G);
T = table(ID,meanWeight,'VariableNames',["Smokers","Mean Weights"])
T=2×2 table
Smokers    Mean Weights
_______    ____________
false 149.91
true 161.94
Calculate mean weights for groups of patients and display the results in a table. In this case, group patients by their statuses as smokers or nonsmokers, and by the hospitals where they were seen.
Load hospital locations, smoker status, and weights for patients from the sample file patients.mat.
load patients
whos Location Smoker Weight
Name Size Bytes Class Attributes
Location 100x1 15808 cell
Smoker 100x1 100 logical
Weight 100x1 800 double
Convert Location to a string array. Then specify groups using locations and smoker status. You can specify two group IDs as additional outputs because you specify two grouping variables as inputs. There are six possible combinations of locations and smoker status. Together, ID1 and ID2 provide IDs for the six groups.
Location = string(Location);
[G,ID1,ID2] = findgroups(Location,Smoker)
G = 100×1
2
5
3
5
1
3
6
5
3
1
1
3
5
6
3
⋮
ID1 = 6×1 string
"County General Hospital"
"County General Hospital"
"St. Mary's Medical Center"
"St. Mary's Medical Center"
"VA Hospital"
"VA Hospital"
ID2 = 6×1 logical array
0
1
0
1
0
1
Calculate the mean weight for each group.
meanWeights = splitapply(@mean,Weight,G)
meanWeights = 6×1
150.1739
159.8125
146.8947
158.4000
152.0417
165.9231
Create a table with the mean weight for each group of patients.
T = table(ID1,ID2,meanWeights,'VariableNames',["Hospital","Smoker","Mean Weight"])
T=6×3 table
Hospital                       Smoker    Mean Weight
___________________________    ______    ___________
"County General Hospital" false 150.17
"County General Hospital" true 159.81
"St. Mary's Medical Center" false 146.89
"St. Mary's Medical Center" true 158.4
"VA Hospital" false 152.04
"VA Hospital" true 165.92
Calculate mean weights for patients using grouping variables that are in a table.
Load hospital locations and smoking statuses for 100 patients into a table.
load patients
T = table(Location,Smoker)
T=100×2 table
Location                         Smoker
_____________________________    ______
{'County General Hospital' } true
{'VA Hospital' } false
{'St. Mary's Medical Center'} false
{'VA Hospital' } false
{'County General Hospital' } false
{'St. Mary's Medical Center'} false
{'VA Hospital' } true
{'VA Hospital' } false
{'St. Mary's Medical Center'} false
{'County General Hospital' } false
{'County General Hospital' } false
{'St. Mary's Medical Center'} false
{'VA Hospital' } false
{'VA Hospital' } true
{'St. Mary's Medical Center'} false
{'VA Hospital' } true
⋮
Specify groups of patients using the Smoker and Location variables in T.
G = findgroups(T)
G = 100×1
2
5
3
5
1
3
6
5
3
1
1
3
5
6
3
⋮
Calculate mean weights from the data array Weight.
meanWeights = splitapply(@mean,Weight,G)
meanWeights = 6×1
150.1739
159.8125
146.8947
158.4000
152.0417
165.9231
Create a table of mean weights for patients grouped by hospital location and status as a smoker or nonsmoker.
Load locations and smoking statuses for patients into a table. Convert Location to a string array.
load patients
Location = string(Location);
T = table(Location,Smoker)
T=100×2 table
Location                       Smoker
___________________________    ______
"County General Hospital" true
"VA Hospital" false
"St. Mary's Medical Center" false
"VA Hospital" false
"County General Hospital" false
"St. Mary's Medical Center" false
"VA Hospital" true
"VA Hospital" false
"St. Mary's Medical Center" false
"County General Hospital" false
"County General Hospital" false
"St. Mary's Medical Center" false
"VA Hospital" false
"VA Hospital" true
"St. Mary's Medical Center" false
"VA Hospital" true
⋮
Specify groups of patients using the Location and Smoker variables in T. The output table TID identifies the groups.
[G,TID] = findgroups(T);
TID
TID=6×2 table
Location                       Smoker
___________________________    ______
"County General Hospital" false
"County General Hospital" true
"St. Mary's Medical Center" false
"St. Mary's Medical Center" true
"VA Hospital" false
"VA Hospital" true
Calculate mean weights from the data array Weight. Append the mean weights to TID.
TID.meanWeight = splitapply(@mean,Weight,G)
TID=6×3 table
Location                       Smoker    meanWeight
___________________________    ______    __________
"County General Hospital" false 150.17
"County General Hospital" true 159.81
"St. Mary's Medical Center" false 146.89
"St. Mary's Medical Center" true 158.4
"VA Hospital" false 152.04
"VA Hospital" true 165.92
Input Arguments
A: Grouping variable, specified as a vector. The unique values in A identify groups. You can specify grouping variables using the data types listed in the table.
| Values That Specify Groups | Data Type of Grouping Variable |
|---|---|
| Numbers | Numeric or logical vector |
| Text | String array or cell array of character vectors |
| Dates and times | datetime, duration, or calendarDuration vector |
| Categories | categorical vector |
| Bins | Vector of binned values, created by binning a continuous distribution of numeric, datetime, or duration values |
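As an illustration of the "Bins" row, one possible way to create binned values is the discretize function; the data and bin edges below are made up for the sketch. Note that findgroups renumbers the bins that actually occur, so G need not equal the bin numbers.

```matlab
x = [1;5;25;28;7];            % hypothetical continuous data
edges = [0 10 20 30];         % hypothetical bin edges

% Bin numbers from discretize: no value falls in the middle bin.
bins = discretize(x,edges);   % [1;1;3;3;1]

% Group numbers run from 1 to the number of occupied bins.
[G,ID] = findgroups(bins);    % G is [1;1;2;2;1], ID is [1;3]
```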
T: Grouping variables, specified as a table. findgroups treats each table variable as a separate grouping variable.
A table variable can be a numeric, logical, string, categorical, datetime, duration, or calendarDuration vector, or a cell array of character vectors.
Output Arguments
G: Group numbers, returned as a vector of positive integers. For N groups identified in the grouping variables, every integer between 1 and N specifies a group. G contains NaN where any grouping variable contains a missing string, an empty character vector, or a NaN, NaT, or undefined categorical value.
- If the grouping variables are vectors, then G and the grouping variables all are the same size.
- If the grouping variables are in a table, the length of G is equal to the number of rows of the table.
ID: Values that identify each group, returned as a vector of sorted unique values from the input argument A. The data type of ID is the same as the data type of A.
TID: The unique values that identify each group, returned as a table. The variables of TID have the sorted unique values from the corresponding variables of T. However, TID and T need not have the same number of rows.
More About
In data analysis, you commonly perform calculations on groups of data. For such calculations, you split one or more data variables into groups of data, perform a calculation on each group, and combine the results into one or more output variables. You can specify the groups using one or more _grouping variables_. The unique values in the grouping variables define the groups that the corresponding values of the data variables belong to.
For example, consider a simple grouped calculation that splits a 6-by-1 numeric vector into two groups of data, calculates the mean of each group, and then combines the outputs into a 2-by-1 numeric vector. The 6-by-1 grouping variable has two unique values, AB and XYZ.
You can specify grouping variables that have numbers, text, dates and times, categories, or bins.
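The grouped calculation described above can be sketched as follows. The data values are made up for illustration, since the source does not list them; only the group labels AB and XYZ come from the text.

```matlab
% 6-by-1 grouping variable with two unique values.
group = ["AB";"XYZ";"XYZ";"AB";"XYZ";"AB"];

% 6-by-1 data variable (hypothetical values).
data = [2;10;20;4;30;6];

% Split: assign each element a group number (AB is 1, XYZ is 2).
[G,ID] = findgroups(group);

% Apply and combine: one mean per group, returned as a 2-by-1 vector.
means = splitapply(@mean,data,G);   % [4;20]
```

Here ID lists the group labels in the same order as the rows of means, so the two can be placed side by side in a table.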
Extended Capabilities
This function supports tall arrays with the limitations:
- Tall tables are not supported.
- The order of the group numbers in G might be different compared to in-memory findgroups calculations.
For more information, see Tall Arrays for Out-of-Memory Data.
Version History
Introduced in R2015b