Grouped data (original) (raw)

dplyr verbs are particularly powerful when you apply them to grouped data frames (grouped_df objects). This vignette shows you:

We’ll start by loading dplyr:

group_by()

The most important grouping verb is [group_by()](../reference/group%5Fby.html): it takes a data frame and one or more variables to group by:

You can see the grouping when you print the data:

by_species
#> # A tibble: 87 × 14
#> # Groups:   species [38]
#>   name    height  mass hair_color skin_color eye_color birth_year sex  
#>   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1 Luke S…    172    77 blond      fair       blue            19   male 
#> 2 C-3PO      167    75 NA         gold       yellow         112   none 
#> 3 R2-D2       96    32 NA         white, bl… red             33   none 
#> 4 Darth …    202   136 none       white      yellow          41.9 male 
#> # ℹ 83 more rows
#> # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
by_sex_gender
#> # A tibble: 87 × 14
#> # Groups:   sex, gender [6]
#>   name    height  mass hair_color skin_color eye_color birth_year sex  
#>   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1 Luke S…    172    77 blond      fair       blue            19   male 
#> 2 C-3PO      167    75 NA         gold       yellow         112   none 
#> 3 R2-D2       96    32 NA         white, bl… red             33   none 
#> 4 Darth …    202   136 none       white      yellow          41.9 male 
#> # ℹ 83 more rows
#> # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Or use [tally()](../reference/count.html) to count the number of rows in each group. The sort argument is useful if you want to see the largest groups up front.

by_species %>% tally()
#> # A tibble: 38 × 2
#>   species      n
#>   <chr>    <int>
#> 1 Aleena       1
#> 2 Besalisk     1
#> 3 Cerean       1
#> 4 Chagrian     1
#> # ℹ 34 more rows

by_sex_gender %>% tally(sort = TRUE)
#> # A tibble: 6 × 3
#> # Groups:   sex [5]
#>   sex    gender        n
#>   <chr>  <chr>     <int>
#> 1 male   masculine    60
#> 2 female feminine     16
#> 3 none   masculine     5
#> 4 NA     NA            4
#> # ℹ 2 more rows

As well as grouping by existing variables, you can group by any function of existing variables. This is equivalent to performing a[mutate()](../reference/mutate.html) before the[group_by()](../reference/group%5Fby.html):

bmi_breaks <- c(0, 18.5, 25, 30, Inf)

starwars %>%
  group_by(bmi_cat = cut(mass/(height/100)^2, breaks=bmi_breaks)) %>%
  tally()
#> # A tibble: 5 × 2
#>   bmi_cat       n
#>   <fct>     <int>
#> 1 (0,18.5]     10
#> 2 (18.5,25]    24
#> 3 (25,30]      13
#> 4 (30,Inf]     12
#> # ℹ 1 more row

You can see underlying group data with [group_keys()](../reference/group%5Fdata.html). It has one row for each group and one column for each grouping variable:

by_species %>% group_keys()
#> # A tibble: 38 × 1
#>   species 
#>   <chr>   
#> 1 Aleena  
#> 2 Besalisk
#> 3 Cerean  
#> 4 Chagrian
#> # ℹ 34 more rows

by_sex_gender %>% group_keys()
#> # A tibble: 6 × 2
#>   sex            gender   
#>   <chr>          <chr>    
#> 1 female         feminine 
#> 2 hermaphroditic masculine
#> 3 male           masculine
#> 4 none           feminine 
#> # ℹ 2 more rows

You can see which group each row belongs to with[group_indices()](../reference/group%5Fdata.html):

by_species %>% group_indices()
#>  [1] 11  6  6 11 11 11 11  6 11 11 11 11 34 11 24 12 11 38 36 11 11  6
#> [23] 31 11 11 18 11 11  8 26 11 21 11 11 10 10 10 11 30  7 11 11 37 32
#> [45] 32  1 33 35 29 11  3 20 37 27 13 23 16  4 38 38 11  9 17 17 11 11
#> [67] 11 11  5  2 15 15 11  6 25 19 28 14 34 11 38 22 11 11 11  6 11

And which rows each group contains with[group_rows()](../reference/group%5Fdata.html):

by_species %>% group_rows() %>% head()
#> <list_of<integer>[6]>
#> [[1]]
#> [1] 46
#> 
#> [[2]]
#> [1] 70
#> 
#> [[3]]
#> [1] 51
#> 
#> [[4]]
#> [1] 58
#> 
#> [[5]]
#> [1] 69
#> 
#> [[6]]
#> [1]  2  3  8 22 74 86

Use [group_vars()](../reference/group%5Fdata.html) if you just want the names of the grouping variables:

Changing and adding to grouping variables

If you apply [group_by()](../reference/group%5Fby.html) to an already grouped dataset, will overwrite the existing grouping variables. For example, the following code groups by homeworld instead ofspecies:

by_species %>%
  group_by(homeworld) %>%
  tally()
#> # A tibble: 49 × 2
#>   homeworld       n
#>   <chr>       <int>
#> 1 Alderaan        3
#> 2 Aleen Minor     1
#> 3 Bespin          1
#> 4 Bestine IV      1
#> # ℹ 45 more rows

To augment the grouping, using.add = TRUE1. For example, the following code groups by species and homeworld:

by_species %>%
  group_by(homeworld, .add = TRUE) %>%
  tally()
#> # A tibble: 57 × 3
#> # Groups:   species [38]
#>   species  homeworld       n
#>   <chr>    <chr>       <int>
#> 1 Aleena   Aleen Minor     1
#> 2 Besalisk Ojom            1
#> 3 Cerean   Cerea           1
#> 4 Chagrian Champala        1
#> # ℹ 53 more rows

Removing grouping variables

To remove all grouping variables, use [ungroup()](../reference/group%5Fby.html):

You can also choose to selectively ungroup by listing the variables you want to remove:

by_sex_gender %>% 
  ungroup(sex) %>% 
  tally()
#> # A tibble: 3 × 2
#>   gender        n
#>   <chr>     <int>
#> 1 feminine     17
#> 2 masculine    66
#> 3 NA            4

Verbs

The following sections describe how grouping affects the main dplyr verbs.

summarise()

[summarise()](../reference/summarise.html) computes a summary for each group. This means that it starts from [group_keys()](../reference/group%5Fdata.html), adding summary variables to the right hand side:

by_species %>%
  summarise(
    n = n(),
    height = mean(height, na.rm = TRUE)
  )
#> # A tibble: 38 × 3
#>   species      n height
#>   <chr>    <int>  <dbl>
#> 1 Aleena       1     79
#> 2 Besalisk     1    198
#> 3 Cerean       1    198
#> 4 Chagrian     1    196
#> # ℹ 34 more rows

The .groups= argument controls the grouping structure of the output. The historical behaviour of removing the right hand side grouping variable corresponds to .groups = "drop_last"without a message or .groups = NULL with a message (the default).

Since version 1.0.0 the groups may also be kept (.groups = "keep") or dropped (.groups = "drop").

When the output no longer have grouping variables, it becomes ungrouped (i.e. a regular tibble).

select(), rename(), andrelocate()

[rename()](../reference/rename.html) and [relocate()](../reference/relocate.html) behave identically with grouped and ungrouped data because they only affect the name or position of existing columns. Grouped [select()](../reference/select.html) is almost identical to ungrouped select, except that it always includes the grouping variables:

by_species %>% select(mass)
#> Adding missing grouping variables: `species`
#> # A tibble: 87 × 2
#> # Groups:   species [38]
#>   species  mass
#>   <chr>   <dbl>
#> 1 Human      77
#> 2 Droid      75
#> 3 Droid      32
#> 4 Human     136
#> # ℹ 83 more rows

If you don’t want the grouping variables, you’ll have to first[ungroup()](../reference/group%5Fby.html). (This design is possibly a mistake, but we’re stuck with it for now.)

arrange()

Grouped [arrange()](../reference/arrange.html) is the same as ungrouped[arrange()](../reference/arrange.html), unless you set .by_group = TRUE, in which case it will order first by the grouping variables.

by_species %>%
  arrange(desc(mass)) %>%
  relocate(species, mass)
#> # A tibble: 87 × 14
#> # Groups:   species [38]
#>   species  mass name  height hair_color skin_color eye_color birth_year
#>   <chr>   <dbl> <chr>  <int> <chr>      <chr>      <chr>          <dbl>
#> 1 Hutt     1358 Jabb…    175 NA         green-tan… orange         600  
#> 2 Kaleesh   159 Grie…    216 none       brown, wh… green, y…       NA  
#> 3 Droid     140 IG-88    200 none       metal      red             15  
#> 4 Human     136 Dart…    202 none       white      yellow          41.9
#> # ℹ 83 more rows
#> # ℹ 6 more variables: sex <chr>, gender <chr>, homeworld <chr>,
#> #   films <list>, vehicles <list>, starships <list>

by_species %>%
  arrange(desc(mass), .by_group = TRUE) %>%
  relocate(species, mass)
#> # A tibble: 87 × 14
#> # Groups:   species [38]
#>   species  mass name  height hair_color skin_color eye_color birth_year
#>   <chr>   <dbl> <chr>  <int> <chr>      <chr>      <chr>          <dbl>
#> 1 Aleena     15 Ratt…     79 none       grey, blue unknown           NA
#> 2 Besali…   102 Dext…    198 none       brown      yellow            NA
#> 3 Cerean     82 Ki-A…    198 white      pale       yellow            92
#> 4 Chagri…    NA Mas …    196 none       blue       blue              NA
#> # ℹ 83 more rows
#> # ℹ 6 more variables: sex <chr>, gender <chr>, homeworld <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Note that second example is sorted by species (from the[group_by()](../reference/group%5Fby.html) statement) and then by mass (within species).

mutate()

In simple cases with vectorised functions, grouped and ungrouped[mutate()](../reference/mutate.html) give the same results. They differ when used with summary functions:

# Subtract off global mean
starwars %>% 
  select(name, homeworld, mass) %>% 
  mutate(standard_mass = mass - mean(mass, na.rm = TRUE))
#> # A tibble: 87 × 4
#>   name           homeworld  mass standard_mass
#>   <chr>          <chr>     <dbl>         <dbl>
#> 1 Luke Skywalker Tatooine     77         -20.3
#> 2 C-3PO          Tatooine     75         -22.3
#> 3 R2-D2          Naboo        32         -65.3
#> 4 Darth Vader    Tatooine    136          38.7
#> # ℹ 83 more rows

# Subtract off homeworld mean
starwars %>% 
  select(name, homeworld, mass) %>% 
  group_by(homeworld) %>% 
  mutate(standard_mass = mass - mean(mass, na.rm = TRUE))
#> # A tibble: 87 × 4
#> # Groups:   homeworld [49]
#>   name           homeworld  mass standard_mass
#>   <chr>          <chr>     <dbl>         <dbl>
#> 1 Luke Skywalker Tatooine     77         -8.38
#> 2 C-3PO          Tatooine     75        -10.4 
#> 3 R2-D2          Naboo        32        -32.2 
#> 4 Darth Vader    Tatooine    136         50.6 
#> # ℹ 83 more rows

Or with window functions like [min_rank()](../reference/row%5Fnumber.html):

# Overall rank
starwars %>% 
  select(name, homeworld, height) %>% 
  mutate(rank = min_rank(height))
#> # A tibble: 87 × 4
#>   name           homeworld height  rank
#>   <chr>          <chr>      <int> <int>
#> 1 Luke Skywalker Tatooine     172    28
#> 2 C-3PO          Tatooine     167    20
#> 3 R2-D2          Naboo         96     5
#> 4 Darth Vader    Tatooine     202    72
#> # ℹ 83 more rows

# Rank per homeworld
starwars %>% 
  select(name, homeworld, height) %>% 
  group_by(homeworld) %>% 
  mutate(rank = min_rank(height))
#> # A tibble: 87 × 4
#> # Groups:   homeworld [49]
#>   name           homeworld height  rank
#>   <chr>          <chr>      <int> <int>
#> 1 Luke Skywalker Tatooine     172     5
#> 2 C-3PO          Tatooine     167     4
#> 3 R2-D2          Naboo         96     1
#> 4 Darth Vader    Tatooine     202    10
#> # ℹ 83 more rows

filter()

A grouped [filter()](../reference/filter.html) effectively does a[mutate()](../reference/mutate.html) to generate a logical variable, and then only keeps the rows where the variable is TRUE. This means that grouped filters can be used with summary functions. For example, we can find the tallest character of each species:

by_species %>%
  select(name, species, height) %>% 
  filter(height == max(height))
#> # A tibble: 36 × 3
#> # Groups:   species [36]
#>   name                  species        height
#>   <chr>                 <chr>           <int>
#> 1 Greedo                Rodian            173
#> 2 Jabba Desilijic Tiure Hutt              175
#> 3 Yoda                  Yoda's species     66
#> 4 Bossk                 Trandoshan        190
#> # ℹ 32 more rows

You can also use [filter()](../reference/filter.html) to remove entire groups. For example, the following code eliminates all groups that only have a single member:

by_species %>%
  filter(n() != 1) %>% 
  tally()
#> # A tibble: 9 × 2
#>   species      n
#>   <chr>    <int>
#> 1 Droid        6
#> 2 Gungan       3
#> 3 Human       35
#> 4 Kaminoan     2
#> # ℹ 5 more rows

slice() and friends

[slice()](../reference/slice.html) and friends ([slice_head()](../reference/slice.html),[slice_tail()](../reference/slice.html), [slice_sample()](../reference/slice.html),[slice_min()](../reference/slice.html) and [slice_max()](../reference/slice.html)) select rows within a group. For example, we can select the first observation within each species:

by_species %>%
  relocate(species) %>% 
  slice(1)
#> # A tibble: 38 × 14
#> # Groups:   species [38]
#>   species name  height  mass hair_color skin_color eye_color birth_year
#>   <chr>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl>
#> 1 Aleena  Ratt…     79    15 none       grey, blue unknown           NA
#> 2 Besali… Dext…    198   102 none       brown      yellow            NA
#> 3 Cerean  Ki-A…    198    82 white      pale       yellow            92
#> 4 Chagri… Mas …    196    NA none       blue       blue              NA
#> # ℹ 34 more rows
#> # ℹ 6 more variables: sex <chr>, gender <chr>, homeworld <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Similarly, we can use [slice_min()](../reference/slice.html) to select the smallestn values of a variable:

by_species %>%
  filter(!is.na(height)) %>% 
  slice_min(height, n = 2)
#> # A tibble: 47 × 14
#> # Groups:   species [38]
#>   name    height  mass hair_color skin_color eye_color birth_year sex  
#>   <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1 Ratts …     79    15 none       grey, blue unknown           NA male 
#> 2 Dexter…    198   102 none       brown      yellow            NA male 
#> 3 Ki-Adi…    198    82 white      pale       yellow            92 male 
#> 4 Mas Am…    196    NA none       blue       blue              NA male 
#> # ℹ 43 more rows
#> # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>