dplyr Package in R Programming

Last Updated : 02 May, 2025

The dplyr package for R offers efficient data manipulation functions. It makes data transformation and summarization simple with concise, readable syntax.

Key Features of dplyr

**Data Frame and Tibble**

A data frame is R's standard tabular structure: each column stores one type of information, such as names, ages, or scores, and dplyr's functions operate on it directly. To create a data frame, specify the column names and their values.

```r
# Create a data frame with three columns
df <- data.frame(Name = c("vipul", "jayesh", "anurag"),
                 Age = c(25, 23, 22),
                 Score = c(95, 89, 78))
df
```

**Output:**

    Name Age Score
1  vipul  25    95
2 jayesh  23    89
3 anurag  22    78

Tibbles, provided by the tibble package, are a modern variant of the data frame: they print only the first few rows along with each column's type, and they never convert strings to factors or change column names. The syntax for creating a tibble is comparable to that of a data frame.
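As a minimal sketch of that comparison, the block below builds the same three columns with tibble() and converts an existing data frame with as_tibble(). It assumes the tibble package is installed; the variable names tb, df2, and tb2 are illustrative.

```r
library(tibble)

# Same columns as the data frame example, built with tibble() instead of data.frame()
tb <- tibble(
  Name  = c("vipul", "jayesh", "anurag"),
  Age   = c(25, 23, 22),
  Score = c(95, 89, 78)
)

# Printing a tibble also shows its dimensions and each column's type
print(tb)

# An existing data frame can be converted with as_tibble()
df2 <- data.frame(Name = c("vipul", "jayesh", "anurag"), Age = c(25, 23, 22))
tb2 <- as_tibble(df2)
print(tb2)
```

Because a tibble is still a data frame underneath, it can be passed to any dplyr function in the same way.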

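**Pipes ( %>% )**

The pipe operator (%>%), which dplyr re-exports from the magrittr package, passes the result of one expression as the first argument of the next function. This lets us chain multiple operations together and improves code readability.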

```r
library(dplyr)

# Keep cars with mpg above 20, then average horsepower per cylinder count
result <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, hp) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))

print(result)
```

**Output:**

    cyl mean_hp
  <dbl>   <dbl>
1     4    82.6
2     6   110
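To see what the pipe is doing, it can help to compare a piped chain with the equivalent nested calls. The following is a small sketch on the same mtcars data; the variable names piped and nested are illustrative.

```r
library(dplyr)

# Piped form: the left-hand side becomes the first argument of the next call
piped <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, hp)

# Equivalent nested form: same result, but read from the inside out
nested <- select(filter(mtcars, mpg > 20), mpg, cyl, hp)

# Both forms produce the same data frame
identical(piped, nested)
```

The nested form becomes harder to read as more steps are added, which is the main reason the piped style is preferred for longer chains.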

Important dplyr Functions

dplyr provides several core functions, often called verbs, for data manipulation. These are:

**filter()**

filter() keeps the rows that satisfy one or more conditions on their column values. Rows for which a condition evaluates to NA are dropped.

```r
library(dplyr)

# Create a data frame with missing data
d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))
print(d)

# Find rows where ht is NA
rows_with_na <- d %>% filter(is.na(ht))
print(rows_with_na)

# Find rows where ht is not NA
rows_without_na <- d %>% filter(!is.na(ht))
print(rows_without_na)
```

**Output:**

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no

     name age ht school
1 Bhavesh   5 NA    yes
2  Chaman   9 NA     no

   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no
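filter() also accepts several conditions at once. The sketch below reuses the same data frame to show how conditions combine; the result names in_school_over_5 and young_or_tall are illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Comma-separated conditions are combined with AND
in_school_over_5 <- filter(d, school == "yes", age > 5)
print(in_school_over_5)

# Use | for OR; a row is dropped when the combined condition is FALSE or NA
young_or_tall <- filter(d, age < 6 | ht > 60)
print(young_or_tall)
```

In the second call, Chaman's row is dropped because `FALSE | NA` evaluates to NA, while Bhavesh's row is kept because `TRUE | NA` evaluates to TRUE.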

**arrange()**

arrange() reorders the rows of a data frame by the values of one or more columns, in ascending order by default.

```r
library(dplyr)

# Create a data frame with missing data
d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))
d

# Arrange rows in ascending order of age
d_by_age <- arrange(d, age)
print(d_by_age)
```

**Output:**

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no

     name age ht school
1 Bhavesh   5 NA    yes
2    Abhi   7 46    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no
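arrange() can also sort in descending order and by more than one column. A short sketch on the same data frame follows; the result names by_age_desc, by_school_age, and by_ht are illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# desc() reverses the order: oldest first
by_age_desc <- arrange(d, desc(age))
print(by_age_desc)

# Several columns: sort by school, then by age within each school value
by_school_age <- arrange(d, school, age)
print(by_school_age)

# Missing values are always placed at the end
by_ht <- arrange(d, ht)
print(by_ht)
```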

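**select() and rename()**

select() keeps only the columns you name, either directly or through helper functions such as starts_with(), contains(), and matches(). rename() changes column names while keeping every column.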

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# starts_with() selects columns whose names begin with "ht"
select(d, starts_with("ht"))

# Everything except the ht column
select(d, -starts_with("ht"))

# Columns 1 and 2, selected by position
select(d, 1:2)

# Columns whose names contain "a"
select(d, contains("a"))

# Columns whose names match the regular expression "na"
select(d, matches("na"))
```

**Output:**

  ht
1 46
2 NA
3 NA
4 69

     name age school
1    Abhi   7    yes
2 Bhavesh   5    yes
3  Chaman   9     no
4   Dimri  16     no

     name age
1    Abhi   7
2 Bhavesh   5
3  Chaman   9
4   Dimri  16

     name age
1    Abhi   7
2 Bhavesh   5
3  Chaman   9
4   Dimri  16

     name
1    Abhi
2 Bhavesh
3  Chaman
4   Dimri
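The examples above exercise select() only, so here is a small sketch of rename() on the same data frame. The new column names height and student are chosen purely for illustration.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# rename() keeps every column; the syntax is new_name = old_name
renamed <- rename(d, height = ht)
names(renamed)

# select() can also rename while selecting, but drops the columns not listed
picked <- select(d, student = name, age)
names(picked)
```

Use rename() when the rest of the table should stay intact, and select() with a new name when you are narrowing the table anyway.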

**mutate() and transmute()**

mutate() adds new columns computed from existing ones and keeps all other columns; transmute() keeps only the newly computed columns.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Add 'x3' as the sum of height and age, keeping all columns
mutate(d, x3 = ht + age)

# Add 'x3' as the sum of height and age, keeping only 'x3'
transmute(d, x3 = ht + age)
```

**Output:**

     name age ht school x3
1    Abhi   7 46    yes 53
2 Bhavesh   5 NA    yes NA
3  Chaman   9 NA     no NA
4   Dimri  16 69     no 85

  x3
1 53
2 NA
3 NA
4 85
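The NA values in x3 come from the missing heights, since any arithmetic with NA yields NA. The sketch below shows one possible way to handle this inside a single mutate() call, where a later expression can use a column created earlier in the same call. Filling the gaps with the mean height is only an illustrative assumption, not a recommendation, and the column name ht_filled is hypothetical.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

result <- mutate(
  d,
  # coalesce() replaces each NA in ht with the mean of the known heights
  ht_filled = coalesce(ht, mean(ht, na.rm = TRUE)),
  # x3 uses ht_filled, which was created on the line above
  x3 = ht_filled + age
)
print(result)
```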

**summarise()**

summarise() condenses many values into a single summary value, such as a mean, minimum, maximum, or median.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Each call collapses the age column into a single summary value
summarise(d, mean = mean(age))
summarise(d, min = min(age))
summarise(d, max = max(age))
summarise(d, med = median(age))
```

**Output:**

  mean
1 9.25

  min
1   5

  max
1  16

  med
1   8
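summarise() is most often combined with group_by(), which produces one summary row per group instead of one row for the whole table. The following is a minimal sketch on the same data frame; the output column names n, mean_age, and mean_ht are illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

by_school <- d %>%
  group_by(school) %>%
  summarise(
    n = n(),                          # number of rows in each group
    mean_age = mean(age),             # average age per group
    mean_ht = mean(ht, na.rm = TRUE)  # na.rm = TRUE ignores missing heights
  )
print(by_school)
```

Without na.rm = TRUE, mean_ht would be NA for both groups, because each group contains one missing height.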

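**sample_n() and sample_frac()**

sample_n() draws a fixed number of random rows, and sample_frac() draws a fixed fraction of the rows. Because the rows are chosen at random, the output differs from run to run.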

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Print three randomly chosen rows
sample_n(d, 3)

# Print a random 50% of the rows
sample_frac(d, 0.50)
```

**Output:**

    name age ht school
1 Chaman   9 NA     no
2  Dimri  16 69     no
3   Abhi   7 46    yes

   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no
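In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(), which covers both cases through its n and prop arguments. The sketch below assumes such a version is installed and uses set.seed() to make the random draw repeatable; the seed value 42 and the variable names are arbitrary.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Fixing the seed makes the same rows come back on every run
set.seed(42)
first_draw <- slice_sample(d, n = 3)

set.seed(42)
second_draw <- slice_sample(d, n = 3)

# The two draws match because the seed was reset before each one
identical(first_draw, second_draw)

# prop works like sample_frac(): half of the four rows gives two rows
half <- slice_sample(d, prop = 0.5)
print(half)
```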