dplyr Package in R Programming
Last Updated : 02 May, 2025
The dplyr package for R offers efficient data manipulation functions. It makes data transformation and summarization simple with concise, readable syntax.
Key Features of dplyr
**Data Frame and Tibble**
dplyr operates on data frames, which are organized tables where each column stores one type of information, such as names, ages, or scores. Creating a data frame involves specifying the column names and their respective values.
R `
df <- data.frame(
  Name = c("vipul", "jayesh", "anurag"),
  Age = c(25, 23, 22),
  Score = c(95, 89, 78)
)
df
`
**Output:**
Name Age Score
1 vipul 25 95
2 jayesh 23 89
3 anurag 22 78
On the other hand, tibbles, introduced through the tibble package, share similar functionality but offer more user-friendly behavior, such as compact printing that shows each column's type and never converting strings to factors. The syntax for creating a tibble is comparable to that of a data frame.
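As a minimal sketch of that comparison, the same table can be built with `tibble()`, and an existing data frame can be converted with `as_tibble()`. This assumes the tibble package is installed; the object names `tb`, `df`, and `tb2` are illustrative only.

```r
library(tibble)

# Same columns as the data frame example, built with tibble()
tb <- tibble(
  Name  = c("vipul", "jayesh", "anurag"),
  Age   = c(25, 23, 22),
  Score = c(95, 89, 78)
)
print(tb)

# An existing data frame can be converted instead of rebuilt
df <- data.frame(Name = c("vipul", "jayesh"), Age = c(25, 23))
tb2 <- as_tibble(df)
class(tb2)
```

Printing `tb` shows the column types under the column names, which a plain data frame does not do.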
**Pipes (%>%)**
The pipe operator (%>%), which dplyr re-exports from the magrittr package, passes the result of one step as the first argument of the next. This lets us chain multiple operations together, improving code readability.
R `
library(dplyr)
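result <- mtcars %>%
  filter(mpg > 20) %>%            # keep cars with mpg above 20
  select(mpg, cyl, hp) %>%        # keep only these three columns
  group_by(cyl) %>%               # group by number of cylinders
  summarise(mean_hp = mean(hp))   # mean horsepower per group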
print(result)
`
**Output:**
cyl mean_hp
<dbl> <dbl>
1 4 82.6
2 6 110
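To make the effect of the pipe concrete, the pipeline can also be written as nested function calls, since `x %>% f(y)` is equivalent to `f(x, y)`. The sketch below is only an illustration; `nested` and `piped` are names chosen here, not part of dplyr.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- summarise(
  group_by(select(filter(mtcars, mpg > 20), mpg, cyl, hp), cyl),
  mean_hp = mean(hp)
)

# Piped form: reads top to bottom in the order the steps run
piped <- mtcars %>%
  filter(mpg > 20) %>%
  select(mpg, cyl, hp) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))

# Both forms compute the same group means
all.equal(nested$mean_hp, piped$mean_hp)
```

The two versions do the same work; the piped form is usually preferred because each step sits on its own line.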
Important dplyr Functions
dplyr provides several core functions for data manipulation:
**filter()**
Selects rows (cases) that satisfy conditions on their values.
R `
# Create a data frame with missing data
d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)
print(d)

# Find rows where ht is NA
r_w_na <- d %>% filter(is.na(ht))
print(r_w_na)

# Find rows where ht is not NA
r_wo_na <- d %>% filter(!is.na(ht))
print(r_wo_na)
`
**Output:**
name age ht school
1 Abhi 7 46 yes
2 Bhavesh 5 NA yes
3 Chaman 9 NA no
4 Dimri 16 69 no
name age ht school
1 Bhavesh 5 NA yes
2 Chaman 9 NA no
name age ht school
1 Abhi 7 46 yes
2 Dimri 16 69 no
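filter() also accepts several conditions at once. The following sketch, which reuses the same example data, shows conditions combined with AND (comma) and OR (`|`), and how a comparison on a column containing NA behaves. The result names `both`, `either`, and `tall` are illustrative.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Comma-separated conditions are combined with AND
both <- filter(d, age > 6, school == "yes")
print(both)

# Use | for OR
either <- filter(d, age < 6 | school == "no")
print(either)

# Rows where the condition evaluates to NA are dropped
tall <- filter(d, ht > 50)
print(tall)
```

Because `ht > 50` is NA for the rows with missing height, those rows are excluded without an explicit `!is.na(ht)` check.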
**arrange()**
Reorders rows (cases) by the values of one or more columns.
R `
# Create a data frame with missing data
d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)
d

# Arrange rows by age in ascending order
d.name <- arrange(d, age)
print(d.name)
`
**Output:**
name age ht school
1 Abhi 7 46 yes
2 Bhavesh 5 NA yes
3 Chaman 9 NA no
4 Dimri 16 69 no
name age ht school
1 Bhavesh 5 NA yes
2 Abhi 7 46 yes
3 Chaman 9 NA no
4 Dimri 16 69 no
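arrange() sorts in ascending order by default. As a small sketch on the same data, `desc()` reverses the order, several columns can be given as tie-breakers, and missing values are placed at the end. The result names below are illustrative.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Descending order of age
by_age_desc <- arrange(d, desc(age))
print(by_age_desc)

# Sort by school first, then by age within each school value
by_school_age <- arrange(d, school, age)
print(by_school_age)

# Rows with NA in the sort column are placed last
by_ht <- arrange(d, ht)
print(by_ht)
```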
**select() and rename()**
select() chooses columns (variables) based on their names or positions. rename() changes column names while keeping every column.
R `
d <- data.frame(name=c("Abhi", "Bhavesh", "Chaman", "Dimri"), age=c(7, 5, 9, 16), ht=c(46, NA, NA, 69), school=c("yes", "yes", "no", "no"))
# starts_with() keeps only the columns whose names start with "ht"
select(d, starts_with("ht"))

# Everything except the ht column
select(d, -starts_with("ht"))

# Columns 1 to 2, by position
select(d, 1:2)

# Columns whose names contain "a"
select(d, contains("a"))

# Columns whose names match the regular expression "na"
select(d, matches("na"))
`
**Output:**
ht
1 46
2 NA
3 NA
4 69
name age school
1 Abhi 7 yes
2 Bhavesh 5 yes
3 Chaman 9 no
4 Dimri 16 no
name age
1 Abhi 7
2 Bhavesh 5
3 Chaman 9
4 Dimri 16
name age
1 Abhi 7
2 Bhavesh 5
3 Chaman 9
4 Dimri 16
name
1 Abhi
2 Bhavesh
3 Chaman
4 Dimri
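The examples above cover select() only. As a brief sketch of rename() on the same data, note that the argument order is `new_name = old_name`; the new column names `height` and `student` are made up for illustration.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# rename() keeps all columns and changes only the given names
renamed <- rename(d, height = ht)
names(renamed)

# select() can also rename, but it keeps only the listed columns
picked <- select(d, student = name, age)
names(picked)
```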
**mutate() and transmute()**
Add new variables that are functions of existing variables. mutate() keeps the existing columns, while transmute() keeps only the new ones.
R `
d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"), age = c(7, 5, 9, 16), ht = c(46, NA, NA, 69), school = c("yes", "yes", "no", "no"))
# Add 'x3' as the sum of height and age, keeping all columns
mutate(d, x3 = ht + age)

# Add 'x3' as the sum of height and age, keeping only 'x3'
transmute(d, x3 = ht + age)
`
**Output:**
name age ht school x3
1 Abhi 7 46 yes 53
2 Bhavesh 5 NA yes NA
3 Chaman 9 NA no NA
4 Dimri 16 69 no 85
x3
1 53
2 NA
3 NA
4 85
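A single mutate() call can create several columns, and later expressions may refer to columns created earlier in the same call. The sketch below illustrates this on the same data; the column names `ht_missing`, `age_next_year`, and `age_in_two` are hypothetical.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

d2 <- mutate(
  d,
  ht_missing = is.na(ht),           # flag rows with no height recorded
  age_next_year = age + 1,          # derived from an existing column
  age_in_two = age_next_year + 1    # derived from the column created just above
)
print(d2)
```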
**summarise()**
Condenses multiple values into a single summary value.
R `
d <- data.frame( name = c("Abhi", "Bhavesh", "Chaman", "Dimri"), age = c(7, 5, 9, 16), ht = c(46, NA, NA, 69), school = c("yes", "yes", "no", "no") )
# Mean, minimum, maximum, and median of age
summarise(d, mean = mean(age))
summarise(d, min = min(age))
summarise(d, max = max(age))
summarise(d, med = median(age))
`
**Output:**
mean
1 9.25
min
1 5
max
1 16
med
1 8
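summarise() is most often combined with group_by(), and several summaries can be computed in one call. The following sketch uses the same data; the output column names are illustrative, and `na.rm = TRUE` is used because `ht` contains missing values.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

by_school <- d %>%
  group_by(school) %>%
  summarise(
    n = n(),                          # rows per group
    mean_age = mean(age),             # mean age per group
    mean_ht = mean(ht, na.rm = TRUE)  # ignore missing heights
  )
print(by_school)

# Without na.rm = TRUE, a single NA makes the whole summary NA
summarise(d, mean_ht = mean(ht))
```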
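**sample_n() and sample_frac()**
Take random samples of rows: sample_n() selects a fixed number of rows, and sample_frac() selects a fraction of them. Because the sampling is random, the rows returned vary between runs.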
R `
d <- data.frame( name = c("Abhi", "Bhavesh", "Chaman", "Dimri"), age = c(7, 5, 9, 16), ht = c(46, NA, NA, 69), school = c("yes", "yes", "no", "no") )
# Sample three rows at random
sample_n(d, 3)

# Sample 50% of the rows at random
sample_frac(d, 0.50)
`
**Output:**
name age ht school
1 Chaman 9 NA no
2 Dimri 16 69 no
3 Abhi 7 46 yes
name age ht school
1 Abhi 7 46 yes
2 Dimri 16 69 no
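To get the same sample on every run, the random seed can be fixed with `set.seed()` first. The sketch below also shows `slice_sample()`, which newer dplyr versions (1.0.0 and later) provide as the successor to both functions. The seed value 42 and the result names are arbitrary choices for illustration.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Fixing the seed makes the random sample repeatable
set.seed(42)
first_draw <- sample_n(d, 3)
set.seed(42)
second_draw <- sample_n(d, 3)
identical(first_draw, second_draw)

# slice_sample() covers both cases: n for a count, prop for a fraction
three_rows <- slice_sample(d, n = 3)
half_rows <- slice_sample(d, prop = 0.5)
nrow(three_rows)
nrow(half_rows)
```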