GitHub - tidymodels/themis: Extra recipes steps for dealing with unbalanced data

themis

R-CMD-check Codecov test coverage CRAN status Downloads Lifecycle: maturing

themis contains extra steps for the recipes package for dealing with unbalanced data. The name themis is that of the ancient Greek goddess who is typically depicted with a balance.

Installation

You can install the released version of themis from CRAN with:

install.packages("themis")

Install the development version from GitHub with:

install.packages("pak")

pak::pak("tidymodels/themis")

Example

The following is an example of using the SMOTE algorithm to deal with unbalanced data:

library(recipes)
library(modeldata)
library(themis)

data("credit_data")

credit_data0 <- credit_data %>% filter(!is.na(Job))

count(credit_data0, Job)
#>         Job    n
#> 1     fixed 2805
#> 2 freelance 1024
#> 3    others  171
#> 4   partime  452

ds_rec <- recipe(Job ~ Time + Age + Expenses, data = credit_data0) %>%
  step_impute_mean(all_predictors()) %>%
  step_smote(Job, over_ratio = 0.25) %>%
  prep()

ds_rec %>%
  bake(new_data = NULL) %>%
  count(Job)
#> # A tibble: 4 × 2
#>   Job           n
#>   <fct>     <int>
#> 1 fixed      2805
#> 2 freelance  1024
#> 3 others      701
#> 4 partime     701

Methods

Below is some unbalanced data, used for the examples later.

example_data <- data.frame(class = letters[rep(1:5, 1:5 * 10)], x = rnorm(150))

library(ggplot2)

example_data %>% ggplot(aes(class)) + geom_bar()

Bar chart with 5 columns. class on the x-axis and count on the y-axis. Class a has height 10, b has 20, c has 30, d has 40, and e has 50.

Upsample / Over-sampling

The following methods all share the tuning parameter over_ratio, which is the ratio of the minority-to-majority frequencies.

| name | function | Multi-class |
|------|----------|-------------|
| Random minority over-sampling with replacement | step_upsample() | ✔️ |
| Synthetic Minority Over-sampling Technique | step_smote() | ✔️ |
| Borderline SMOTE-1 | step_bsmote(method = 1) | ✔️ |
| Borderline SMOTE-2 | step_bsmote(method = 2) | ✔️ |
| Adaptive synthetic sampling approach for imbalanced learning | step_adasyn() | ✔️ |
| Generation of synthetic data by Randomly Over Sampling Examples | step_rose() | |

By setting over_ratio = 1 you bring the number of samples of all minority classes equal to 100% of the majority class.

recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 1) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()

Bar chart with 5 columns. class on the x-axis and count on the y-axis. class a, b, c, d, and e all have a height of 50.

By setting over_ratio = 0.5, we upsample any minority class with fewer samples than 50% of the majority class up to 50% of the majority class.

recipe(~., example_data) %>%
  step_upsample(class, over_ratio = 0.5) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()

Bar chart with 5 columns. class on the x-axis and count on the y-axis. Class a has height 25, b has 25, c has 30, d has 40, and e has 50.
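The arithmetic behind over_ratio can be sketched in base R. This is a minimal illustration, not the code themis runs internally: the helper name upsample_targets() is hypothetical, and rounding the target down is an assumption (it is consistent with the 701 rows shown for 25% of 2805 in the SMOTE example above).

```r
# Class counts from example_data: a = 10, b = 20, c = 30, d = 40, e = 50.
class_counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Hypothetical helper: the class sizes we would expect after over-sampling.
# Assumption: the target is floor(over_ratio * size of the largest class).
upsample_targets <- function(counts, over_ratio) {
  target <- floor(max(counts) * over_ratio)
  # Classes already at or above the target are left unchanged.
  pmax(counts, target)
}

upsample_targets(class_counts, over_ratio = 1)
# a = 50, b = 50, c = 50, d = 50, e = 50

upsample_targets(class_counts, over_ratio = 0.5)
# a = 25, b = 25, c = 30, d = 40, e = 50
```

The two results match the heights described for the two bar charts above.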

Downsample / Under-sampling

Most of the following methods share the tuning parameter under_ratio, which is the ratio of the majority-to-minority frequencies.

| name | function | Multi-class | under_ratio |
|------|----------|-------------|-------------|
| Random majority under-sampling with replacement | step_downsample() | ✔️ | ✔️ |
| NearMiss-1 | step_nearmiss() | ✔️ | ✔️ |
| Extraction of majority-minority Tomek links | step_tomek() | | |

By setting under_ratio = 1 you bring the number of samples of all majority classes equal to 100% of the minority class.

recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 1) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()

Bar chart with 5 columns. class on the x-axis and count on the y-axis. Class a, b, c, d, and e all have a height of 10.

By setting under_ratio = 2, we downsample any majority class with more than 200% of the samples of the minority class down to 200% of the samples of the minority class.

recipe(~., example_data) %>%
  step_downsample(class, under_ratio = 2) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ggplot(aes(class)) +
  geom_bar()

Bar chart with 5 columns. class on the x-axis and count on the y-axis. Class a has height 10, and b, c, d, and e have a height of 20.
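The same kind of arithmetic applies to under_ratio. The base R sketch below is only an illustration of the expected class sizes, not the code themis runs internally: the helper name downsample_targets() is hypothetical, and rounding the target down is an assumption.

```r
# Class counts from example_data: a = 10, b = 20, c = 30, d = 40, e = 50.
class_counts <- c(a = 10, b = 20, c = 30, d = 40, e = 50)

# Hypothetical helper: the class sizes we would expect after under-sampling.
# Assumption: the target is floor(under_ratio * size of the smallest class).
downsample_targets <- function(counts, under_ratio) {
  target <- floor(min(counts) * under_ratio)
  # Classes already at or below the target are left unchanged.
  pmin(counts, target)
}

downsample_targets(class_counts, under_ratio = 1)
# a = 10, b = 10, c = 10, d = 10, e = 10

downsample_targets(class_counts, under_ratio = 2)
# a = 10, b = 20, c = 20, d = 20, e = 20
```

The two results match the heights described for the two bar charts in this section.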

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.