Encoding Categorical Data in R

Last Updated : 12 Mar, 2026

Encoding is the process of converting categorical data into numerical values. Categorical data is data whose values fall into a fixed set of categories or groups (such as colors or job titles). Many statistical and machine learning methods operate on numbers, so categorical variables usually need to be encoded into a numeric format before a model can process them.

Encoding Techniques

1. Label Encoding

Label Encoding is a technique that converts categorical variables into numeric values by assigning a unique integer to each category. In R, this can be done using the factor() function followed by as.integer() to get the numeric representation.

[Figure: Label Encoding]

Here we implement Label Encoding in R for the ‘color’ column using factor() and as.integer() to convert each category into a unique numeric value.

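```r
# Sample data with a categorical 'color' column
color <- c("red", "green", "blue", "blue", "red")
df <- data.frame(color)

# factor() sorts the levels alphabetically (blue, green, red);
# as.integer() returns each value's level index
df$color <- as.integer(factor(df$color))

df
```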

Output

  color
1     3
2     2
3     1
4     1
5     3
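Because factor() orders its levels alphabetically by default, the integer codes are only interpretable if the level order is kept alongside them. The following is a minimal sketch of how the mapping could be stored and reversed; the names `color_factor`, `codes`, `mapping` and `decoded` are illustrative and not part of the original example.

```r
color <- c("red", "green", "blue", "blue", "red")

# Keep the factor itself so the level order stays available
# (color_factor, codes, mapping, decoded are illustrative names)
color_factor <- factor(color)
codes <- as.integer(color_factor)

# Lookup table from integer code to original label
mapping <- data.frame(
  code  = seq_along(levels(color_factor)),
  label = levels(color_factor)
)
print(mapping)

# Decode the integers back to the original labels
decoded <- levels(color_factor)[codes]
print(identical(decoded, color))
```

Storing the lookup table with a trained model is one way to apply the same codes to new data later.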

2. One-Hot Encoding

One-Hot Encoding is a technique used to convert categorical data into a binary format, where each unique category becomes its own column. This ensures that models can process categorical variables without assuming any order between categories.

[Figure: One-Hot Encoding]

Here we implement One-Hot Encoding in R for the ‘gender’ column using model.matrix() to convert categories into binary columns. The - 1 in the formula removes the intercept, so every category gets its own column.

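```r
# Sample data with one categorical and two numeric columns
gender <- c("male", "female", "male", "male", "female")
age <- c(23, 34, 52, 21, 19)
income <- c(50000, 70000, 80000, 45000, 55000)
df <- data.frame(gender, age, income)

# "- 1" drops the intercept so each gender level gets its own 0/1 column
encoded_gender <- model.matrix(~ gender - 1, data = df)

encoded_gender
```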

Output

  genderfemale gendermale
1            0          1
2            1          0
3            0          1
4            0          1
5            1          0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
a...
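The matrix returned by model.matrix() is separate from the data frame, so a common next step is to attach the indicator columns to the other features. The sketch below also shows the dummy-encoding variant that keeps the intercept and therefore produces one column fewer. The names `one_hot`, `df_encoded` and `dummy` are illustrative assumptions, not part of the original example.

```r
gender <- c("male", "female", "male", "male", "female")
age <- c(23, 34, 52, 21, 19)
df <- data.frame(gender, age)

# Full one-hot encoding: "- 1" drops the intercept, one column per level
one_hot <- model.matrix(~ gender - 1, data = df)

# Attach the indicator columns to the remaining features
# (df_encoded is an illustrative name)
df_encoded <- cbind(df["age"], as.data.frame(one_hot))
print(df_encoded)

# Dummy encoding: with the intercept kept, the first level ("female")
# becomes the reference and only k - 1 indicator columns are produced
dummy <- model.matrix(~ gender, data = df)
print(colnames(dummy))
```

Dropping one level is often preferred for linear models with an intercept, where a full set of indicator columns would be perfectly collinear with it.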

3. Binary Encoding

Binary Encoding is a method that converts categorical variables into binary numbers and splits these binary digits across multiple columns. It is especially useful for features with many unique categories because it reduces the number of columns compared to One-Hot Encoding.

[Figure: Binary Encoding]

Here we implement Binary Encoding in base R: the ‘City’ column is first label-encoded, and each label is then written as a 4-bit binary string.

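```r
city <- c("Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai")
df <- data.frame(City = city)

# Zero-based integer label for each category
df$Label <- as.numeric(factor(df$City)) - 1

# Convert a non-negative integer to a 4-bit binary string:
# intToBits() lists bits least significant first, so take 4 and reverse
decimal_to_binary <- function(x) {
  paste(rev(as.integer(intToBits(x))[1:4]), collapse = "")
}
df$Binary <- sapply(df$Label, decimal_to_binary)

print(df)
```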

Output

     City Label Binary
1   Delhi     1   0001
2  Mumbai     2   0010
3   Delhi     1   0001
4 Chennai     0   0000
5  Mumbai     2   0010
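The example above stores the binary code as a single string, whereas the definition describes spreading the digits across several columns. One possible way to do that in base R is sketched here, using only as many bits as the largest label needs. The names `label`, `n_bits`, `bit_pos` and the `City_bit*` column names are illustrative assumptions.

```r
city <- c("Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai")
df <- data.frame(City = city)

# Zero-based label for each category (levels are sorted alphabetically)
label <- as.integer(factor(df$City)) - 1L

# Number of bits needed to represent the largest label
n_bits <- max(1, ceiling(log2(max(label) + 1)))

# Bit positions, most significant first (0 = least significant bit)
bit_pos <- rev(seq_len(n_bits) - 1)

# Integer division by 2^i followed by modulo 2 extracts bit i of each label
bits <- sapply(bit_pos, function(i) (label %/% 2^i) %% 2)
colnames(bits) <- paste0("City_bit", bit_pos)  # illustrative column names

df <- cbind(df, bits)
print(df)
```

With three cities this yields two columns instead of the three that One-Hot Encoding would create; the saving grows with the number of categories.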

4. Frequency Encoding

Frequency Encoding is a method that replaces each category with its occurrence frequency in the dataset. It provides a compact numerical representation of categorical data, making it suitable for datasets with many categories.

[Figure: Frequency Encoding]

Here we implement Frequency Encoding in R by replacing each category with its occurrence count in the dataset.

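```r
color <- c("red", "green", "blue", "blue", "red", "red")
df <- data.frame(color)

# Count how often each category occurs
freq_table <- table(df$color)

# Look up each row's count by category name; as.integer() drops the
# table class and names so the column is a plain integer vector
df$color_freq <- as.integer(freq_table[as.character(df$color)])

df
```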

Output

  color color_freq
1   red          3
2 green          1
3  blue          2
4  blue          2
5   red          3
6   red          3
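Raw counts depend on the size of the dataset, so a variant of the same idea uses relative frequencies instead. This is a minimal sketch using prop.table(); the names `rel_freq` and `color_rel_freq` are illustrative and not part of the original example.

```r
color <- c("red", "green", "blue", "blue", "red", "red")
df <- data.frame(color)

# prop.table() turns the counts into shares that sum to 1
rel_freq <- prop.table(table(df$color))

# Look up each row's share by category name; as.numeric() drops
# the table class and names (color_rel_freq is an illustrative name)
df$color_rel_freq <- as.numeric(rel_freq[as.character(df$color)])
print(df)
```

Note that categories with the same frequency receive the same value under either variant, so the encoding cannot tell them apart.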

5. Target/Mean Encoding

Target Encoding, also called Mean Encoding, replaces each category in a feature with the mean value of the target variable for that category. This allows models to capture the relationship between the categorical feature and the target.

[Figure: Target Encoding]

Here we implement Target/Mean Encoding in R by replacing each category with the mean of the target variable for that category.

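```r
df <- data.frame(
  product = c("A", "B", "A", "C", "B", "C", "A"),
  sales = c(200, 150, 220, 130, 160, 140, 210)
)

# Mean of the target (sales) for each product category
mean_sales <- tapply(df$sales, df$product, mean)

# Look up each row's category mean by name as a plain numeric vector
df$product_mean <- as.numeric(mean_sales[as.character(df$product)])

df
```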

Output

  product sales product_mean
1       A   200          210
2       B   150          155
3       A   220          210
4       C   130          135
5       B   160          155
6       C   140          1...
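Since the encoding is derived from the target, it is usually computed on training data only and then applied to new data, otherwise information about the target leaks into evaluation. The sketch below shows one way this might look, falling back to the overall training mean for categories that were never seen. The `test` data frame, the unseen product "D" and the fallback rule are assumptions made for illustration.

```r
train <- data.frame(
  product = c("A", "B", "A", "C", "B", "C", "A"),
  sales   = c(200, 150, 220, 130, 160, 140, 210)
)
# Hypothetical new data; "D" never appears in the training set
test <- data.frame(product = c("A", "C", "D"))

# Category means and the overall mean, both from the training data only
mean_sales  <- tapply(train$sales, train$product, mean)
global_mean <- mean(train$sales)

# match() returns NA for categories missing from the training data
idx <- match(as.character(test$product), names(mean_sales))
encoded <- as.numeric(mean_sales)[idx]

# Assumed fallback: use the overall training mean for unseen categories
encoded[is.na(encoded)] <- global_mean
test$product_mean <- encoded
print(test)
```

Other fallbacks, such as smoothing category means toward the global mean, are also used in practice; the choice here is only one option.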

6. Ordinal Encoding

Ordinal Encoding is a method that assigns integer values to categories based on their natural order or ranking. Unlike Label Encoding, the order of categories is meaningful and preserved in the numeric representation.

[Figure: Ordinal Encoding]

Here we implement Ordinal Encoding in R by assigning integers to categories based on their natural order.

```r
education <- c("High School", "Bachelor", "Master", "Bachelor", "PhD")
df <- data.frame(education)

# Levels listed from lowest to highest rank
education_levels <- c("High School", "Bachelor", "Master", "PhD")

# An ordered factor with explicit levels gives codes that follow the ranking
df$education_encoded <- as.integer(
  factor(df$education, levels = education_levels, ordered = TRUE)
)
df
```

Output

    education education_encoded
1 High School                 1
2    Bachelor                 2
3      Master                 3
4    Bachelor                 2
5         PhD                 4
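To see why the explicit levels argument matters, the sketch below contrasts the default alphabetical coding with the ranked coding, and shows that an ordered factor can be compared against a level directly. The variable names `alphabetical_codes`, `ranked_codes` and `at_least_master` are illustrative and not part of the original example.

```r
education <- c("High School", "Bachelor", "Master", "Bachelor", "PhD")
education_levels <- c("High School", "Bachelor", "Master", "PhD")

# Without explicit levels, factor() sorts alphabetically and the
# ranking is lost ("Bachelor" is coded below "High School")
alphabetical_codes <- as.integer(factor(education))
print(alphabetical_codes)  # 2 1 3 1 4

# With explicit levels, the codes follow the intended ranking
edu <- factor(education, levels = education_levels, ordered = TRUE)
ranked_codes <- as.integer(edu)
print(ranked_codes)        # 1 2 3 2 4

# Ordered factors also support rank comparisons against a level name
at_least_master <- edu >= "Master"
print(at_least_master)
```

Keeping the ordered factor rather than only the integers preserves both the labels and the ranking for later steps.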