CrossValidation in R programming (original) (raw)

Cross-Validation in R programming

Last Updated : 15 Jul, 2025

Cross-validation is an essential technique in machine learning used to assess the performance and accuracy of a model. The primary goal is to ensure that the model is not overfitting to the training data and that it will perform well on unseen, real-world data. Cross-validation involves partitioning the dataset into multiple subsets, training the model on some subsets and testing it on the remaining subsets.

Cross-Validation_

Cross-Validation in R programming

Cross-validation helps to address two major concerns:

Implementation of Cross-Validation in R

We will implement various methods like Validation Set Approach, Leave-One-Out Cross-Validation (LOOCV), K-Fold Cross-Validation and Repeated K-Fold Cross-Validation in R programming language.

Installing the Required Libraries and Loading the Dataset

To begin, we will install and load required packages and import the marketing dataset. We will be using the marketing dataset in R to demonstrate how cross-validation works.

install.packages("tidyverse") install.packages("caret") install.packages("datarium")

library(tidyverse) library(caret)

data("marketing", package = "datarium")

head(marketing)

`

**Output:

market_data

Marketing Dataset

1. Validation Set Approach

The Validation Set Approach, randomly split the dataset into training (80%) and testing (20%) sets. We then train a linear regression model using the training data. and evaluate it on the testing set using RMSE, MAE and R-Square metrics.

set.seed(123)

random_sample <- createDataPartition(marketing$sales, p = 0.8, list = FALSE) training_dataset <- marketing[random_sample, ] testing_dataset <- marketing[-random_sample, ]

model <- lm(sales ~ ., data = training_dataset)

predictions <- predict(model, testing_dataset)

df= data.frame( R2 = R2(predictions, testing_dataset$sales), RMSE = RMSE(predictions, testing_dataset$sales), MAE = MAE(predictions, testing_dataset$sales) )

df

`

**Output:

validationset

Validation Set Approach

2. Leave-One-Out Cross-Validation (LOOCV)

LOOCV splits the dataset into N-1 data points for training and 1 data point for testing. This is repeated for every data point and the average of the prediction errors is calculated.

train_control <- trainControl(method = "LOOCV")

model <- train(sales ~ ., data = marketing, method = "lm", trControl = train_control) print(model)

`

**Output:

loocv

Leave-One-Out Cross-Validation (LOOCV)

3. K-Fold Cross-Validation

In K-fold cross-validation, we split the data into K subsets (folds). The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times. Then we calculate the average prediction error.

train_control <- trainControl(method = "cv", number = 10)

model <- train(sales ~ ., data = marketing, method = "lm", trControl = train_control) print(model)

`

**Output:

kfold

K-Fold Cross-Validation

4. Repeated K-Fold Cross-Validation

Repeated K-Fold Cross-Validation method repeats K-fold cross-validation multiple times. This helps to further reduce the variance of the performance metrics. We split the data into K subsets and then train the model on K-1 subsets and test it on the left-out fold. This process is repeated for a specified number of times.

train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

model <- train(sales ~ ., data = marketing, method = "lm", trControl = train_control) print(model)

`

**Output:

rkfold

Repeated K-Fold Cross-Validation

In this article, we demonstrated different cross-validation techniques in R to evaluate the performance of a linear regression model. We covered the Validation Set Approach, LOOCV, K-Fold Cross-Validation and Repeated K-Fold Cross-Validation.