Regression using k-Nearest Neighbors in R Programming

Last Updated : 1 Jul, 2025

K-Nearest Neighbors (K-NN) is a machine learning algorithm used for both classification and regression tasks. It is a lazy learner, meaning it does not build an explicit model during training. Instead, it stores the training data and consults it whenever a new data point needs a predicted class or value.

For regression, K-NN predicts the value of a target variable by averaging the values of its nearest neighbors. This method is based on the idea that similar data points will likely have similar outcomes. In this article, we will focus on using K-NN for regression in R, where we will predict continuous values based on the features of the data points.

Working of K-NN algorithm

The K-Nearest Neighbors (K-NN) algorithm predicts the target value of a new data point by averaging the values of its nearest neighbors. Here's how it works:

  1. **Select k:** Choose the number of neighbors k to consider for prediction. A small k follows the training data closely and is sensitive to noise, while a large k gives smoother predictions. An odd k is mainly useful in classification, where it avoids tied votes.
  2. **Measure Distance:** Calculate the distance (often Euclidean) between the new data point and every point in the training set. For 2D points $(x_1, y_1)$ and $(x_2, y_2)$, the Euclidean distance is: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
  3. **Identify Neighbors:** Sort the distances and select the k nearest neighbors.
  4. **Predict Outcome:** For regression, predict the target value $y_{\text{pred}}$ by averaging the target values of the k nearest neighbors: $y_{\text{pred}} = \frac{1}{k} \sum_{i=1}^{k} y_i$, where $y_i$ is the target value of the i-th neighbor.

The prediction is based on the average of the neighbors' target values and no explicit model is needed as the algorithm directly uses the training data.
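The four steps above can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used by any package: the function name `knn_regress` and the toy data are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Minimal K-NN regression for a single query point (illustrative sketch).
knn_regress <- function(train_x, train_y, query, k = 3) {
  train_x <- as.matrix(train_x)
  # Step 2: Euclidean distance from the query to every training row
  diffs <- sweep(train_x, 2, query, FUN = "-")
  dists <- sqrt(rowSums(diffs^2))
  # Step 3: indices of the k smallest distances
  nearest <- order(dists)[seq_len(k)]
  # Step 4: average the neighbors' target values
  mean(train_y[nearest])
}

# Hypothetical toy data: two features and a continuous target
train_x <- rbind(c(1, 1), c(2, 2), c(3, 3), c(6, 6), c(7, 7))
train_y <- c(10, 20, 30, 60, 70)

pred <- knn_regress(train_x, train_y, query = c(2.5, 2.5), k = 3)
print(pred)  # mean of 20, 30 and 10, i.e. 20
```

The three closest rows to the query are (2, 2), (3, 3) and (1, 1), so the prediction is the mean of their targets. No model is fitted: every prediction recomputes distances against the stored training data.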

Implementation of K-NN Algorithm for Regression in R

We will now implement the K-NN algorithm using the R programming language and perform regression.

1. Installing Required Packages

To implement K-NN in R, we need the following packages:

```r
install.packages("caTools")  # sample.split() for the train/test split
install.packages("class")    # knn()
install.packages("ggplot2")  # plotting
```

2. Loading Libraries and Importing the Dataset

We will load the necessary libraries and import the dataset. For this example, we'll use a dataset that includes customer information such as age, salary and whether or not they purchased a product.

You can download the dataset from here : Advertisement.csv

```r
library(caTools)
library(class)
library(ggplot2)

# Read the dataset and preview the first rows
dataset = read.csv('Advertisement.csv')
head(dataset)
```

**Output:**

[Figure: Dataset]

3. Preprocessing the Data

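Before applying K-NN, we split the dataset into training and testing sets and scale the two feature columns (columns 3 and 4, Age and EstimatedSalary). Scaling matters because K-NN relies on distances, so a feature with a large numeric range would otherwise dominate. The test set is scaled with the mean and standard deviation of the training set so that both sets share the same scale.

```r
# Split into 75% training and 25% test, preserving the ratio of Purchased labels
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Scale the features; reuse the training mean and sd for the test set
train_scaled = scale(training_set[, c(3, 4)])
training_set[, c(3, 4)] = train_scaled
test_set[, c(3, 4)] = scale(test_set[, c(3, 4)],
                            center = attr(train_scaled, "scaled:center"),
                            scale = attr(train_scaled, "scaled:scale"))
```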
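To see why feature scaling affects K-NN, the sketch below compares nearest neighbors before and after scaling. The four customers are hypothetical values made up for illustration (they are not taken from Advertisement.csv), and `nearest_to_first` is a helper name introduced here.

```r
# Hypothetical customers: age in years, salary in currency units
customers <- data.frame(
  Age = c(25, 26, 60, 45),
  EstimatedSalary = c(30000, 32000, 30500, 80000)
)

# Index of the row closest to row 1, by Euclidean distance
nearest_to_first <- function(x) {
  d <- as.matrix(dist(x))[1, ]  # distances from row 1 to every row
  d[1] <- Inf                   # ignore the point itself
  unname(which.min(d))
}

nearest_raw <- nearest_to_first(customers)
nearest_scaled <- nearest_to_first(scale(customers))
print(c(raw = nearest_raw, scaled = nearest_scaled))
```

On the raw values, salary differences in the hundreds swamp the 35-year age gap, so the 60-year-old (row 3) comes out as the closest match to the 25-year-old. After scaling, both features contribute comparably and the 26-year-old (row 2) is the nearest neighbor.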

4. Applying K-NN for Regression

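We will now apply the K-NN algorithm to the training set and predict the outcomes for the test set, using the knn() function from the class package. Note that knn() predicts a class label by majority vote among the k nearest neighbors rather than averaging their values, so with the binary Purchased target this step is carried out as a classification.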

```r
# Fit and predict in one call: features in columns 3-4, target in column 5
y_pred = knn(train = training_set[, c(3, 4)],
             test = test_set[, c(3, 4)],
             cl = training_set[, 5],
             k = 5,
             prob = TRUE)
```
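Because knn() returns class labels, it can help to compare its majority vote with the averaging used in regression. The sketch below does this on hypothetical one-dimensional toy data (not the Advertisement dataset). It assumes the class package is installed, and reads the vote share from the `prob` attribute that knn() attaches when `prob = TRUE`.

```r
library(class)

# Hypothetical 1-D training data with a binary label
train_x <- matrix(c(1, 2, 3, 6, 7, 8), ncol = 1)
labels  <- c(0, 0, 1, 1, 1, 1)
query   <- matrix(2.5, ncol = 1)

# knn() takes a majority vote among the k nearest labels
vote <- knn(train = train_x, test = query, cl = factor(labels),
            k = 3, prob = TRUE)

# A regression-style estimate averages the same neighbors' labels instead
nearest <- order(abs(train_x[, 1] - query[1, 1]))[1:3]
avg <- mean(labels[nearest])

print(as.character(vote))  # predicted class label
print(attr(vote, "prob"))  # share of votes for the winning class
print(avg)                 # mean of the neighbors' 0/1 labels
```

The three nearest points carry labels 0, 1 and 0, so the vote picks class 0 with two of three votes, while the average of the same labels is one third. With a 0/1 target, the average can be read as an estimated probability of class 1.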

5. Evaluating the Model

We will create a confusion matrix to evaluate how well the model performs on the test set.

```r
# Confusion matrix: rows are actual labels, columns are predicted labels
cm = table(test_set[, 5], y_pred)

# Convert to a data frame for plotting
cm_df <- as.data.frame(cm)
colnames(cm_df) <- c("Actual", "Predicted", "Count")

# Axis labels match the mapped variables (x = Actual, y = Predicted)
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Count)) +
  geom_tile() +
  geom_text(aes(label = Count), color = "white", size = 5) +
  scale_fill_gradient(low = "lightblue", high = "steelblue") +
  theme_minimal() +
  labs(title = "Confusion Matrix Heatmap", x = "Actual", y = "Predicted") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

**Output:**

[Figure: Evaluating the Model]
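A confusion matrix suits a categorical target. When the target is continuous, an error measure such as the root mean squared error (RMSE) is the usual alternative. The sketch below is an illustration under stated assumptions: the data are synthetic (y = sin(x) plus noise), the seed and the candidate values of k are arbitrary, and `knn_predict` and `rmse` are helper names introduced here.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Synthetic data (an assumption for illustration): y = sin(x) + noise
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)
train_idx <- sample(200, 150)
x_train <- x[train_idx];  y_train <- y[train_idx]
x_test  <- x[-train_idx]; y_test  <- y[-train_idx]

# Predict each new point as the mean target of its k nearest training points
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(q) {
    nearest <- order(abs(x_train - q))[seq_len(k)]
    mean(y_train[nearest])
  })
}

# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Compare test error across several candidate values of k
ks <- c(1, 3, 5, 10, 25)
scores <- sapply(ks, function(k) {
  rmse(y_test, knn_predict(x_train, y_train, x_test, k))
})
print(data.frame(k = ks, rmse = round(scores, 3)))
```

Comparing the RMSE across values of k is one simple way to choose k: very small values tend to chase noise, while very large values smooth away real structure.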

6. Visualizing the Results

We will visualize the results using a plot to show how well the model has fit the training and test data.

**6.1. Training Set:** We will visualize the decision boundary of the model over the training data using the ggplot2 library.

```r
set1 = training_set

# Build a grid covering the scaled feature space
X1 = seq(min(set1[, 3]) - 1, max(set1[, 3]) + 1, by = 0.01)
X2 = seq(min(set1[, 4]) - 1, max(set1[, 4]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')

# Predict a class for every grid point
y_grid = knn(train = training_set[, 3:4], test = grid_set, cl = training_set[, 5], k = 5)

# Shaded tiles show the predicted region; points show the actual labels
ggplot(data = grid_set, aes(x = Age, y = EstimatedSalary)) +
  geom_tile(aes(fill = factor(y_grid)), alpha = 0.3) +
  geom_point(data = set1, aes(color = factor(Purchased))) +
  labs(title = "K-NN Classification (Training Set)", x = "Age", y = "Estimated Salary") +
  scale_fill_manual(values = c("tomato", "springgreen3")) +
  scale_color_manual(values = c("red3", "green4"))
```

**Output:**

[Figure: Training Set]

**6.2. Testing Set:** We will visualize the decision boundary of the model over the test data using the ggplot2 library.

```r
set2 = test_set

# Build a grid covering the scaled feature space
X1 = seq(min(set2[, 3]) - 1, max(set2[, 3]) + 1, by = 0.01)
X2 = seq(min(set2[, 4]) - 1, max(set2[, 4]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')

# Predict a class for every grid point (the model still uses the training data)
y_grid = knn(train = training_set[, 3:4], test = grid_set, cl = training_set[, 5], k = 5)

# Shaded tiles show the predicted region; points show the actual test labels
ggplot(data = grid_set, aes(x = Age, y = EstimatedSalary)) +
  geom_tile(aes(fill = factor(y_grid)), alpha = 0.3) +
  geom_point(data = set2, aes(color = factor(Purchased))) +
  labs(title = "K-NN Classification (Test Set)", x = "Age", y = "Estimated Salary") +
  scale_fill_manual(values = c("tomato", "springgreen3")) +
  scale_color_manual(values = c("red3", "green4"))
```

**Output:**

[Figure: Testing Set]

Advantages of K-Nearest Neighbors (K-NN)

Disadvantages of K-Nearest Neighbors (K-NN)