KNN Classifier in R Programming

Last Updated : 2 May, 2025

K-Nearest Neighbors (KNN) is a supervised, non-linear classification algorithm. It is also non-parametric, meaning it makes no assumptions about the underlying data or its distribution.

Algorithm Structure

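In the KNN algorithm, K specifies the number of neighbors to consider. To classify a new data point, the algorithm computes its distance to every point in the training set, selects the K closest points, and assigns the class that is most common among them.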

For the Nearest Neighbor classifier, the distance between two points is expressed as the Euclidean distance.
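For two points p and q with the same number of features, the Euclidean distance is the square root of the sum of squared differences between their coordinates. A minimal sketch in base R follows; the function name `euclidean_distance` is our own and not part of any package.

```r
# Euclidean distance between two numeric vectors of equal length
euclidean_distance <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((p - q)^2))
}

# Distance between the points (0, 0) and (3, 4)
d <- euclidean_distance(c(0, 0), c(3, 4))
d
```

Because the distance sums raw differences, features measured on larger scales contribute more to it, which is why the features are scaled before fitting the model later on.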

Example:

Consider a dataset containing points from two classes, Red and Blue, that we want to classify. Here K = 5, meaning we consider the 5 nearest neighbors according to Euclidean distance.

So, when a new data point enters, out of 5 neighbors, if 3 are Blue and 2 are Red, we assign the new data point to the category with most neighbors (in this case that will be Blue).
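The voting step can be sketched from scratch in base R. The toy coordinates below are made up for illustration, and the helper `knn_predict_one` is a hypothetical name, not a function from the class package.

```r
# Toy training data (hypothetical values chosen for illustration)
train_points <- data.frame(
  x = c(1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 6.5, 7.0),
  y = c(1.0, 2.0, 1.5, 2.5, 1.0, 6.0, 7.0, 6.5),
  label = c("Blue", "Blue", "Blue", "Red", "Red", "Red", "Red", "Red")
)
new_point <- c(x = 2.0, y = 2.0)

knn_predict_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  diffs <- sweep(as.matrix(train), 2, query)
  distances <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points
  nearest_labels <- labels[order(distances)[seq_len(k)]]
  # Majority vote: the most frequent label wins
  votes <- table(nearest_labels)
  names(votes)[which.max(votes)]
}

predicted <- knn_predict_one(train_points[, c("x", "y")],
                             train_points$label, new_point, k = 5)
predicted
```

With these toy values, the five nearest neighbors of the new point are 3 Blue and 2 Red, so the sketch assigns it to Blue, mirroring the example above.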

Implementation of KNN

We will perform the K-Nearest Neighbor Algorithm in R programming language using the Iris dataset.

1. Installing the Required Packages

We will install the class package, which is used to fit a KNN model, caTools for splitting our dataset into training and testing sets, and ggplot2 for plotting the results.

```r
# Install the packages (only needed once)
install.packages("caTools")
install.packages("class")
install.packages("ggplot2")

# Load the packages
library(caTools)
library(class)
library(ggplot2)
```

2. Importing the Dataset

We will use the Iris dataset, a built-in dataset in R that contains 50 samples from each of 3 species of Iris (Iris setosa, Iris virginica, Iris versicolor). We will use the str() function to show the feature names and structure of the dataset.

```r
# Load the built-in dataset and inspect its structure
data(iris)
str(iris)
```

Output:

Figure: Structure of the data
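The class balance of the dataset can also be checked directly. A short sketch using only base R; the variable names `species_counts` and `feature_summary` are our own.

```r
data(iris)

# Number of samples per species
species_counts <- table(iris$Species)
species_counts

# Range and quartiles of the four numeric features
feature_summary <- summary(iris[, 1:4])
feature_summary
```

The summary shows that the four features span different ranges, which matters for a distance-based method such as KNN.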

3. Splitting data into train and test data

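We first split the iris dataset into training and testing sets using a 70:30 ratio. The split is made on the Species labels so that each species keeps roughly the same proportion in both sets. Then, we scale the four numeric feature columns, applying the training set's mean and standard deviation to both sets so that the test data is transformed in the same way as the training data. Because sample.split() draws rows at random, the exact split (and the results below) will vary between runs unless a seed is fixed with set.seed().

```r
# Split on the label vector so each species keeps a 70:30 ratio
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)

# Scale the four numeric features using the training mean and SD
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4],
                    center = attr(train_scale, "scaled:center"),
                    scale = attr(train_scale, "scaled:scale"))
```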
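As a quick sanity check, scaled training columns should have a mean of about 0 and a standard deviation of 1. The sketch below uses only base R; the index-based split with `sample()` and the fixed seed are assumptions made for this illustration, standing in for the caTools split.

```r
data(iris)

# Arbitrary seed so this sketch gives the same rows on every run
set.seed(42)

# Hypothetical base-R stand-in for the split: pick 70% of the row indices
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train_example <- iris[train_idx, ]
test_example <- iris[-train_idx, ]

# Scale the four numeric features of the training rows
train_example_scaled <- scale(train_example[, 1:4])

# After scaling, each column should have mean ~0 and standard deviation 1
col_means <- colMeans(train_example_scaled)
col_sds <- apply(train_example_scaled, 2, sd)

# Set sizes should add up to the full dataset
c(train = nrow(train_example), test = nrow(test_example))
```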

4. Fitting KNN Model

We fit a KNN model using the scaled training data, with k = 1. The model then predicts species labels for the test set based on the nearest neighbor from the training set. The Species column of the training set is passed as the class labels through the cl argument.

```r
# Predict the species of each test row from its single nearest training neighbor
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$Species,
                      k = 1)
```
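knn() returns a factor with one predicted label per test row. With prob = TRUE, it also attaches the share of votes for the winning class as the "prob" attribute. A small sketch on made-up data (the matrices and the labels "A" and "B" are hypothetical, chosen so the two groups are clearly separated):

```r
library(class)

# Hypothetical toy data: two well-separated groups
train_x <- matrix(c(1.0, 1.0,
                    1.5, 2.0,
                    2.0, 1.5,
                    6.0, 6.0,
                    6.5, 7.0,
                    7.0, 6.5), ncol = 2, byrow = TRUE)
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

test_x <- matrix(c(1.2, 1.4,
                   6.8, 6.4), ncol = 2, byrow = TRUE)

# Predicted labels, plus the vote share of the winning class
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3, prob = TRUE)
vote_share <- attr(pred, "prob")

pred
vote_share
```

A vote share well below 1 signals a test point whose neighbors disagree, which can help when inspecting misclassified rows.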

5. Displaying a Confusion Matrix

We create a confusion matrix to compare the predicted labels with the actual species in the test set. This helps us evaluate how well the KNN model classified each species.

```r
# Rows are the actual species, columns are the predicted species
cm <- table(test_cl$Species, classifier_knn)
cm
```

Output:

Figure: Confusion Matrix of the KNN model
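Summary metrics can be derived from a confusion matrix: the diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total. The sketch below uses hypothetical label vectors for illustration only; they are not results from the iris model.

```r
# Hypothetical labels, for illustration only (not results from the iris model)
actual <- factor(c("setosa", "setosa", "versicolor",
                   "versicolor", "virginica", "virginica"))
predicted <- factor(c("setosa", "setosa", "versicolor",
                      "virginica", "virginica", "virginica"),
                    levels = levels(actual))

# Rows are actual classes, columns are predicted classes
cm_example <- table(Actual = actual, Predicted = predicted)

# Overall accuracy: correct predictions divided by all predictions
accuracy <- sum(diag(cm_example)) / sum(cm_example)

# Per-class recall: correct predictions divided by actual count per class
recall_per_class <- diag(cm_example) / rowSums(cm_example)

accuracy
recall_per_class
```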

6. Evaluating the Model for different K values

We test multiple values of k to find the most suitable one for our KNN model. For each k, we calculate the misclassification error on the test set, convert it to accuracy, and plot the accuracy against k. This helps in selecting a k that balances bias and variance for better model performance.

```r
library(ggplot2)

# Candidate values of k to compare
k_values <- c(1, 3, 5, 7, 15, 19)

# Accuracy on the test set for each k (1 - misclassification error)
accuracy_values <- sapply(k_values, function(k) {
  classifier_knn <- knn(train = train_scale,
                        test = test_scale,
                        cl = train_cl$Species,
                        k = k)
  1 - mean(classifier_knn != test_cl$Species)
})

accuracy_data <- data.frame(K = k_values, Accuracy = accuracy_values)

# Plot accuracy against k (linewidth requires ggplot2 >= 3.4.0; older versions use size)
ggplot(accuracy_data, aes(x = K, y = Accuracy)) +
  geom_line(color = "lightblue", linewidth = 1) +
  geom_point(color = "lightgreen", size = 3) +
  labs(title = "Model Accuracy for Different K Values",
       x = "Number of Neighbors (K)",
       y = "Accuracy") +
  theme_minimal()
```

Output:

Figure: KNN model performance

From the graph, we can compare the accuracy obtained for each value of k.

In the run shown here, the optimal value of k for our model is 5.
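The best k can also be selected programmatically instead of being read off the plot. In the sketch below, the accuracy values are hypothetical placeholders, not the results of the model above; `accuracy_example` is our own name for a data frame shaped like `accuracy_data`.

```r
# Hypothetical accuracy values, for illustration only
accuracy_example <- data.frame(
  K = c(1, 3, 5, 7, 15, 19),
  Accuracy = c(0.93, 0.95, 0.97, 0.97, 0.95, 0.93)
)

# which.max() returns the first maximum, so ties resolve to the smallest k
best_k <- accuracy_example$K[which.max(accuracy_example$Accuracy)]

# All values of k that reach the maximum accuracy
tied_k <- accuracy_example$K[accuracy_example$Accuracy == max(accuracy_example$Accuracy)]

best_k
tied_k
```

Preferring the smallest k among ties is only one possible convention; an odd k is often chosen to reduce the chance of tied votes.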

In this article, we implemented the K-Nearest Neighbors (KNN) algorithm on the iris dataset and evaluated model accuracy across different values of k. In our run, accuracy peaked at k = 5 and k = 7, which shows the importance of tuning k for good performance.