Stratified K Fold Cross Validation (original) (raw)

Last Updated : 15 Jul, 2025

**Stratified K-Fold Cross Validation is a technique used for evaluating a model. It is particularly useful for classification problems in which the class labels are not evenly distributed i.e data is imbalanced. It is a enhanced version of K-Fold Cross Validation. Key difference is that it uses **stratification which allows original distribution of each class to be maintained across each fold.

For example, if your original dataset had 80% Class 0 and 20% Class 1 your folds would reflect the same proportion of classes in your data. This creates improved and more reliable accuracy metrics.

Problem with Random Splitting

Random splitting techniques like train_test_split() or regular K-Fold can create problem if they produce imbalanced class proportions in the training and test sets. For example imagine a binary classification dataset with 100 samples where:

Using random sampling in an 80:20 split then all 80 Class 0 in the training set and all 20 Class 1 in the test set. In this case model will never learn to classify Class 1 and would give misleading accuracy.

Now, let’s use **stratified sampling on same dataset:

**1. Training Set (80 samples):

**2. Test Set (20 samples):

This ensures that both training and test sets provide an accurate representation of the full dataset's class proportions and better generalization in the evaluation set.

In real-world classification tasks distribution of observations per class is often imbalanced like in a medical dataset it could be the case that 90% of patients are healthy (Class 0) and 10% have a disease (Class 1). If we randomly split this data there may be some training/test sets that have very few sample or even no samples for the minority class that where Stratified K Fold Cross Validation becomes important.

**Implementation of Stratified K-Fold Cross-Validation

1. Importing Required Libraries

We will be using statistics and scikit learn module.

Python `

from statistics import mean, stdev from sklearn import preprocessing from sklearn.model_selection import StratifiedKFold from sklearn import linear_model from sklearn import datasets

`

2. Loading Dataset and Extracting Features

Here we will be using breast cancer dataset available in scikit learn.

cancer = datasets.load_breast_cancer()

x = cancer.data y = cancer.target

`

3. Feature Scaling (Normalization)

scaler = preprocessing.MinMaxScaler() x_scaled = scaler.fit_transform(x)

`

4. Model and K-Fold Object Setup

Here we will be using logistic regression model.

lr = linear_model.LogisticRegression()

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1) lst_accu_stratified = []

`

5. Applying Stratified K-Fold and Training Model

for train_index, test_index in skf.split(x, y): x_train_fold, x_test_fold = x_scaled[train_index], x_scaled[test_index] y_train_fold, y_test_fold = y[train_index], y[test_index] lr.fit(x_train_fold, y_train_fold) lst_accu_stratified.append(lr.score(x_test_fold, y_test_fold))

`

6. Printing Accuracy Results

print('List of possible accuracy:', lst_accu_stratified) print('\nMaximum Accuracy That can be obtained from this model is:', max(lst_accu_stratified)*100, '%') print('\nMinimum Accuracy:', min(lst_accu_stratified)*100, '%') print('\nOverall Accuracy:', mean(lst_accu_stratified)*100, '%') print('\nStandard Deviation is:', stdev(lst_accu_stratified))

`

**Output:

Screenshot-2025-04-15-160827

Here we can see that we got a overall accuracy of 96.6% and standard deviation of 0.02 which means our model is working fine.

By using Stratified K-Fold Cross Validation we can ensure that our machine learning model is evaluated fairly and consistently leading to more accurate predictions and better real-world performance.