Different Variants of Gradient Descent

Last Updated : 29 Sep, 2025

Gradient descent is an optimization algorithm used in machine learning to minimize a function by iteratively moving toward its minimum. It matters because tuning a model's parameters this way reduces prediction error. This article explores different variants of the gradient descent algorithm.
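To make "iteratively moving toward the minimum" concrete, here is a minimal sketch on a one-dimensional function. The function f(x) = (x - 3)^2, the helper name `gradient_descent_1d`, and the step size are illustrative assumptions, not part of the article.

```python
def gradient_descent_1d(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # move downhill by lr times the slope
    return x

# Illustrative objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
# and whose minimum sits at x = 3.
x_min = gradient_descent_1d(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)  # very close to 3.0
```

Each step shrinks the distance to the minimum by a constant factor here, which is why a modest number of iterations is enough.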

Figure: Different Variants of Gradient Descent

1. Batch Gradient Descent

Batch Gradient Descent is a variant of the gradient descent algorithm where the entire dataset is used to compute the gradient of the loss function with respect to the parameters. In each iteration the algorithm calculates the average gradient of the loss function for all the training examples and updates the model parameters accordingly.

Figure: Batch Gradient Descent

The update rule for batch gradient descent is:

\theta = \theta - \eta \nabla J(\theta)

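where \theta is the parameter vector, \eta is the learning rate, and \nabla J(\theta) is the gradient of the loss function J averaged over the entire training set.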

Python Implementation

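```python
def batch_gradient_descent(X, y, theta, lr=0.01, epochs=100):
    m = len(y)  # number of training examples
    for _ in range(epochs):
        # Average gradient of the halved mean squared error over the whole dataset
        gradients = (1 / m) * X.T @ (X @ theta - y)
        theta -= lr * gradients  # in-place update; theta must be a float array
    return theta
```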

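**Advantages**

- Stable convergence: every update uses the gradient over the entire dataset, giving a global view of the data.

**Disadvantages**

- Slow and computationally expensive on large datasets, since each update requires a full pass over the data.
- High memory usage, because the entire dataset is processed at once.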
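As a quick check of the implementation above, the following sketch fits a line to synthetic data. The data (y = 1 + 2x, noiseless), the seed, and the hyperparameters are assumptions chosen only for illustration; the function is repeated so the block runs on its own.

```python
import numpy as np

def batch_gradient_descent(X, y, theta, lr=0.01, epochs=100):
    m = len(y)
    for _ in range(epochs):
        gradients = (1 / m) * X.T @ (X @ theta - y)
        theta -= lr * gradients
    return theta

# Synthetic, noiseless data generated from y = 1 + 2x (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])  # bias column plus one feature
y = X @ np.array([1.0, 2.0])

theta = batch_gradient_descent(X, y, np.zeros(2), lr=0.1, epochs=2000)
print(theta)  # should be close to [1, 2]
```

Because every epoch touches all 200 rows, the cost per update grows linearly with the dataset size, which is the main drawback noted above.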

2. Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm where the model parameters are updated using the gradient of the loss function with respect to a single training example at each iteration. Unlike batch gradient descent, which uses the entire dataset for every update, SGD updates the parameters after each example. The updates are cheaper and far more frequent, which often gives faster initial progress at the cost of noisier steps.

Figure: Stochastic Gradient Descent

The update rule for SGD is:

\theta = \theta - \eta \nabla J(\theta; x^{(i)}, y^{(i)})

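where (x^{(i)}, y^{(i)}) is a single training example, \theta is the parameter vector, and \eta is the learning rate.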

Python Implementation

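```python
import numpy as np

def stochastic_gradient_descent(X, y, theta, lr=0.01, epochs=100):
    m = len(y)  # number of training examples
    for _ in range(epochs):
        for i in range(m):  # examples are visited in their stored order
            # Slice rather than index so xi stays a one-row matrix
            xi = X[i:i + 1]
            yi = y[i:i + 1]
            gradient = xi.T @ (xi @ theta - yi)  # gradient from one example
            theta -= lr * gradient  # in-place update after every example
    return theta
```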

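**Advantages**

- Low memory usage, since only one example is processed at a time.
- Frequent updates, which makes it a good fit for online learning.

**Disadvantages**

- Noisy updates: the loss fluctuates from step to step instead of decreasing smoothly.
- Less efficient per example, because single-example updates cannot take advantage of vectorized computation.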
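The implementation above visits examples in their stored order. A common variation, which is not what the article's code does, reshuffles the examples every epoch. The sketch below shows that variation under stated assumptions: `sgd_shuffled` is a hypothetical name, `y` and `theta` are one-dimensional arrays, and the data are synthetic.

```python
import numpy as np

def sgd_shuffled(X, y, theta, lr=0.01, epochs=100, seed=0):
    """SGD for linear regression that reshuffles the examples each epoch."""
    rng = np.random.default_rng(seed)
    theta = theta.astype(float).copy()  # leave the caller's array untouched
    m = len(y)
    for _ in range(epochs):
        for i in rng.permutation(m):      # new random visiting order
            error = X[i] @ theta - y[i]   # scalar residual for one example
            theta -= lr * error * X[i]    # single-example gradient step
    return theta

# Synthetic, noiseless data from y = 1 + 2x (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([1.0, 2.0])

theta_sgd = sgd_shuffled(X, y, np.zeros(2), lr=0.05, epochs=50)
print(theta_sgd)  # should be close to [1, 2]
```

With noisy targets the iterates would keep fluctuating around the solution rather than settling exactly; the noiseless data here hide that effect.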

3. Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. Instead of using the entire dataset or a single training example, it updates the model parameters using a small, random subset of the training data called a mini-batch.

Figure: Mini-Batch Gradient Descent

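The update rule for Mini-Batch Gradient Descent is:

\theta = \theta - \eta \nabla J(\theta; \{x^{(i)}, y^{(i)}\}_{i=1}^m)

where \{x^{(i)}, y^{(i)}\}_{i=1}^m is a mini-batch of m training examples (here m is the mini-batch size, not the size of the dataset), \theta is the parameter vector, and \eta is the learning rate.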

Python Implementation

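```python
import numpy as np

def mini_batch_gradient_descent(X, y, theta, lr=0.01, epochs=100, batch_size=32):
    m = len(y)  # number of training examples
    for _ in range(epochs):
        # Shuffle once per epoch so each mini-batch is a random subset
        indices = np.random.permutation(m)
        X_shuffled, y_shuffled = X[indices], y[indices]
        for i in range(0, m, batch_size):
            xb = X_shuffled[i:i + batch_size]
            yb = y_shuffled[i:i + batch_size]
            # Divide by len(yb) because the last batch may be smaller
            gradient = (1 / len(yb)) * xb.T @ (xb @ theta - yb)
            theta -= lr * gradient  # in-place update after every mini-batch
    return theta
```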

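**Advantages**

- Balances the speed of SGD with the stability of batch gradient descent.
- Efficient and parallelizable, since each mini-batch is processed with vectorized operations.

**Disadvantages**

- The batch size is an additional hyperparameter that has to be tuned.
- Updates are still noisy, though less so than with single-example SGD.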
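The batching step can be isolated into a small generator, sketched below. The name `iterate_minibatches` and the toy arrays are assumptions for illustration. The example also shows why the implementation above divides by `len(yb)` rather than `batch_size`: the final batch can be short.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering one shuffled pass over the data."""
    indices = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

# Toy data: 10 examples with 2 features each (illustrative only).
rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

batch_sizes = [len(yb) for _, yb in iterate_minibatches(X, y, batch_size=4, rng=rng)]
print(batch_sizes)  # [4, 4, 2]
```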

4. Momentum-Based Gradient Descent

Momentum-Based Gradient Descent is an enhancement of the standard gradient descent algorithm that aims to accelerate convergence, particularly in the presence of high curvature, small but consistent gradients, or noisy gradients. It introduces a velocity term that accumulates an exponentially decaying sum of past gradients, thereby smoothing the path taken by the parameters.

Figure: Momentum-Based Gradient Descent

The update rule for Momentum-Based Gradient Descent is:

v_t = \gamma v_{t-1} + \eta \nabla J(\theta_t)

\theta_{t+1} = \theta_t - v_t

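where v_t is the velocity at step t, \gamma is the momentum coefficient (0.9 in the implementation below), \eta is the learning rate, and \nabla J(\theta_t) is the gradient of the loss at the current parameters.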

Python Implementation

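```python
import numpy as np

def momentum_gradient_descent(X, y, theta, lr=0.01, epochs=100, gamma=0.9):
    m = len(y)  # number of training examples
    v = np.zeros_like(theta)  # velocity starts at zero
    for _ in range(epochs):
        gradient = (1 / m) * X.T @ (X @ theta - y)  # full-batch gradient
        v = gamma * v + lr * gradient  # decay old velocity, add the new step
        theta -= v  # in-place update; theta must be a float array
    return theta
```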

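**Advantages**

- Accelerated convergence, especially when gradients point in a consistent direction.
- Smoother updates, because the velocity averages out step-to-step noise.

**Disadvantages**

- Adds a momentum coefficient \gamma that has to be tuned alongside the learning rate.
- Can overshoot the minimum and oscillate when \gamma is too large.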
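The effect of the velocity term is easiest to see on a badly conditioned problem. The sketch below applies the update rule above to an assumed diagonal quadratic with curvatures 1 and 100; the objective, starting point, and step count are illustrative choices. Setting `gamma=0.0` recovers plain gradient descent.

```python
import numpy as np

# Illustrative objective: J(theta) = 0.5 * sum(curvature * theta**2),
# so the gradient is curvature * theta and the minimum is at the origin.
curvature = np.array([1.0, 100.0])

def loss(theta):
    return 0.5 * np.sum(curvature * theta ** 2)

def run(gamma, lr=0.01, steps=100):
    theta = np.array([1.0, 1.0])
    v = np.zeros_like(theta)
    for _ in range(steps):
        gradient = curvature * theta
        v = gamma * v + lr * gradient  # same velocity update as above
        theta = theta - v
    return loss(theta)

loss_plain = run(gamma=0.0)     # plain gradient descent
loss_momentum = run(gamma=0.9)  # with momentum
print(loss_plain, loss_momentum)
```

With the same learning rate and number of steps, the momentum run should end at a noticeably lower loss, because the velocity keeps building up along the low-curvature direction where plain gradient descent crawls.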

Comparison between the variants of Gradient Descent

| Variant | Data Used | Convergence | Memory Usage | Efficiency | Key Advantage |
|---|---|---|---|---|---|
| **Batch Gradient Descent** | Entire dataset | Stable but slow | High (entire dataset) | Computationally expensive | Stable convergence, global view of data |
| **Stochastic Gradient Descent** | One example per iteration | Fast but noisy | Low (one example) | Less efficient | Faster convergence, good for online learning |
| **Mini-Batch Gradient Descent** | Mini-batch of data | Faster and smoother | Medium (mini-batch) | Efficient, parallelizable | Balance of speed and stability |
| **Momentum-Based Gradient Descent** | Entire dataset or mini-batch | Faster and smoother | Medium (like Mini-Batch) | Efficient with momentum | Accelerated convergence, smooth updates |
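The "Data Used" column translates directly into how many parameter updates each variant performs in one pass over the data. The arithmetic is sketched below; the dataset size of 1000 and the helper name `updates_per_epoch` are assumptions for illustration.

```python
import math

def updates_per_epoch(m, batch_size):
    """Number of parameter updates in one pass over m examples."""
    return math.ceil(m / batch_size)

m = 1000  # illustrative dataset size
print(updates_per_epoch(m, batch_size=m))   # batch gradient descent: 1
print(updates_per_epoch(m, batch_size=1))   # stochastic gradient descent: 1000
print(updates_per_epoch(m, batch_size=32))  # mini-batch gradient descent: 32
```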