Momentum-based Gradient Optimizer - ML

Last Updated : 12 May, 2026

Momentum-based optimizers improve standard gradient descent by adding a momentum term that accumulates past gradients, which helps the parameters move more efficiently across the loss surface.

Momentum in Gradient Optimization

Momentum is inspired by physics, where movement depends on both current force and past velocity. In optimization, it helps smooth the learning process by incorporating past gradients, leading to faster and more stable convergence.

**Formula:**

v_{t+1} = \beta v_t + (1 - \beta) \nabla L(w_t)

w_{t+1} = w_t - \eta v_{t+1}

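**Where:**

v_t is the velocity at step t, w_t is the weight vector at step t, \nabla L(w_t) is the gradient of the loss L at w_t, \beta is the momentum coefficient, and \eta is the learning rate.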

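Understanding Hyperparameters:

\beta (momentum coefficient) sets how much of the previous velocity carries over into the next one. It lies in [0, 1), and 0.9 is a common default; with \beta = 0 the update reduces to plain gradient descent. \eta (learning rate) scales the step taken along the velocity.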

Working of the Algorithm:

  1. **Velocity Update:** The velocity v_t is updated by combining the previous velocity, which carries the momentum, with the current gradient. The momentum factor \beta controls how much the previous velocity contributes to the update.
  2. **Weight Update:** The weights are updated using the velocity v_{t+1}, which is an exponentially weighted average of the past gradients and the current gradient.
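
The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `momentum_step` and `minimize_momentum`, the toy loss L(w) = (w - 3)^2, and the values \eta = 0.1 and \beta = 0.9 are assumptions chosen for the example.

```python
def momentum_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """Apply one momentum update and return the new (w, v)."""
    g = grad_fn(w)
    v_next = beta * v + (1.0 - beta) * g  # velocity update
    w_next = w - lr * v_next              # weight update
    return w_next, v_next


def minimize_momentum(grad_fn, w0, steps=300, lr=0.1, beta=0.9):
    """Run several momentum steps starting from w0 with zero velocity."""
    w, v = w0, 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, grad_fn, lr, beta)
    return w


def toy_grad(w):
    """Gradient of the assumed toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w_final = minimize_momentum(toy_grad, w0=0.0)
print(w_final)  # should end up close to the minimizer w = 3
```

Starting the velocity at zero means the first step is a scaled-down gradient descent step; later steps blend in the accumulated direction.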

Types of Momentum-Based Optimizers

There are several variations of momentum-based optimizers, each with slight modifications to the basic momentum algorithm.

1. **Nesterov Accelerated Gradient (NAG)**

Nesterov Accelerated Gradient is an improved version of momentum optimization that computes the gradient at a look-ahead position, leading to more accurate and faster updates.

**Formula:**

v_{t+1} = \beta v_t + \nabla L(w_t - \eta \beta v_t)

w_{t+1} = w_t - \eta v_{t+1}
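
A minimal sketch of the NAG update, following the formula exactly as written here (the gradient term is not scaled by 1 - \beta). The name `nag_step`, the toy loss L(w) = (w - 3)^2, and the hyperparameter values are assumptions made for illustration.

```python
def nag_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """Apply one Nesterov update and return the new (w, v)."""
    lookahead = w - lr * beta * v           # where momentum alone would carry w
    v_next = beta * v + grad_fn(lookahead)  # gradient taken at the look-ahead point
    w_next = w - lr * v_next                # weight update
    return w_next, v_next


def toy_grad(w):
    """Gradient of the assumed toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w, v = 0.0, 0.0
for _ in range(200):
    w, v = nag_step(w, v, toy_grad)
print(w)  # should end up close to the minimizer w = 3
```

The only difference from the basic momentum step is where the gradient is evaluated: at `lookahead` instead of at `w`.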

2. **AdaMomentum**

AdaMomentum combines an adaptive learning rate with momentum: the velocity is updated as in the basic method, but the step size \eta_t can change at every step, allowing the optimizer to adjust based on recent gradient information.

**Formula:**

v_{t+1} = \beta v_t + (1 - \beta)\nabla L(w_t)

w_{t+1} = w_t - \eta_t \, v_{t+1}
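
The formula leaves the rule for choosing \eta_t open, so the sketch below takes the step size as an argument supplied by the caller. The inverse-time decay used to produce \eta_t is a placeholder assumption for illustration only, not the AdaMomentum rule, and the names `adaptive_momentum_step` and `inverse_time_decay` are hypothetical.

```python
def inverse_time_decay(t, lr0=0.1, decay=0.01):
    """Hypothetical rule for eta_t (an assumption, not taken from the source)."""
    return lr0 / (1.0 + decay * t)


def adaptive_momentum_step(w, v, grad_fn, lr_t, beta=0.9):
    """Apply one momentum update with a per-step learning rate lr_t."""
    g = grad_fn(w)
    v_next = beta * v + (1.0 - beta) * g  # same velocity update as basic momentum
    w_next = w - lr_t * v_next            # step size may differ at every step
    return w_next, v_next


def toy_grad(w):
    """Gradient of the assumed toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w, v = 0.0, 0.0
for t in range(300):
    w, v = adaptive_momentum_step(w, v, toy_grad, inverse_time_decay(t))
print(w)  # should end up close to the minimizer w = 3
```

Any other rule for \eta_t, including one driven by gradient statistics, could be swapped in without touching the update step.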

3. **RMSProp (Root Mean Square Propagation)**

RMSProp adapts the learning rate for each parameter by dividing the step by the square root of an exponential moving average of squared gradients. This helps improve training stability, especially for complex and non-stationary problems.

**Formula:**

s_{t+1} = \beta s_t + (1 - \beta)\left(\nabla L(w_t)\right)^2

w_{t+1} = w_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon} \nabla L(w_t)
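
A minimal sketch of the RMSProp update for a single scalar parameter. The name `rmsprop_step`, the toy loss L(w) = (w - 3)^2, and the values \eta = 0.05, \beta = 0.9, \epsilon = 1e-8 are assumptions made for illustration.

```python
import math


def rmsprop_step(w, s, grad_fn, lr=0.05, beta=0.9, eps=1e-8):
    """Apply one RMSProp update and return the new (w, s)."""
    g = grad_fn(w)
    s_next = beta * s + (1.0 - beta) * g * g        # moving average of squared gradients
    w_next = w - lr / (math.sqrt(s_next) + eps) * g  # gradient step scaled by 1 / sqrt(s)
    return w_next, s_next


def toy_grad(w):
    """Gradient of the assumed toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w, s = 0.0, 0.0
for _ in range(500):
    w, s = rmsprop_step(w, s, toy_grad)
print(w)  # should end up near the minimizer w = 3
```

Because the gradient is divided by its own running magnitude, the step length stays roughly on the order of \eta, so with a fixed \eta the iterate tends to settle into a small neighborhood of the minimum rather than converging to it exactly.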

Advantages

Challenges