Adagrad Optimizer in Deep Learning

Last Updated : 12 May, 2026

Adagrad is an optimization method that adapts the learning rate for each parameter based on past gradients, improving learning for features with different frequencies.

Working of Adagrad Algorithm

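Adagrad divides each parameter's learning rate by the square root of that parameter's accumulated squared gradients. Parameters that receive large or frequent gradients therefore take smaller steps, while rarely updated parameters keep taking larger ones. The algorithm proceeds in four steps: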

**1. Initialization:** Adagrad begins by initializing the parameter values randomly, just like other optimization algorithms. Additionally, it initializes a running sum of squared gradients for each parameter, which will track the gradients over time.

**2. Gradient Calculation:** For each training step, the gradient of the loss function with respect to the model's parameters is calculated, just like in standard gradient descent.

**3. Adaptive Learning Rate:** Adagrad adjusts the learning rate for each parameter based on the accumulated sum of squared gradients, instead of using a fixed rate.

\text{lr}_t = \frac{\eta}{\sqrt{G_t + \epsilon}}

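**4. Parameter Update:** The model's parameters are updated by subtracting the product of the adaptive learning rate and the gradient at each step:

\theta_{t+1} = \theta_t - \text{lr}_t \cdot \nabla_{\theta} J(\theta)

**Where:**

- \eta is the global (initial) learning rate
- G_t is the sum of squared gradients accumulated for the parameter up to step t
- \epsilon is a small constant that prevents division by zero
- \theta_t is the parameter value at step t
- \nabla_{\theta} J(\theta) is the gradient of the loss J with respect to \theta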
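The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `adagrad_step`, the toy quadratic loss, the `scale` vector and the hyperparameter values are all assumptions chosen for the example.

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.1, eps=1e-8):
    """Apply one Adagrad update and return (new_theta, new_G)."""
    G = G + grad ** 2                    # accumulate squared gradients per parameter
    adaptive_lr = lr / np.sqrt(G + eps)  # lr_t = eta / sqrt(G_t + eps)
    theta = theta - adaptive_lr * grad   # theta_{t+1} = theta_t - lr_t * gradient
    return theta, G

# Hypothetical toy loss: J(theta) = 0.5 * sum(scale * theta**2).
# The two coordinates have very different curvature (1 vs 100).
scale = np.array([1.0, 100.0])

def loss(theta):
    return 0.5 * np.sum(scale * theta ** 2)

def grad_fn(theta):
    return scale * theta

theta = np.array([1.0, 1.0])   # fixed start instead of random, for reproducibility
G = np.zeros_like(theta)       # running sum of squared gradients
initial_loss = loss(theta)

for step in range(500):
    theta, G = adagrad_step(theta, grad_fn(theta), G)

print("initial loss:", initial_loss)
print("final loss:  ", loss(theta))
print("final theta: ", theta)
```

On the first step both coordinates move by roughly `lr`, even though their gradients differ by a factor of 100, because each gradient is divided by its own accumulated magnitude.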

Use Cases of Adagrad

Variants of Adagrad Optimizer

To overcome Adagrad’s rapidly decreasing learning rate, improved variants have been developed.
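To see how quickly the learning rate can shrink, the short sketch below evaluates \eta / \sqrt{G_t + \epsilon} under the assumption that a parameter receives a constant gradient of 1.0 at every step. The constant gradient and the values of `eta` and `eps` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

eta, eps = 0.1, 1e-8
grads = np.ones(10_000)                 # hypothetical: gradient of 1.0 at every step
G = np.cumsum(grads ** 2)               # G_t after each step: 1, 2, 3, ...
effective_lr = eta / np.sqrt(G + eps)   # Adagrad's per-step learning rate

for t in (1, 100, 10_000):
    print(f"step {t}: effective lr = {effective_lr[t - 1]:.6f}")
```

With a constant gradient the effective rate falls as \eta / \sqrt{t}: about 0.1 at step 1, 0.01 at step 100 and 0.001 at step 10,000. Because G_t never decreases, the rate cannot recover later in training.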

**1. RMSProp (Root Mean Square Propagation):**

RMSProp improves Adagrad by using an exponentially decaying average of squared gradients instead of accumulating them, preventing the learning rate from shrinking too quickly.

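**Formula:**

G_t = \gamma G_{t-1} + (1 - \gamma) (\nabla_{\theta} J(\theta))^2

**Where:**

- G_t is the exponentially decaying average of squared gradients at step t
- \gamma is the decay rate (commonly 0.9)
- \nabla_{\theta} J(\theta) is the gradient of the loss J with respect to \theta

**Parameter update:**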

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla_{\theta} J(\theta)
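A minimal NumPy sketch of the RMSProp update follows. The function name `rmsprop_step`, the default hyperparameters and the constant-gradient scenario are assumptions made for illustration; the point is only to show that the decaying average keeps the step size from shrinking toward zero.

```python
import numpy as np

def rmsprop_step(theta, grad, G, lr=0.01, gamma=0.9, eps=1e-8):
    """Apply one RMSProp update and return (new_theta, new_G)."""
    G = gamma * G + (1.0 - gamma) * grad ** 2     # decaying average of squared gradients
    theta = theta - lr / np.sqrt(G + eps) * grad  # same update form as Adagrad
    return theta, G

# Hypothetical scenario: the parameter sees a constant gradient of 1.0.
theta = np.array([0.0])
G = np.zeros_like(theta)
constant_grad = np.array([1.0])

for step in range(200):
    previous_theta = theta
    theta, G = rmsprop_step(theta, constant_grad, G)

last_step_size = np.abs(theta - previous_theta)[0]
print("G after 200 steps:", G[0])
print("size of last step:", last_step_size)
```

Under this constant gradient, G_t settles near 1.0 instead of growing without bound, so the step size stays close to `lr`, whereas Adagrad's step would keep decreasing.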

**2. AdaDelta**

AdaDelta is an improved version of Adagrad that avoids excessive accumulation of past gradients by using moving averages, leading to more stable and consistent updates.

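**Formula:**

\Delta \theta_{t+1} = - \frac{\sqrt{E[\Delta \theta^2]_{t} + \epsilon}}{\sqrt{E[(\nabla_{\theta} J(\theta))^2]_{t} + \epsilon}} \cdot \nabla_{\theta} J(\theta)

**Where:**

- \Delta \theta_{t+1} is the update applied to the parameters, so that \theta_{t+1} = \theta_t + \Delta \theta_{t+1}
- E[\Delta \theta^2]_{t} is the exponentially decaying average of past squared updates
- E[(\nabla_{\theta} J(\theta))^2]_{t} is the exponentially decaying average of squared gradients
- \epsilon is a small constant for numerical stability

Because the numerator replaces \eta, AdaDelta does not require a global learning rate.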
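The AdaDelta update can be sketched as below. This is an illustrative version under stated assumptions: the name `adadelta_step`, the decay rate `rho`, the value of `eps` and the toy loss are hypothetical choices, and \epsilon is placed inside both square roots, which is one common formulation.

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """Apply one AdaDelta update; returns (new_theta, avg_sq_grad, avg_sq_delta)."""
    # Decaying average of squared gradients (includes the current gradient)
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Step scaled by the ratio of past update size to gradient size; no learning rate
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # Decaying average of squared updates (used by the next step)
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2
    return theta + delta, avg_sq_grad, avg_sq_delta

# Hypothetical toy loss J(theta) = 0.5 * theta**2, so the gradient equals theta.
theta = np.array([1.0])
avg_sq_grad = np.zeros_like(theta)
avg_sq_delta = np.zeros_like(theta)
history = [theta[0]]

for step in range(100):
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(
        theta, theta, avg_sq_grad, avg_sq_delta)
    history.append(theta[0])

print("theta after 100 steps:", theta[0])
```

The early steps are small because the average of squared updates starts at zero and only \epsilon keeps the numerator from vanishing; the step size then grows as that average builds up.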

**3. Adam (Adaptive Moment Estimation)**

Adam is an optimization algorithm that combines the benefits of momentum and adaptive learning rates, making it robust and widely used in deep learning.

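Adam uses the following update rules, where m_t and v_t are exponentially decaying averages of the gradient and the squared gradient, and \hat{m}_t and \hat{v}_t are their bias-corrected versions: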

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{\theta} J(\theta)

v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_{\theta} J(\theta))^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
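The four Adam equations map directly onto code. The sketch below is a minimal NumPy version for illustration only: the function name `adam_step`, the toy quadratic loss and the learning rate used in the loop are assumptions, and the default hyperparameters are commonly used values rather than something this article specifies.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update at step t (counting from 1); returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment estimate m_t
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate v_t
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v

# Hypothetical toy loss J(theta) = 0.5 * sum(theta**2), so the gradient equals theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
initial_loss = 0.5 * np.sum(theta ** 2)

for t in range(1, 2001):
    theta, m, v = adam_step(theta, theta, m, v, t, lr=0.01)

final_loss = 0.5 * np.sum(theta ** 2)
print("initial loss:", initial_loss)
print("final loss:  ", final_loss)
```

Thanks to bias correction, the very first update has magnitude close to `lr` for every parameter regardless of the gradient's scale, since \hat{m}_1 equals the gradient and \sqrt{\hat{v}_1} equals its absolute value.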

Adagrad Optimizer Implementation

Below are examples of how to implement the Adagrad optimizer in TensorFlow and PyTorch.

1. TensorFlow Implementation

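In TensorFlow, Adagrad is already included in the API as tf.keras.optimizers.Adagrad. The example below loads MNIST, flattens and normalizes the images, builds a small fully connected classifier, and trains it with Adagrad for 5 epochs: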

```python
import tensorflow as tf
from tensorflow.keras.datasets import mnist

# Load MNIST; labels stay as integer class ids (0-9)
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each 28x28 image to a 784-vector and scale pixels to [0, 1]
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

# Small fully connected classifier
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Integer labels pair with the sparse categorical cross-entropy loss
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
```

**Output:**

[Image: training output of the TensorFlow implementation]

2. PyTorch Implementation

In PyTorch, Adagrad can be used with the torch.optim.Adagrad class. The example below loads MNIST, defines a small fully connected model, and trains it with Adagrad for 5 epochs:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Convert images to tensors and flatten each 28x28 image to a 784-vector
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))
])

train_dataset = datasets.MNIST(
    root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)  # raw logits; CrossEntropyLoss applies log-softmax


model = SimpleModel()

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adagrad(model.parameters(), lr=0.01)

for epoch in range(5):
    for data, target in train_loader:
        optimizer.zero_grad()           # clear gradients from the previous batch
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()                 # backpropagate
        optimizer.step()                # Adagrad parameter update
    print(f"Epoch {epoch+1} complete")
```

**Output:**

[Image: training output of the PyTorch implementation]

Adagrad is a good fit when features occur at very different frequencies, since each parameter gets its own learning rate. When its steadily shrinking learning rate slows training too much, variants such as RMSProp, AdaDelta and Adam are usually the better choice.

Advantages

Limitations