Learning Rate Decay (original) (raw)

Last Updated : 23 Jul, 2025

When training a machine learning model, the learning rate plays a important role in determining how quickly the model adjusts its weights based on the errors it makes. If we start with a learning rate that's too high, the model might learn quickly but could overshoot the best solution. If it's too low, learning can become too slow and the model might get stuck before reaching an optimal solution.

To address this learning rate decay was introduced which helps us adjust the learning rate during training. We start with a higher rate which allows the model to make larger updates and learn faster. As training progresses and the model gets closer to an optimal solution, the learning rate decreases allowing for finer adjustments and better convergence.

Why Use Learning Rate Decay?

Working of Learning Rate Decay

Learning rate decay works similarly to driving toward a parking spot. Initially, we drive fast to cover more distance quickly but as we get closer to our destination, we slow down to park more accurately. In machine learning, this concept translates to starting with a larger learning rate to make faster progress in the beginning and then gradually reducing it to fine-tune the model’s weights in the later stages of training.

The decay is designed to allow the model to make large, broad adjustments early in training and more delicate adjustments as it approaches the optimal solution. This controlled approach helps the model converge more efficiently without overshooting or getting stuck.

There are several methods to implement learning rate decay each with a different approach to how the learning rate decreases over time. Some methods decrease the learning rate in discrete steps while others reduce it more smoothly. The choice of decay method can depend on the task, model and how quickly the learning rate needs to be reduced during training.

Common Types of Learning Rate Decay

There are various methods to reduce the learning rate each has a different approach to the process:

**1. Step Decay: In step decay, the learning rate is reduced by a fixed factor after a predetermined number of epochs. This method is simple but effective.

**Formula: lr = lr_{initial} * drop \;rate^{\frac{epoch}{step\;size}}

**2. Exponential Decay: It reduces the learning rate exponentially at each epoch, leading to a smooth decrease.

**Formula: lr = lr_{initial}\; * \; e^{-decay\;rate \;* \;epoch}

**3. Inverse Time Decay: A factor inversely proportional to the number of epochs is used to reduce the learning rate through inverse decay.

**Formula: lr = lr_{initial} \; * \; \frac{1}{1 + decay * epoch}

**4. Polynomial Decay: It decreases the learning rate based on a polynomial function of the epoch number. This offers a more controlled reduction over time.

**Formula: lr = lr_{initial} \; * \; \left ( 1 - \frac{epoch}{max\;epoch}\right )^{power}

Mathematical Representation of Learning Rate Decay

Understanding the mathematical foundation behind learning rate decay helps clarify how the learning rate is adjusted over time. A basic learning rate decay plan can be mathematically represented as follows:

Assume that the starting learning rate is \eta_{0} and that the learning rate at epoch t is \eta_{t}.

A typical decay schedule for learning rates is based on a constant decay rate \alpha where \alpha \epsilon (0,1), applied at regular intervals (e.g every n epochs). The formula for this basic decay is:

\eta_{t} = \frac{\eta_{0}}{1 + \alpha \cdot t}

Where:

In this equation:

This schedule provides the optimization by helping the model to converge more quickly at first with larger learning steps and then fine-tuning in smaller increments as it approaches the local minimum.

Implementing Learning Rate Decay

Here we will see how to implement learning rate decay in TensorFlow while building a neural network for classification on the MNIST dataset (a dataset of handwritten digits).

**1. Importing Libraries

We will be using TensorFlow for building and training the model, Keras which is a high-level API within TensorFlow for defining, training and evaluating models and Numpy libraries for this implementation.

Python `

import tensorflow as tf from tensorflow.keras.datasets import mnist from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense, Flatten from tensorflow.keras.optimizers import SGD from tensorflow.keras.callbacks import LearningRateScheduler import numpy as np

`

**2. Loading the MNIST Data

The MNIST dataset contains images of handwritten digits and we load it using **TensorFlow’s mnist.load_data() function. The dataset is then split into training and testing sets. Each pixel value is divided by 255.0 to normalize it into a range of 0 to 1.

Python `

(x_train, y_train), (x_test, y_test) = mnist.load_data() x_train, x_test = x_train / 255.0, x_test / 255.0

`

**3. Building the Model

We create a Sequential model using Keras. The model consists of:

model = Sequential([ Flatten(input_shape=(28, 28)), Dense(128, activation='relu'), Dense(10, activation='softmax')])

`

**4. Setting up Learning Rate Decay

We define a learning rate schedule that decreases the learning rate over time. Here, we use Exponential Decay where:

initial_lr = 0.1 lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay( initial_learning_rate=initial_lr, decay_steps=1000,
decay_rate=0.96,
staircase=True)

`

**5. Compiling the Model

Now we compile the model with:

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr_schedule), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

`

**6. Callback to Print Learning Rate

We define a custom callback to print the learning rate at the start of each epoch. This will help us track how the learning rate changes during training due to the decay schedule.

Python `

class PrintLR(tf.keras.callbacks.Callback): def on_epoch_begin(self, epoch, logs=None): opt = self.model.optimizer if callable(opt.learning_rate): lr = opt.learning_rate(opt.iterations) else: lr = opt.learning_rate print(f"Epoch {epoch+1:02d}: learning rate = {tf.keras.backend.get_value(lr):.5f}")

`

7. Training the Model

The model is trained for 15 epochs. The training process uses the **x_train and **y_train datasets and we validate the model on the **x_test and **y_test datasets.

Python `

model.fit(x_train, y_train, epochs=15, validation_data=(x_test, y_test), callbacks=[PrintLR()], verbose=2)

`

**Output:

decay-1

Training the model

8. Evaluating the Model

After training, we evaluate the model’s performance on the test set. The test loss and accuracy are displayed to assess how well the model generalizes to unseen data.

Python `

test_loss, test_accuracy = model.evaluate(x_test, y_test, verbose=2) print('\nTest accuracy:', test_accuracy)

`

**Output:

decay2

Test accuracy

9. Result

Python `

current_lr = lr_schedule(model.optimizer.iterations) print(f"Current learning rate: {current_lr.numpy()}")

`

**Output:

Current learning rate: 0.031885575503110886

The learning rate will remain constant during the early epochs and gradually decrease according to the exponential decay schedule.

Advantages of Learning Rate Decay

Deep learning and machine learning models are frequently trained using the learning rate decay technique. It provides a number of benefits that support more effective and efficient training including:

Disadvantages of Learning Rate Decay

Despite many benefits to learning rate decay, it's important to be see of any potential drawbacks and difficulties while using it. Considerations and disadvantages are as follows:

By mastering learning rate decay, we can significantly improve our model's training efficiency, stability and generalization, ultimately leading to better performance and more reliable results.