Knowledge Distillation

Last Updated : 23 Jul, 2025

Knowledge Distillation is a model compression technique in which a smaller, simpler model (the student) is trained to imitate the behavior of a larger, more complex model (the teacher). Instead of learning only from hard labels, the student also learns from soft targets, that is, the class probabilities produced by the teacher. This makes it practical to deploy deep models on edge devices such as mobile phones and IoT hardware. Let's explore Knowledge Distillation and how it works.

Figure: An illustrative teacher-student model for Knowledge Distillation

**Key Features of Knowledge Distillation**

  1. **Model Compression:** Reduces model size without much loss in accuracy.
  2. **Performance Retention:** The student retains much of the teacher's accuracy.
  3. **Faster Inference:** The lightweight student runs faster than the teacher. For example, DistilBERT is a compressed version of BERT built for fast inference in NLP tasks.
  4. **Soft Target Learning:** Uses the teacher's logits or softmax outputs in addition to, or instead of, hard labels.
  5. **Teacher-Student Framework:** Knowledge is transferred from teacher to student.
  6. **Regularization Effect:** Reduces overfitting in student models.

**Types of Knowledge Distillation**

Knowledge distillation is commonly grouped into three types according to the kind of knowledge that is transferred: response-based, feature-based, and relation-based.

**1. Response-based Distillation**

Figure: Response-based Distillation

This is the classic and most widely used form of knowledge distillation. It transfers the teacher's final-layer outputs to the student, typically as logits that are turned into softened probabilities. These outputs provide richer information than hard labels because they reflect the class similarities learned by the teacher.

**2. Feature-based Distillation**

Figure: Feature-based Distillation

Instead of using final outputs, this method transfers intermediate representations or feature maps from the teacher to the student. The idea is to guide the student to learn similar internal feature structures.
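As a rough illustration of the idea above, the sketch below compares one teacher feature vector with one student feature vector using mean squared error. It assumes the student's features are narrower than the teacher's, so they are first mapped through a linear projection. The names `project`, `feature_distillation_loss`, and `projection`, and all the numbers, are hypothetical and chosen only for this example.

```python
def project(features, weights):
    """Apply a linear map; weights has shape [out_dim][in_dim]."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def feature_distillation_loss(teacher_features, student_features, projection):
    """Mean squared error between teacher features and projected student features."""
    projected = project(student_features, projection)
    if len(projected) != len(teacher_features):
        raise ValueError("projection output must match the teacher feature size")
    squared = [(t - s) ** 2 for t, s in zip(teacher_features, projected)]
    return sum(squared) / len(squared)


# Hypothetical intermediate activations for a single input.
teacher_features = [0.8, -0.2, 0.5, 0.1]   # 4-dimensional teacher layer
student_features = [0.6, 0.3]              # 2-dimensional student layer

# Hypothetical 4x2 projection; in real training it would be learned.
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.0, 0.0],
]

loss = feature_distillation_loss(teacher_features, student_features, projection)
print(f"feature loss: {loss:.6f}")
```

In practice this term would be added to the student's training loss, and the projection would be trained together with the student rather than fixed by hand.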

**3. Relation-based Distillation**

Figure: Relation-based Distillation

This technique focuses on the relationships between different input samples, such as pairwise distances or similarities. The goal is to maintain the relative structure of the learned space.
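One way to picture this is to compare the pairwise distances between samples in the teacher's embedding space with the same distances in the student's space. The sketch below is a simplified, assumed formulation: it normalizes each distance matrix by its mean off-diagonal distance and penalizes the squared difference. The function names and the embeddings are hypothetical.

```python
import math


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of embeddings (Python 3.8+ for math.dist)."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]


def normalize(distances):
    """Divide by the mean off-diagonal distance so the overall scale is ignored."""
    n = len(distances)
    if n < 2:
        raise ValueError("need at least two samples")
    off_diagonal = [distances[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diagonal) / len(off_diagonal)
    if mean == 0:
        return distances  # all embeddings coincide; nothing to rescale
    return [[d / mean for d in row] for row in distances]


def relation_distillation_loss(teacher_embeddings, student_embeddings):
    """Mean squared difference between the normalized distance matrices."""
    t = normalize(pairwise_distances(teacher_embeddings))
    s = normalize(pairwise_distances(student_embeddings))
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Hypothetical embeddings of the same three samples in each model.
teacher_embeddings = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student_same_structure = [[0.0, 0.0], [3.0, 0.0], [0.0, 6.0]]   # same layout, scaled
student_other_structure = [[0.0, 0.0], [0.0, 6.0], [3.0, 0.0]]  # two samples swapped

print(relation_distillation_loss(teacher_embeddings, student_same_structure))
print(relation_distillation_loss(teacher_embeddings, student_other_structure))
```

Because only relative distances are compared, the teacher and student embeddings may have different dimensions, and a student whose space is a scaled copy of the teacher's incurs essentially no penalty.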

**Working of Knowledge Distillation**

Input Data --> Teacher (Large model) --> (Soft Targets / Logits) --> Student (Small model) --> Combined Loss (Distillation + True Label)

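**Terminologies**

  1. **Teacher:** The large, already trained model whose behavior is transferred.
  2. **Student:** The smaller model that is trained to imitate the teacher.
  3. **Soft Targets:** The class probabilities produced by the teacher, as opposed to hard one-hot labels.
  4. **Temperature (T):** A scaling factor applied to the logits before the softmax; values above 1 soften the distribution.
  5. **Distillation Loss:** The loss term that measures how far the student's softened outputs are from the teacher's.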

**Steps to Implement Knowledge Distillation**

A typical knowledge distillation workflow consists of the following steps:

  1. **Train the Teacher Model:** A large model (such as BERT or ResNet) is trained on the dataset.
  2. **Generate Soft Targets:** Pass the training data through the trained teacher and record its logits, which a temperature-scaled softmax turns into soft targets.
  3. **Train the Student Model:** The student learns from a combination of:
    • the hard-label loss, i.e. the cross-entropy between the student's predictions and the true labels
    • the distillation loss, i.e. the KL divergence between the teacher's and the student's softened outputs
    • optionally, the teacher's intermediate representations or the relationships between data samples, which the student may also mimic
  4. **Use a Temperature Parameter:** Softmax outputs are softened using a temperature parameter (T > 1), which raises the probabilities of incorrect classes that still carry meaningful knowledge.

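**Formula:**

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Where,

    • z_i is the logit (pre-softmax score) for class i
    • T is the temperature; T = 1 gives the standard softmax, and larger values give a softer distribution
    • q_i is the resulting softened probability for class i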
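To make the effect of the temperature concrete, here is a minimal sketch of the softmax formula above in plain Python. The helper name `softmax_with_temperature` and the example logits are assumptions made for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Compute q_i = exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum avoids overflow and does not change the result.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [5.0, 2.0, 0.5]  # hypothetical teacher logits for three classes

for T in (1.0, 4.0):
    probabilities = softmax_with_temperature(teacher_logits, T)
    print(T, [round(p, 3) for p in probabilities])
```

At T = 1 almost all of the probability mass sits on the first class, while at T = 4 the distribution is noticeably flatter, so the student can see how the teacher ranks the remaining classes.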

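**Note:** The student is trained by minimizing the combined loss function:

$$\text{Loss} = \alpha \cdot \text{CE}(y_{\text{true}}, y_{\text{student}}) + \beta \cdot \text{KL}(y_{\text{teacher}}, y_{\text{student}})$$

Where,

    • CE is the cross-entropy between the true labels y_true and the student's predictions y_student
    • KL is the Kullback-Leibler divergence between the teacher's and the student's temperature-softened outputs
    • α and β are weights that balance the two terms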
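The combined loss can be sketched for a single training example as follows. This is an illustrative plain-Python version, not a reference implementation: the function names, the default values `alpha=0.5`, `beta=0.5`, `temperature=4.0`, and the example logits are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(true_class, student_logits):
    """CE between a hard label and the student's ordinary (T = 1) probabilities."""
    return -math.log(softmax(student_logits)[true_class])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(true_class, teacher_logits, student_logits,
                      alpha=0.5, beta=0.5, temperature=4.0):
    """alpha * CE(true, student) + beta * KL(teacher, student) on softened outputs."""
    ce = cross_entropy(true_class, student_logits)
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    return alpha * ce + beta * kl_divergence(teacher_soft, student_soft)


teacher_logits = [4.0, 1.0, 0.0]   # hypothetical values
good_student = [3.5, 1.2, 0.1]     # close to the teacher
poor_student = [0.0, 1.0, 4.0]     # disagrees with teacher and label

print(distillation_loss(0, teacher_logits, good_student))
print(distillation_loss(0, teacher_logits, poor_student))
```

A student whose logits track the teacher's receives a much smaller loss than one that disagrees with both the teacher and the true label. Some formulations additionally multiply the KL term by T² so that its gradient magnitude stays comparable across temperatures; that scaling is omitted here to match the formula as written.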

This process helps the student model generalize better despite having fewer parameters. It is widely used in deep learning for model compression, efficiency, and deployment on edge devices.
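Putting the steps together, the following toy sketch trains a tiny linear student with plain gradient descent on the combined loss. Everything here is an assumption made for illustration: the teacher is represented only by precomputed logits, the student is a linear classifier, and the data, learning rate, and step count are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def student_logits(weights, bias, x):
    """Linear student: one weight row and one bias per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def loss_and_grads(weights, bias, data, alpha, beta, T):
    """Mean combined loss and its gradients w.r.t. weights and bias."""
    n_classes, n_features = len(weights), len(weights[0])
    grad_w = [[0.0] * n_features for _ in range(n_classes)]
    grad_b = [0.0] * n_classes
    total = 0.0
    for x, label, teacher_logits in data:
        z = student_logits(weights, bias, x)
        p_hard = softmax(z)                  # student probabilities at T = 1
        p_soft = softmax(z, T)               # student probabilities at temperature T
        target = softmax(teacher_logits, T)  # teacher soft targets
        ce = -math.log(p_hard[label])
        kl = sum(t * math.log(t / s) for t, s in zip(target, p_soft))
        total += alpha * ce + beta * kl
        for k in range(n_classes):
            # dCE/dz_k = p_k - 1[k == label];  dKL/dz_k = (s_k - t_k) / T
            g = (alpha * (p_hard[k] - (1.0 if k == label else 0.0))
                 + beta * (p_soft[k] - target[k]) / T)
            grad_b[k] += g
            for f in range(n_features):
                grad_w[k][f] += g * x[f]
    n = len(data)
    grad_w = [[g / n for g in row] for row in grad_w]
    grad_b = [g / n for g in grad_b]
    return total / n, grad_w, grad_b


def train_student(data, n_classes, n_features,
                  alpha=0.5, beta=0.5, T=4.0, lr=0.5, steps=200):
    """Full-batch gradient descent; returns parameters and the loss per step."""
    weights = [[0.0] * n_features for _ in range(n_classes)]
    bias = [0.0] * n_classes
    history = []
    for _ in range(steps):
        loss, grad_w, grad_b = loss_and_grads(weights, bias, data, alpha, beta, T)
        history.append(loss)
        for k in range(n_classes):
            bias[k] -= lr * grad_b[k]
            for f in range(n_features):
                weights[k][f] -= lr * grad_w[k][f]
    return weights, bias, history


# Each item: (input features, true label, hypothetical precomputed teacher logits).
data = [
    ([1.0, 0.0], 0, [4.0, 1.0, 0.0]),
    ([0.0, 1.0], 1, [0.5, 3.5, 1.0]),
    ([-1.0, -1.0], 2, [0.0, 1.0, 4.0]),
    ([0.9, 0.2], 0, [3.0, 2.0, 0.0]),
]

weights, bias, history = train_student(data, n_classes=3, n_features=2)
print(f"combined loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

A real setup would use a deep learning framework with automatic differentiation and mini-batches; the hand-written gradients here only serve to keep the example dependency-free.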

**Different Training Strategies of Knowledge Distillation**

Figure: Different Training Strategies of Knowledge Distillation

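**1. Offline Knowledge Distillation**

The teacher is trained first and then kept fixed while the student is trained on its outputs.

**2. Online Knowledge Distillation**

The teacher and the student are trained at the same time, so no pre-trained teacher is required.

**3. Self-Distillation**

The same network acts as both teacher and student, for example when its deeper layers supervise its shallower ones.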

**Applications of Knowledge Distillation**

Knowledge Distillation is used in scenarios where high performance is needed in resource-constrained environments. Some of its applications are:

  1. **Mobile & Edge AI:** Compress models for deployment on smartphones and IoT devices.
  2. **Faster Transformer Models:** Distilled transformers such as DistilBERT and TinyBERT.
  3. **Computer Vision:** Object detection and segmentation with smaller CNNs.
  4. **Speech Recognition:** Speed up large acoustic models.
  5. **Ensemble Compression:** Combine multiple models into one student.
  6. **Robotics:** Embed efficient models for fast decision-making.
  7. **Medical Imaging:** Deploy high-accuracy models under hardware constraints.
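For the case of combining multiple models into one student, one simple option is to average the teachers' softened probabilities and use the result as the soft target. The sketch below assumes that approach; the function name `ensemble_soft_targets`, the temperature, and the logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits_list, temperature=2.0):
    """Average the softened class probabilities of several teachers."""
    if not teacher_logits_list:
        raise ValueError("at least one teacher is required")
    all_probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    n_teachers = len(all_probs)
    n_classes = len(all_probs[0])
    return [sum(probs[k] for probs in all_probs) / n_teachers
            for k in range(n_classes)]


# Hypothetical logits from three teachers for the same input.
teachers = [
    [4.0, 1.0, 0.0],
    [3.0, 2.5, 0.0],
    [3.5, 0.5, 1.0],
]

targets = ensemble_soft_targets(teachers)
print([round(p, 3) for p in targets])
```

The averaged distribution can then replace a single teacher's soft targets in the distillation loss, so one student absorbs the combined opinion of the ensemble.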

**Advantages**

  1. Speeds up inference time.
  2. Reduces model size significantly.
  3. Improves generalization by transferring "dark knowledge" (class relationships).
  4. Makes deployment on edge devices feasible.
  5. Combines well with other techniques: multiple teacher models can be ensembled, and distillation can be used together with pruning.
  6. Smaller student models can be easier to inspect and interpret.

**Disadvantages**

  1. Requires a well-trained teacher model.
  2. Soft labels may not always provide useful guidance.
  3. The temperature and the loss weights are extra hyperparameters that must be tuned.
  4. Does not guarantee that the student retains the teacher's performance.
  5. Feature-based and relation-based distillation are more complex to set up.