Deep Learning Interview Questions

Last Updated : 4 Oct, 2025

Deep Learning is a field of AI that trains multi-layered neural networks to learn from data. It is widely used in applications like vision, speech and NLP. This article covers key Deep Learning interview questions to help you revise core concepts and advanced topics.

1. What is the difference between Deep Learning and Machine Learning?

| Aspect | Machine Learning (ML) | Deep Learning (DL) |
| --- | --- | --- |
| Definition | Algorithms that learn from data and make predictions | Subset of ML using multi-layered neural networks |
| Feature Engineering | Requires manual feature extraction | Learns features automatically from raw data |
| Data Requirement | Works well with smaller datasets | Needs large amounts of data for training |
| Training Time | Relatively faster | Computationally expensive and slower |
| Interpretability | Easier to interpret and explain | Harder to interpret (acts like a “black box”) |
| Applications | Fraud detection, recommendations | Image recognition, NLP, self-driving cars |

2. What are the different types of Neural Networks?

There are different types of neural networks used in deep learning. Some of the most important neural network architectures are as follows:

  1. Feedforward Neural Networks (FFNNs)
  2. Convolutional Neural Networks (CNNs)
  3. Recurrent Neural Networks (RNNs)
  4. Long Short-Term Memory Networks (LSTMs)
  5. Gated Recurrent Units (GRU)
  6. Autoencoder Neural Networks
  7. Generative Adversarial Networks (GANs)
  8. Transformers
  9. Deep Belief Networks (DBNs)

3. What is a Neural Network and Artificial Neural Network (ANN)?

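A Neural Network is a computational model inspired by the human brain where nodes (neurons) are connected to process and transfer information. An Artificial Neural Network (ANN) is the basic implementation of this concept in machines. It consists of an input layer, one or more hidden layers and an output layer, with weighted connections between the neurons of adjacent layers.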

ANNs are widely used in deep learning for applications like image recognition, speech analysis and natural language processing.

Figure: Artificial neural network

4. How are biological neurons similar to artificial neural networks?

Artificial Neural Networks (ANNs) are inspired by how biological neurons work in the human brain.

Figure: Biological neurons to artificial neurons

5. What are Weights and Biases in Neural Networks?

Example: For a neuron with inputs x_1, x_2, weights w_1, w_2 and bias b, the output before activation is:

z = (w_1 \cdot x_1) + (w_2 \cdot x_2) + b

Then an activation function like sigmoid or ReLU is applied on z to get the final output.
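
The computation above can be sketched in a few lines of plain Python. The input, weight and bias values below are illustrative assumptions, not values from the article, and `neuron_output` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum plus bias (z), followed by a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, sigmoid(z)

# Illustrative values: z = 0.5*1.0 + (-0.25)*2.0 + 0.1
z, a = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(z, a)
```

Swapping `sigmoid` for `max(0.0, z)` would give the ReLU variant mentioned above.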

6. How are weights initialized in Neural Networks?

Weight initialization is a crucial step in training neural networks. The goal is to set the initial weights in a way that allows the network to learn efficiently and converge to a good solution. Several methods are commonly used:

7. What is an Activation Function and how does it work in Neural Networks?

An Activation Function is a mathematical function applied to the output of a neuron. It decides whether the neuron should be activated (pass information) or not.

How it works:

  1. Each neuron calculates a weighted sum of its inputs plus a bias.
  2. The activation function is applied to this value.
  3. It introduces non-linearity, which allows the network to learn complex patterns instead of just simple linear relationships.

Common Activation Functions:

Without activation functions, a neural network would act like a simple linear regression model and fail to learn complex tasks like image recognition or NLP.
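
The claim about linearity can be checked numerically. The sketch below (NumPy, with arbitrary random weights chosen only for illustration) shows that two stacked layers without an activation are equivalent to a single linear layer, and that inserting a ReLU breaks this equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # 4 samples, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked layers with no activation in between.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as one linear layer.
W_eq, b_eq = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W_eq + b_eq
print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the single-layer rewrite no longer matches.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```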

8. What are the different types of Activation Functions used in Deep Learning?

9. What are the different layers in a Neural Network?

A neural network is made up of multiple layers, each having a specific role in processing data:

1. Input Layer:

2. Hidden Layers:

3. Output Layer:

10. What is a Perceptron or a Single Layer Neural Network?

A Perceptron is the simplest type of artificial neural network model, introduced by Frank Rosenblatt in 1958. It is a single-layer neural network used for binary classification problems.

Formula:

y = f\Big(\sum (w_i \cdot x_i) + b\Big)

where f is the activation function, x_i are inputs, w_i are weights and b is the bias.

Example: A perceptron can solve simple yes/no problems, such as checking whether a number is greater than a threshold.
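
A minimal sketch of such a threshold check, assuming a step activation and hand-picked (not learned) parameters w = 1 and b = -5, so the perceptron outputs 1 when the input is at least 5:

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias):
    """y = f(sum(w_i * x_i) + b) with a step activation f."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# With w = 1 and b = -5 the perceptron answers "is x at least 5?".
for x in [3, 5, 8]:
    print(x, perceptron([x], weights=[1.0], bias=-5.0))
```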

11. What is a Multilayer Perceptron and how is it different from a Single-Layer Perceptron?

A Multilayer Perceptron (MLP) is an extension of the simple perceptron that contains one or more hidden layers between the input and output layers. It is a type of feedforward neural network and is widely used in deep learning.

Structure:

Example: Handwriting recognition, image classification and speech recognition.

12. How are the number of hidden layers and neurons per hidden layer selected?

There is no fixed rule for selecting the number of hidden layers and neurons. They are chosen based on the complexity of the problem and are usually tuned experimentally.

Number of Hidden Layers

Number of Neurons per Layer

In real projects, hidden layers and neurons are usually chosen by trial and error, cross-validation or automated methods like hyperparameter tuning.

13. What is the difference between Shallow Networks and Deep Networks?

1. Shallow Networks

2. Deep Networks

14. Why are Neural Networks Called Black Boxes?

Neural networks are often referred to as black boxes because their internal workings are not easily interpretable. While they can learn complex patterns and make highly accurate predictions, it is usually difficult to understand how inputs are transformed into outputs.

15. What are Feedforward Neural Networks?

A Feedforward Neural Network is the simplest type of artificial neural network, in which data flows in only one direction: from the input layer, through the hidden layers, to the output layer.

16. Are ANN, Single Layer Perceptron and Feedforward Neural Network the same?

They are related concepts but not exactly the same:

In short we can say that:

17. What is forward and backward propagation?

Forward Propagation:

Backward Propagation (Backpropagation):

18. What is the cost function in deep learning?

A Cost Function in deep learning measures the difference between the predicted output of the model and the actual target values. It helps the network learn by guiding weight and bias updates during backpropagation. The aim is to minimize the cost so that predictions get closer to the actual results.

Commonly used cost functions:

19. What is Binary Cross-Entropy, Categorical Cross-Entropy and Sparse Categorical Cross-Entropy?

1. Binary Cross-Entropy (BCE)

2. Categorical Cross-Entropy (CCE)

3. Sparse Categorical Cross-Entropy (SCCE)
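
As a rough illustration of how the three losses relate, here is a per-sample sketch in plain Python using their standard definitions. The function names and the example probabilities are illustrative assumptions; framework implementations additionally average over a batch and guard against log(0).

```python
import math

def binary_cross_entropy(y_true, p):
    """y_true is 0 or 1, p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(one_hot, probs):
    """one_hot is a one-hot label vector, probs is a softmax output."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_cross_entropy(class_index, probs):
    """Same loss as CCE, but the label is an integer class index."""
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]                      # illustrative softmax output
cce = categorical_cross_entropy([0, 1, 0], probs)
scce = sparse_categorical_cross_entropy(1, probs)
bce = binary_cross_entropy(1, 0.7)
print(cce, scce, bce)
```

CCE and SCCE give the same value here; they differ only in how the label is encoded (one-hot vector versus integer index).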

20. How do neural networks learn from the data?

Neural networks learn from data through an iterative process of forward propagation, error calculation and backpropagation.

  1. Forward Propagation: Input data passes through the network and each neuron applies weights, biases and activation functions to produce an output.
  2. Error Calculation: The network compares its output with the actual target values using a cost function to measure the difference (error).
  3. Backpropagation: The error is propagated backward through the network. Gradients of the cost function with respect to weights and biases are calculated using the chain rule.
  4. Weight and Bias Update: Optimization algorithms like Gradient Descent update the weights and biases to reduce the error.
  5. Iteration: Steps 1–4 are repeated over multiple epochs until the network achieves minimal error and can generalize well on new data.

Neural networks learn by adjusting their parameters (weights and biases) to minimize the error between predicted and actual outputs.
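
The loop described above can be sketched with a single sigmoid neuron in NumPy. The toy task (logical OR), the learning rate and the number of epochs are illustrative assumptions; a real network would have more layers and would typically rely on a framework for the gradients.

```python
import numpy as np

# Toy data: learn the logical OR of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)
b = 0.0
lr = 0.5  # learning rate (illustrative value)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for epoch in range(2000):
    # 1. Forward propagation.
    p = sigmoid(X @ w + b)
    # 2. Error calculation (binary cross-entropy).
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    losses.append(loss)
    # 3. Backpropagation: gradients of the loss w.r.t. w and b (chain rule).
    grad_z = (p - y) / len(y)
    grad_w = X.T @ grad_z
    grad_b = grad_z.sum()
    # 4. Weight and bias update (gradient descent).
    w -= lr * grad_w
    b -= lr * grad_b

predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
print(losses[0], losses[-1], predictions)
```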

21. What is Gradient Descent and its Variants?

Gradient Descent is an optimization algorithm used in neural networks to minimize the cost function by iteratively updating the weights and biases in the opposite direction of the gradient. The gradient indicates the slope of the error surface and a parameter called the learning rate (lr) controls the size of each step taken. A large Learning Rate can overshoot the minimum while a very small Learning Rate can make training extremely slow. By moving against the gradient with an appropriate Learning Rate, the error is gradually reduced until it reaches a minimum.

Figure: Gradient Descent

The gradient of the cost function with respect to each parameter is calculated and the parameters are updated using the formula:

\theta = \theta - \eta \cdot \frac{\partial L}{\partial \theta}

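where \theta is the parameter being updated (a weight or bias), \eta is the learning rate and \frac{\partial L}{\partial \theta} is the gradient of the loss L with respect to \theta.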
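
To make the update rule and the effect of the learning rate concrete, here is a small sketch on a toy loss L(theta) = (theta - 3)^2. The loss and the three learning rates are assumptions picked for illustration.

```python
def gradient_descent(grad, theta, lr, steps):
    """Repeatedly apply theta = theta - lr * dL/dtheta."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2, so dL/dtheta = 2 * (theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)

good = gradient_descent(grad, theta=0.0, lr=0.1, steps=100)     # ends close to 3
slow = gradient_descent(grad, theta=0.0, lr=0.001, steps=100)   # still far from 3
bad = gradient_descent(grad, theta=0.0, lr=1.1, steps=100)      # overshoots and diverges
print(good, slow, bad)
```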

Variants of Gradient Descent:

1. Batch Gradient Descent:

2. Stochastic Gradient Descent (SGD):

3. Mini-Batch Gradient Descent:

4. Gradient Descent with Momentum:

5. Adaptive Methods:

22. Define the learning rate in Deep Learning.

The learning rate (lr) is a hyperparameter in deep learning that controls how much the model’s weights are adjusted during each update step in training. It determines the size of the step taken in the direction opposite to the gradient of the loss function.

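General weight update rule:

w = w - \eta \nabla L(w)

where w is the weight being updated, \eta is the learning rate and \nabla L(w) is the gradient of the loss with respect to w.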

23. Difference between Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent?

| Aspect | Batch Gradient Descent | Stochastic Gradient Descent (SGD) | Mini-Batch Gradient Descent |
| --- | --- | --- | --- |
| Data Used per Update | Entire training dataset | One training example | Small subset (batch) of training data |
| Speed | Very slow for large datasets | Very fast per update | Faster than batch, slower than SGD |
| Convergence | Stable and accurate | Noisy, may fluctuate | More stable than SGD and less costly than batch |
| Memory Requirement | Very high (needs whole dataset in memory) | Very low (one sample at a time) | Moderate (depends on batch size) |
| Practical Usage | Rarely used for large-scale tasks | Rarely used alone in deep learning | Most widely used in practice |

24. Explain Adagrad, RMSProp and Adam Optimizer.

1. Adagrad (Adaptive Gradient Algorithm)

It adjusts the learning rate for each parameter based on how frequently it is updated. It works well with sparse data like in NLP or recommendation systems.

Working:

The learning rate keeps decreasing and eventually becomes very small and hence training may stop too early.

2. RMSProp (Root Mean Square Propagation)

It fixes Adagrad’s problem of an ever-shrinking learning rate by using a moving average of squared gradients instead of their sum. It works well for non-stationary problems such as training RNNs on sequence tasks.

Working:

We need to carefully tune hyperparameters like learning rate and decay factor.

3. Adam (Adaptive Moment Estimation)

It combines the benefits of Momentum and RMSProp. It is a widely used default optimizer in deep learning, offering fast convergence and working well with large datasets and many parameters.

Working:

Sometimes it can lead to overfitting or poor generalization if not tuned.
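
The three update rules can be compared side by side for a single scalar parameter. This is a sketch of the commonly published formulas; the function names, the `state` dictionary and the hyperparameter values are illustrative assumptions rather than any library's API.

```python
import math

def adagrad_step(w, g, state, lr=0.1, eps=1e-8):
    state["G"] = state.get("G", 0.0) + g * g                   # running SUM of squared gradients
    return w - lr * g / (math.sqrt(state["G"]) + eps)

def rmsprop_step(w, g, state, lr=0.01, rho=0.9, eps=1e-8):
    state["E"] = rho * state.get("E", 0.0) + (1 - rho) * g * g  # moving AVERAGE instead of sum
    return w - lr * g / (math.sqrt(state["E"]) + eps)

def adam_step(w, g, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g      # first moment (momentum-like)
    v = state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g  # second moment (RMSProp-like)
    m_hat = m / (1 - b1 ** t)                                     # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def minimize(step, w=1.0, steps=500, **kw):
    """Minimize the toy loss L(w) = w**2, whose gradient is 2w."""
    state = {}
    for _ in range(steps):
        w = step(w, 2.0 * w, state, **kw)
    return w

for step, lr in [(adagrad_step, 0.1), (rmsprop_step, 0.01), (adam_step, 0.01)]:
    print(step.__name__, minimize(step, lr=lr))
```

All three end near the minimum at w = 0 on this toy problem; the learning rates were picked by hand for the demonstration.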

25. What is Momentum-based Gradient Descent?

Momentum-based Gradient Descent is an optimization method that accelerates learning by adding a fraction of the previous update (velocity) to the current gradient. This reduces oscillations and helps the model converge faster.

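The update rule is:

v = \beta v - \eta \nabla L(w)

w = w + v

where v is the velocity (the accumulated update), \beta is the momentum coefficient (commonly around 0.9), \eta is the learning rate and \nabla L(w) is the gradient of the loss with respect to w.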

This allows faster convergence, especially in ravines or areas with steep slopes in one direction and flat in another.
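
A minimal sketch of the two update equations on a toy one-dimensional loss L(w) = w^2. The loss, starting point and hyperparameters are illustrative assumptions.

```python
def momentum_descent(grad, w, lr=0.1, beta=0.9, steps=200):
    """Apply v = beta * v - lr * grad(w), then w = w + v, for a number of steps."""
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w

# Toy loss L(w) = w^2 with gradient 2w; the minimum is at w = 0.
w_final = momentum_descent(lambda w: 2.0 * w, w=5.0)
print(w_final)
```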

26. What is the Vanishing and Exploding Gradient Problem?

In deep neural networks, during backpropagation, gradients are propagated backward through many layers. Depending on how the weights and activations behave, gradients can either become extremely small (vanish) or extremely large (explode).

Vanishing Gradient:

Exploding Gradient:

Solutions:

27. What is Gradient Clipping?

Gradient Clipping is a technique used to prevent the exploding gradient problem during training of deep neural networks. When gradients become too large, weight updates can be unstable, causing the model to diverge. With gradient clipping, we set a threshold value and if the gradient exceeds this value, it is scaled down to stay within the limit.

Two common approaches:
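
Two widely used variants are clipping by value (clamp each component) and clipping by norm (rescale the whole gradient vector). The NumPy sketch below uses illustrative function names; deep learning frameworks ship their own equivalents.

```python
import numpy as np

def clip_by_value(grad, limit):
    """Clamp every component to the range [-limit, limit]."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0])          # L2 norm is 5
print(clip_by_value(g, 1.0))       # [ 1. -1.]
print(clip_by_norm(g, 1.0))        # [ 0.6 -0.8]
```

Clipping by norm keeps the direction of the gradient, while clipping by value can change it.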

28. Define Epoch, Iterations and Batches.

1. Batch:

2. Iteration:

3. Epoch:

29. How to Avoid Overfitting in Neural Networks?

Overfitting happens when a neural network memorizes the training data instead of learning general patterns. Some common techniques to reduce overfitting are:

30. What is Dropout and Early Stopping in Neural Networks?

1. Dropout

2. Early Stopping

31. What is Data Augmentation and Its Techniques?

Data Augmentation is a technique used to artificially increase the size and diversity of a training dataset by applying various transformations to existing data. It helps neural networks generalize better and reduces overfitting.

Common Techniques in Image Data Augmentation:

For Text or NLP:

For Time-Series Data:

32. What is Batch Normalization?

Batch Normalization (BN) is a technique used in neural networks to normalize the inputs of each layer so that they have a consistent distribution during training. This reduces the problem of internal covariate shift, where the input distribution to a layer keeps changing as previous layers update, which can slow down training.

How it works:
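
In outline, the training-time computation normalizes each feature over the batch and then applies a learnable scale (gamma) and shift (beta). The NumPy sketch below assumes a 2D input of shape (batch, features); at inference time, frameworks use running averages of the mean and variance instead, which this sketch omits.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Two features on very different scales (illustrative values).
x = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
out = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0), out.std(axis=0))  # roughly 0 and 1 per feature
```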

33. What is CNN (Convolutional Neural Network)?

A Convolutional Neural Network (CNN) is a type of deep learning model mainly used for image recognition, computer vision and pattern detection. Unlike traditional neural networks, CNNs automatically detect important features (edges, textures, shapes) from raw data without manual feature engineering.

Key parts of CNN:

Applications: Image classification, face recognition, self-driving cars, medical imaging, NLP (with 1D convolutions), etc.

34. What do you mean by Convolution?

Example:

35. What is a kernel?

A kernel, also called a filter, is a small matrix of weights (for example 3×3 or 5×5) used in the convolution operation of a CNN. It slides over the input (such as an image) to extract features. At each position, the kernel is multiplied element-wise with the part of the input it overlaps and the results are summed to give one value of the feature map.

Suppose you have a 3×3 kernel:

K = \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}
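
This kernel responds to vertical edges (bright on the left, dark on the right). The sketch below applies it to a toy 5×5 image with stride 1 and no padding; the image and the helper name `conv2d_valid` are illustrative assumptions. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])

# Toy 5x5 image: bright left part (1s), dark right part (0s).
image = np.array([[1, 1, 1, 0, 0]] * 5)
feature_map = conv2d_valid(image, K)
print(feature_map)  # every row is [0, 3, 3]: strong response where the edge is
```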

36. Define stride.

Stride is the step size of the kernel (filter) as it moves across the input matrix during convolution: it decides how far the filter shifts between positions. By default, stride = 1, meaning the kernel shifts one cell at a time (both horizontally and vertically).

Suppose you have a 5×5 image and a 3×3 kernel:
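
For this 5×5 image and 3×3 kernel, the size of the output along each dimension follows the standard formula floor((n + 2p - k) / s) + 1. A small sketch (the helper name is an assumption):

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3, stride=1))  # 3 -> a 3x3 feature map
print(conv_output_size(5, 3, stride=2))  # 2 -> a 2x2 feature map
```

A larger stride therefore produces a smaller feature map.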

37. What is Pooling Layer and its different types?

The Pooling Layer is used in Convolutional Neural Networks (CNNs) to reduce the size of feature maps while keeping the important information. It speeds up computation, helps reduce overfitting and keeps the dominant features. Pooling works by sliding a small window (like 2×2) over the feature map and summarizing the values inside it.

Types of Pooling:

1. Max Pooling

2. Min Pooling

3. Average Pooling

4. Global Pooling
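
The pooling types listed above differ only in how each window is summarized. A NumPy sketch with non-overlapping 2×2 windows on an illustrative 4×4 feature map (the helper `pool2d` is a hypothetical name):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    reduce = {"max": np.max, "min": np.min, "avg": np.mean}[mode]
    return reduce(windows, axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]])
print(pool2d(fmap, mode="max"))   # window maxima: 6, 8, 3, 4
print(pool2d(fmap, mode="min"))   # window minima: 1, 2, 1, 0
print(pool2d(fmap, mode="avg"))   # window means: 3.75, 5.25, 2.0, 2.0
print(fmap.mean())                # global average pooling: one value per map
```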

38. What is Padding in CNN?

In CNNs, Padding means adding extra rows and columns (usually zeros) around the input matrix before applying convolution. It is used:

Types of Padding:

1. Valid Padding (No Padding)

2. Same Padding (Zero Padding)

3. Full Padding

4. Reflective / Replication Padding (less common)
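
The difference between zero, reflective and replication padding is easiest to see on a tiny input. The sketch below uses NumPy's `np.pad` with a padding width of 1; the 2×2 input is an illustrative assumption.

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

zero_padded = np.pad(x, 1, mode="constant")  # zero padding
reflected = np.pad(x, 1, mode="reflect")     # mirror without repeating the edge
replicated = np.pad(x, 1, mode="edge")       # replicate the edge values

print(zero_padded)
print(reflected)
print(replicated)
```

With a 3×3 kernel and stride 1, one ring of padding like this keeps the output the same size as the input ("same" padding), while no padding ("valid") shrinks it by 2 in each dimension.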

39. What is the difference between object detection and image segmentation?

| Feature | Object Detection | Image Segmentation |
| --- | --- | --- |
| Definition | Identifies objects and their locations in an image. | Assigns a class label to each pixel in the image. |
| Output | Bounding boxes with class labels. | Pixel-wise classification mask (colored regions). |
| Granularity | Object-level understanding. | Pixel-level understanding (detailed boundaries). |
| Use Cases | Face detection, pedestrian detection, vehicle detection. | Medical imaging, self-driving cars, satellite image analysis. |
| Algorithms | YOLO, SSD, Faster R-CNN, R-CNN. | U-Net, Mask R-CNN, DeepLab, FCN. |

40. What are Recurrent Neural Networks (RNNs) and how do they work?

A Recurrent Neural Network (RNN) is a type of neural network designed to work with sequential data like text, speech, time series or video frames. Unlike traditional neural networks, RNNs have a memory of previous inputs, which helps them capture context and dependencies in sequences.

h_t = f(Uh_{t-1} + Wx_t + b)

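Where h_t is the hidden state at time step t, h_{t-1} is the previous hidden state, x_t is the current input, b is a bias vector and f is an activation function such as tanh.

And the output of the RNN at each time step will be:

y_t = g(Vh_t + c)

Where y_t is the output at time step t, c is a bias vector and g is the output activation function (for example, softmax).

Here, W, U, V, b and c are the learnable parameters, which are optimized during backpropagation.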
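
The two equations can be turned into a short forward pass. This NumPy sketch assumes f = tanh and takes g to be the identity; the layer sizes and random weights are illustrative only.

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c):
    """Run h_t = tanh(U h_{t-1} + W x_t + b), y_t = V h_t + c over a sequence."""
    h = np.zeros(U.shape[0])          # initial hidden state h_0
    hs, ys = [], []
    for x_t in xs:                    # the same U, W, V are reused at every step
        h = np.tanh(U @ h + W @ x_t + b)
        hs.append(h)
        ys.append(V @ h + c)          # g is the identity in this sketch
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
hidden, n_in, n_out, T = 4, 3, 2, 5   # illustrative sizes
U = rng.normal(scale=0.5, size=(hidden, hidden))
W = rng.normal(scale=0.5, size=(hidden, n_in))
V = rng.normal(scale=0.5, size=(n_out, hidden))
b, c = np.zeros(hidden), np.zeros(n_out)

hs, ys = rnn_forward(rng.normal(size=(T, n_in)), U, W, V, b, c)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```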

RNN Working

Figure: RNN

1. Sequential Processing:

2. Hidden State (Memory):

3. Weights Sharing:

4. Output:

41. How does Backpropagation Through Time work in RNNs?

Backpropagation Through Time (BPTT) is the method used to train Recurrent Neural Networks (RNNs) by applying the backpropagation algorithm across time steps. It allows the network to learn temporal dependencies in sequential data.

BPTT working:

1. Forward Pass:

2. Unrolling the RNN:

3. Error Calculation:

4. Backward Pass (Through Time):

5. Weight Update:

6. Iteration:

BPTT enables the RNN to remember past information and adjust its weights to better predict future elements in a sequence, making it ideal for tasks like language modeling, time-series forecasting and speech recognition.

42. What are the Vanishing and Exploding Gradient problems in traditional RNNs?

Vanishing Gradient Problem

The network forgets earlier inputs, leading to poor performance on long sequences.

Exploding Gradient Problem

The loss may diverge and the network fails to learn.

Solutions:

43. What is LSTM and how does it work?

LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN) designed to mitigate the vanishing and exploding gradient problems in traditional RNNs. It is especially good at learning long-term dependencies in sequential data.

Unlike RNNs, which have a single hidden state, LSTMs have a memory cell that can retain information over long sequences. This memory is controlled by gates that regulate the flow of information.

Figure: LSTM model architecture

Key Components of an LSTM Cell:

1. Cell State (C):

2. Hidden State (h):

3. Forget Gate (f):

4. Input Gate (i):

5. Output Gate (o):

LSTM Working:

  1. Forget step: Decide what to remove from the previous cell state.
  2. Input step: Decide what new information to add to the cell state.
  3. Update cell state: Combine the forget and input steps to update the memory.
  4. Output step: Decide what information to output to the next cell or layer.
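
The four steps map onto one LSTM time step as sketched below in NumPy. Stacking the four gate weight matrices into a single matrix `W` acting on the concatenation [h_prev, x] is an assumption made here for brevity; the sizes and random weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x] to the four stacked gate pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n])            # forget gate: what to drop from c_prev
    i = sigmoid(z[n:2 * n])        # input gate: how much new information to add
    g = np.tanh(z[2 * n:3 * n])    # candidate values for the cell state
    o = sigmoid(z[3 * n:4 * n])    # output gate: what to expose as h
    c = f * c_prev + i * g         # update cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_hidden, n_in = 3, 2              # illustrative sizes
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, W, b)
print(h, c)
```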

44. What is BiRNN and BiLSTM?

1. BiRNN (Bidirectional RNN)

2. BiLSTM (Bidirectional LSTM)

45. What is GRU and how does it work?

GRU (Gated Recurrent Unit) is a type of Recurrent Neural Network (RNN) similar to LSTM but with a simpler architecture. It is designed to capture long-term dependencies in sequential data while being computationally more efficient than LSTM.

Unlike LSTM, which has three gates (input, forget, output), a GRU has two gates and no separate cell state. The hidden state serves as both memory and output.

Figure: GRU model architecture

Key Components of a GRU Cell:

1. Update Gate (z):

2. Reset Gate (r):

3. Candidate Hidden State (\tilde{h}_t):

4. Final Hidden State (h):

GRU Working:

  1. Reset Gate: Decide which past information to forget.
  2. Candidate Hidden State: Compute new information based on the reset-modified previous hidden state and current input.
  3. Update Gate: Decide how much of the candidate hidden state to keep and how much of the previous hidden state to retain.
  4. Hidden State Update: Combine previous hidden state and candidate to produce the current hidden state.
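
One GRU time step following these four steps can be sketched in NumPy as below. Biases are omitted, each weight matrix acts on a concatenation of hidden state and input, and the update gate z is taken to weight the candidate state; some references swap the roles of z and 1 - z, so treat that choice as an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step; each W acts on a concatenation [h, x] (biases omitted)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                                      # update gate
    r = sigmoid(Wr @ hx)                                      # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                     # blend old state and candidate

rng = np.random.default_rng(0)
n_hidden, n_in = 3, 2                                         # illustrative sizes
Wz, Wr, Wh = (rng.normal(scale=0.5, size=(n_hidden, n_hidden + n_in)) for _ in range(3))
h = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):
    h = gru_step(x, h, Wz, Wr, Wh)
print(h)
```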


46. Difference between RNN, LSTM and GRU

| Feature | RNN | LSTM | GRU |
| --- | --- | --- | --- |
| Architecture | Single hidden state | Hidden state + cell state | Single hidden state with gating |
| Gates | None | 3 gates: input, forget, output | 2 gates: update, reset |
| Memory | Limited memory, struggles with long-term dependencies | Can retain long-term dependencies | Can retain long-term dependencies |
| Gradient Problem | Prone to vanishing/exploding gradients | Mitigates vanishing/exploding gradients | Mitigates vanishing/exploding gradients |
| Computational Complexity | Low | Higher due to more gates | Lower than LSTM |
| Performance on Long Sequences | Poor | Good | Comparable to LSTM, often faster |
| Use Cases | Short sequences, simple tasks | Long sequences, NLP, time-series | Long sequences, NLP, time-series, faster training |

47. What is the Transformer model?

The Transformer is a neural network architecture that relies on the attention mechanism to efficiently capture long-range dependencies in sequences. Unlike traditional RNNs, it processes sequences in parallel which makes it faster and more effective for tasks in NLP such as machine translation, text summarization, question answering and word embedding.

Key Components of the Transformer:

1. Self-Attention Mechanism:

2. Encoder-Decoder Architecture:

3. Multi-Head Attention:

4. Positional Encoding:

5. Feed-Forward Neural Networks:

6. Layer Normalization and Residual Connections:

48. What is Attention Mechanism?

The Attention Mechanism is a technique in neural networks that allows the model to focus on the most relevant parts of the input sequence when making predictions. Instead of treating all inputs equally, attention assigns different weights to different parts of the input, helping the model capture important dependencies more effectively.

How It Works:

1. Assigning Weights:

2. Weighted Sum:

3. Context Vector:
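
These three steps can be illustrated with scaled dot-product attention, the form used in Transformers. In this NumPy sketch the queries, keys and values are all the raw token embeddings (self-attention without learned projections), which is a simplifying assumption; real models first apply learned linear maps to obtain Q, K and V.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return the context vectors and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of each key to each query
    weights = softmax(scores)           # step 1: assign weights (each row sums to 1)
    context = weights @ V               # steps 2-3: weighted sum gives the context vector
    return context, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))             # 4 tokens, 8-dimensional embeddings
context, weights = scaled_dot_product_attention(X, X, X)
print(weights.sum(axis=1), context.shape)
```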

49. What are different types of attention mechanisms?

1. Global (Soft) Attention:

2. Local (Hard or Windowed) Attention:

3. Self-Attention:

4. Scaled Dot-Product Attention:

5. Multi-Head Attention:

50. What is Positional Encoding?

Positional Encoding is a technique used in Transformer models to provide information about the order of tokens in a sequence. Since Transformers process all input tokens in parallel (unlike RNNs), they have no inherent sense of sequence order. Positional encoding solves this by adding position-specific information to the token embeddings.
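
One common scheme is the fixed sinusoidal encoding from the original Transformer paper; learned position embeddings are another option. The text above does not say which variant is meant, so the sinusoidal choice in this NumPy sketch is an assumption, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=8)
embeddings = np.zeros((10, 8))                 # placeholder token embeddings
inputs = embeddings + pe                       # position information is added, not concatenated
print(pe.shape, pe[0])
```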

51. What are Layer Normalization and Residual Connections?

1. Layer Normalization (LayerNorm):

2. Residual Connections (Skip Connections):

52. What are Tokens and Embeddings?

1. Tokens

Example:

2. Embeddings

Example:

53. What is an Encoder-Decoder network in Deep Learning?

An Encoder-Decoder network is a neural network architecture that learns to map an input sequence to an output sequence, which may have a different length and structure. It is used in Machine Translation, Text Summarization, Chatbots and Image Captioning. It consists of two main components: encoder and decoder.

1. Encoder

2. Decoder

Training

54. What is an Autoencoder?

An Autoencoder is a type of neural network designed to learn efficient representations of data (encoding) by training the network to reconstruct its input at the output. It is commonly used for dimensionality reduction, feature learning, Anomaly Detection and data denoising.

Key Components:

1. Encoder:

2. Latent Space:

3. Decoder:

55. What are the different types of Autoencoders?

1. Vanilla (Basic) Autoencoder:

2. Denoising Autoencoder (DAE):

3. Sparse Autoencoder:

4. Variational Autoencoder (VAE):

5. Convolutional Autoencoder (CAE):

6. Contractive Autoencoder (CAE):

56. What is a Variational Autoencoder (VAE)?

A Variational Autoencoder (VAE) is a type of probabilistic autoencoder that learns to model the underlying probability distribution of the input data. Unlike a standard autoencoder that maps inputs to a fixed latent vector, a VAE maps inputs to a distribution in the latent space, allowing it to generate new, realistic data samples by sampling from this distribution.

57. What is a Seq2Seq Model?

A Sequence-to-Sequence (Seq2Seq) model is a type of neural network architecture designed to map an input sequence to an output sequence, where the lengths of the input and output may differ. It is widely used in natural language processing (NLP) tasks such as machine translation, text summarization, Speech Recognition and chatbots.

Key Components:

1. Encoder:

2. Decoder:

3. Attention Mechanism (Optional but Common):

58. What is a Generative Adversarial Network (GAN)?

A Generative Adversarial Network (GAN) is a type of neural network architecture used to generate realistic data that resembles a given dataset. It consists of two neural networks, a Generator and a Discriminator, which compete with each other in a game-like setup. This competition pushes the generator to produce increasingly realistic outputs.

1. Generator:

2. Discriminator:

How It Works:

The generator and discriminator are trained simultaneously:

This adversarial training continues until the generator produces data that the discriminator cannot reliably distinguish from real data. It is used in image generation, data augmentation, etc.

59. What are the different types of Generative Adversarial Networks (GANs)?

1. Vanilla GAN (Basic GAN):

2. Conditional GAN (cGAN):

3. Deep Convolutional GAN (DCGAN):

4. Wasserstein GAN (WGAN):

5. Least Squares GAN (LSGAN):

6. CycleGAN:

7. Progressive GAN (PGGAN):

8. StyleGAN:

60. What is StyleGAN?

StyleGAN is a type of Generative Adversarial Network (GAN) designed for high-quality image generation with fine-grained control over features of the generated images. It was developed by NVIDIA and is widely known for producing photorealistic human faces and other high-resolution images.

1. Style-Based Generator:

2. Adaptive Instance Normalization (AdaIN):

3. Separation of Features:

4. High-Resolution Image Generation:

61. What is Transfer Learning and Fine-Tuning?

1. Transfer Learning:

2. Fine-Tuning:

Example:

  1. Take a CNN pre-trained on ImageNet.
  2. Freeze the first few convolutional layers.
  3. Retrain the last few layers on a dataset of dog breeds.

62. What is the Difference Between Transfer Learning and Fine-Tuning?

| Aspect | Transfer Learning | Fine-Tuning |
| --- | --- | --- |
| Definition | Using a pre-trained model on a new, related task without modifying its internal weights much. | Adapting a pre-trained model to a new task by retraining some or all layers on the new dataset. |
| Training | Usually, only the final layer(s) are trained for the new task. | Some layers are frozen and others are retrained to learn task-specific features. |
| Purpose | Leverage existing general features learned from a large dataset. | Adjust the model to better fit the specifics of the new dataset. |
| When Used | When the new dataset is small or similar to the original dataset. | When the new task is related but slightly different and requires more adaptation. |
| Example | Using ImageNet-pretrained CNN to classify cats vs dogs by just replacing the final layer. | Using ImageNet-pretrained CNN, freezing early layers and retraining later layers on a small dog breed dataset. |