Vision Transformer (ViT) Architecture

Last Updated : 20 Dec, 2025

Vision Transformer (ViT) is a deep learning architecture that applies the Transformer model to images. Instead of relying on convolutions, ViTs use self-attention to capture relationships across all image patches, enabling a global understanding of the image. This approach has achieved state-of-the-art results in various computer vision tasks.

Vision Transformer (ViT) Architecture Overview

Instead of processing words, ViT treats an image as a sequence of fixed-size patches and applies self-attention across them. This allows the model to capture long-range dependencies between different parts of an image without relying on convolution operations.

Figure: ViT Architecture

ViT architecture includes the following major components:

1. Image Patching and Embedding

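This stage converts a 2D image into a sequence of patch embeddings, analogous to tokens in NLP. An image of height H and width W is split into N = HW / P^2 non-overlapping patches of size P × P, and each patch is flattened and linearly projected to a D-dimensional embedding. For example, a 224 × 224 image with P = 16 yields 196 patches. The resulting sequence forms the input to the Transformer.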

Figure: Patch Splitting

Patch embeddings can also be extracted using a convolution layer with kernel size and stride equal to the patch size, making each convolution act as a patch extractor.
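The patching step can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper name `patchify`, the 224 × 224 input, the patch size of 16, the embedding size of 768 and the random weights are all assumptions chosen for the example. The last lines check, for one patch location, that applying the same weights as a convolution kernel with stride equal to the patch size gives the same embedding.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (N, patch_size * patch_size * C),
    where N = (H // patch_size) * (W // patch_size).
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must tile evenly"
    gh, gw = H // patch_size, W // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (N, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))             # illustrative input size
P, D = 16, 768                                         # illustrative patch size and embedding dim

patches = patchify(image, P)                           # shape (196, 768)
W_embed = rng.standard_normal((P * P * 3, D)) * 0.02   # random stand-in for learned weights
b_embed = np.zeros(D)
patch_embeddings = patches @ W_embed + b_embed         # shape (196, D)

# Same weights viewed as a P x P convolution kernel, applied at one
# stride-P window (grid row 2, column 5) of the image.
kernel = W_embed.reshape(P, P, 3, D)
row, col = 2, 5
window = image[row * P:(row + 1) * P, col * P:(col + 1) * P, :]
conv_out = np.tensordot(window, kernel, axes=([0, 1, 2], [0, 1, 2])) + b_embed
patch_index = row * (224 // P) + col                   # position in the flattened sequence
```

Here `conv_out` and `patch_embeddings[patch_index]` agree up to floating-point error, which is why a strided convolution is a common way to implement patch embedding.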

2. Positional Encoding

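Self-attention has no built-in notion of order: shuffling the input patches simply shuffles the outputs in the same way. Positional encodings are therefore added to the patch embeddings to inject spatial order, so the model knows where in the image each patch came from.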

Figure: Positional Encoding
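As a rough sketch of the idea, the snippet below adds one position vector per sequence slot to the patch embeddings. It assumes a learned position table (represented here by random values), illustrative sizes N = 196 and D = 768, and leaves out the CLS token for brevity. Two patches with identical content are planted to show that they become distinguishable once position is added.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768                          # illustrative: 14 x 14 patches, 768-dim embeddings

patch_embeddings = rng.standard_normal((N, D))
# Give two patches identical content to see the effect of position.
patch_embeddings[10] = patch_embeddings[3]

# One row per sequence position; random stand-in for a learned table.
pos_embed = rng.standard_normal((N, D)) * 0.02

tokens = patch_embeddings + pos_embed    # element-wise sum, shape stays (N, D)

same_before = np.allclose(patch_embeddings[3], patch_embeddings[10])
same_after = np.allclose(tokens[3], tokens[10])
```

Before the addition the two patches are indistinguishable (`same_before` is true); afterwards they differ (`same_after` is false), so attention layers can tell them apart.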

3. Adding the Classification Token (CLS Token)

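A learnable CLS token is prepended to the patch sequence to aggregate information from all patches, and its final encoder output serves as the image-level representation for classification. In the original ViT, the token is prepended before the positional embeddings are added, so it receives its own position embedding and the sequence length becomes N + 1.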
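A minimal sketch of prepending the token, under the assumption that the position table has N + 1 rows so that the CLS slot gets its own entry. The zero-valued token and the random tables are stand-ins for learned parameters, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768                                   # illustrative sizes

patch_embeddings = rng.standard_normal((N, D))
cls_token = np.zeros((1, D))                      # learnable in practice; zeros as a stand-in
pos_embed = rng.standard_normal((N + 1, D)) * 0.02  # one extra row for the CLS slot

# Prepend the CLS token, then add position information to every slot.
sequence = np.concatenate([cls_token, patch_embeddings], axis=0)  # shape (N + 1, D)
sequence = sequence + pos_embed

# Row 0 is the CLS slot; after the encoder, this row is read out
# as the image-level representation.
cls_slot = sequence[0]
```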

4. Transformer Encoder (Pre-LayerNorm Architecture)

Figure: Transformer Encoder

Pre-LayerNorm applies LayerNorm before both the attention and feed-forward blocks rather than after them. This stabilizes gradient flow and helps mitigate exploding or vanishing gradients in deep Transformers.

\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \odot \gamma + \beta

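where \mu and \sigma are the mean and standard deviation of x computed over the embedding dimension of each token, and \gamma and \beta are learnable scale and shift parameters.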
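The formula above can be sketched in NumPy as follows. The function name `layer_norm` is hypothetical, and the small constant `eps` is an assumption that does not appear in the formula; it is commonly added to avoid division by zero.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each token (row) over its embedding dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # eps is an added safeguard, not part of the formula in the text.
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
D = 8
x = rng.standard_normal((4, D)) * 5.0 + 3.0   # 4 tokens with non-zero mean and large scale
gamma, beta = np.ones(D), np.zeros(D)          # identity scale and shift
normed = layer_norm(x, gamma, beta)
```

With `gamma` set to ones and `beta` to zeros, each row of `normed` has approximately zero mean and unit standard deviation.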

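Each encoder block consists of multi-head self-attention, a feed-forward network, and residual connections with layer normalization around both, as described in the following sections.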

5. Multi-Head Self-Attention (MSA)

MSA allows each patch to attend to every other patch, so the model can capture global dependencies between distant image regions.

**1.** Self-Attention Mechanism

Self-attention enables each patch to relate to all others by using query, key and value projections with the attention matrix controlling token influence. The input sequence consists of N image patches plus 1 CLS token, with each token represented by a D-dimensional embedding.

Compute Queries, Keys and Values

Q = XW_Q, K = XW_K, V = XW_V

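where W_Q, W_K, W_V are learnable weight matrices for linear projections.

The attention output is then a weighted sum of the values, with weights given by a softmax over the scaled query-key dot products, where d_k is the dimension of each key vector:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V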
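A small NumPy sketch of this computation for a single head is given below. The function names, the tiny sizes (5 tokens, D = 16, d_k = 8) and the random weights are assumptions made for illustration only.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one head.

    X has shape (n_tokens, D); returns the output and the attention matrix.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n_tokens, n_tokens), rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, D, d_k = 5, 16, 8                      # illustrative sizes
X = rng.standard_normal((n_tokens, D))
W_Q, W_K, W_V = (rng.standard_normal((D, d_k)) * 0.1 for _ in range(3))
out, weights = self_attention(X, W_Q, W_K, W_V)
```

Row i of `weights` shows how strongly token i attends to every token in the sequence, including itself.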

**2.** Multi-Head Attention

Multiple attention heads allow the model to attend to different types of information simultaneously. The outputs of all heads are concatenated and linearly projected to form the final attention output. This parallel attention mechanism leads to richer and more diverse feature representations.

\text{MSA}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}_O

Here h is the number of heads. Each head can focus on a different type of relationship (e.g., edges, colors, textures, global shapes) and is computed as:

\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_Q^i, \mathbf{X}\mathbf{W}_K^i, \mathbf{X}\mathbf{W}_V^i)
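The two equations above can be sketched as follows. In this illustration the per-head matrices W_Q^i, W_K^i, W_V^i are stored as column blocks of single D × D matrices, which is a common implementation choice and an assumption here, as are the function names, sizes and random weights.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """Multi-head self-attention on X of shape (n, D).

    Column block i of W_Q, W_K and W_V plays the role of head i's projection.
    """
    n, D = X.shape
    assert D % num_heads == 0, "D must be divisible by the number of heads"
    d_head = D // num_heads

    def split(M):
        # (n, D) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    heads = softmax(scores) @ V                             # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, D)         # Concat(head_1, ..., head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, D, h = 6, 32, 4                                          # illustrative sizes
X = rng.standard_normal((n, D))
W_Q, W_K, W_V, W_O = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))
out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads=h)
```

The output has the same shape as the input, (n, D), which is what allows the residual connection around the attention block.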

6. Feed-Forward Network (FFN)

The FFN transforms each patch embedding to a higher-dimensional space and back using two dense layers with a GELU activation, enabling complex feature learning. It operates independently on each token with shared weights, allowing efficient non-linear transformations.

\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \text{GELU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2

The first layer expands the features into a wider hidden space for better expressiveness, and the second projects them back to the embedding dimension. GELU provides a smooth non-linearity, which can improve learning and stability.
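A minimal sketch of the FFN follows. It makes several assumptions: tokens are stored as rows, so the products are written `X @ W1` rather than W_1 x; GELU uses the common tanh approximation rather than the exact erf form; and the hidden layer is 4 times the embedding size, which is an illustrative choice.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU (the exact form uses the Gaussian CDF)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def ffn(X, W1, b1, W2, b2):
    """Position-wise feed-forward network applied to each row of X."""
    return gelu(X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, D = 5, 16
hidden = 4 * D                                   # illustrative expansion factor
X = rng.standard_normal((n, D))
W1, b1 = rng.standard_normal((D, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, D)) * 0.1, np.zeros(D)

out = ffn(X, W1, b1, W2, b2)                     # shape (n, D)
single = ffn(X[2:3], W1, b1, W2, b2)             # token 2 processed on its own
```

Because the same weights are applied to each token independently, `single[0]` matches `out[2]`: no information moves between tokens in this layer.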

7. Residual Connections and Layer Normalization

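Residual connections and layer normalization ensure stable training in deep networks: the residual path preserves information across layers, and LayerNorm keeps activations on a consistent scale.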
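Using the notation from the previous sections, one Pre-LayerNorm encoder block can be written as the two residual updates below. This is a sketch of the standard pre-norm formulation, with \mathbf{X} the block input, \mathbf{X}' an intermediate result and \mathbf{Y} the block output; these symbol names are chosen here for illustration.

```latex
\mathbf{X}' = \mathbf{X} + \text{MSA}(\text{LayerNorm}(\mathbf{X}))

\mathbf{Y} = \mathbf{X}' + \text{FFN}(\text{LayerNorm}(\mathbf{X}'))
```

In each line the input is added back unchanged, so the block only has to learn a correction to its input rather than a full transformation.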

8. Classification Head (MLP Head)

The classification head converts the CLS token output of the final encoder block into class probabilities using a small feed-forward network followed by a softmax.
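As a rough sketch, the simplest version of such a head is a single linear layer followed by a softmax. The function name, the choice of one layer, the 10 classes and the random weights are all assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def classification_head(encoder_output, W_head, b_head):
    """Map the CLS row of the encoder output to class probabilities."""
    cls_output = encoder_output[0]            # row 0 holds the CLS token
    logits = cls_output @ W_head + b_head     # shape (num_classes,)
    return softmax(logits)

rng = np.random.default_rng(0)
n_tokens, D, num_classes = 5, 16, 10          # illustrative sizes
encoder_output = rng.standard_normal((n_tokens, D))
W_head = rng.standard_normal((D, num_classes)) * 0.1
b_head = np.zeros(num_classes)

probs = classification_head(encoder_output, W_head, b_head)
predicted_class = int(np.argmax(probs))
```

Only row 0 of the encoder output is used; the patch rows influence the prediction solely through the attention layers that fed information into the CLS token.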

9. Training Vision Transformers

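ViTs need more data than CNNs because they have weak inductive biases and must learn spatial structure from the data itself. Training therefore usually involves pretraining on a large dataset followed by fine-tuning on the target task.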

Vision Transformer (ViT) vs. Convolutional Neural Networks (CNNs)

Here we compare ViTs with CNNs:

| Features | CNNs | ViTs |
| --- | --- | --- |
| Attention Scope | Capture local features via convolutions | Capture global relationships via self-attention |
| Inductive Bias | Strong biases (locality, translation invariance) | Minimal biases, more flexible but data-hungry |
| Data Requirement | Work well with small datasets | Need large datasets for best performance |
| Feature Learning | Learn hierarchical features | Learn context-rich, long-range features |

Advantages

Limitations