Swin Transformer

Last Updated : 13 Aug, 2025

The Swin Transformer (Shifted Window Transformer) is a vision transformer that divides an image into small, non-overlapping windows and computes self-attention within each window. Unlike standard vision transformers, which apply global attention across all patches, Swin Transformer uses a "shifted window" technique: the window grid is offset in alternating layers so that neighboring windows exchange information. This lets the model capture both local and global features at a cost that grows linearly with image size rather than quadratically.

Architecture and Working of Swin Transformer

The Swin Transformer’s architecture combines a hierarchical design with window-based self-attention for efficient feature extraction.

[Figure: Hierarchical Design of Swin Transformer]

**Here's how it works:**

**Patch Splitting:** The input image is divided into fixed-size, non-overlapping patches, like laying a grid over the image so that each square is one patch. Each patch is then embedded into a feature vector, and these vectors form the input to the transformer.
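As a rough illustration of patch splitting, the NumPy sketch below cuts an image into non-overlapping 4×4 patches and flattens each one into a vector. The function name `split_into_patches` and the random input are assumptions made for this example; the actual model follows this step with a learned linear projection, which is omitted here.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Expose the patch grid: (rows, patch_h, cols, patch_w, C)
    grid = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    # Put the two grid axes first, then flatten each patch into one vector
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # stand-in for a real image
patches = split_into_patches(image, patch_size=4)
print(patches.shape)  # (3136, 48): a 56 x 56 grid of patches, 4 * 4 * 3 values each
```

With a 224×224 input and a patch size of 4, the grid has 56×56 positions, which matches the `patch4` and `224` parts of the checkpoint name used later in this article.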

**Window-Based Self-Attention:** Instead of computing attention globally, the model groups patches into local windows and applies self-attention only within each window. Each window acts as a small, focused region that captures fine local features, and because the window size is fixed, the cost grows linearly with the number of patches rather than quadratically.
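To put numbers on the saving, the sketch below evaluates the cost estimates given in the Swin Transformer paper for global attention, 4hwC² + 2(hw)²C, and for window attention, 4hwC² + 2M²hwC. The function names and the chosen sizes (a 56×56 patch grid, 96 channels, 7×7 windows) are assumptions picked for illustration.

```python
def global_attention_cost(h, w, c):
    """Estimated cost of global self-attention over h * w tokens of width c."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * tokens**2 * c

def window_attention_cost(h, w, c, m):
    """Estimated cost when attention is restricted to m x m windows."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * m**2 * tokens * c

h, w = 56, 56   # patch grid of the first stage (assumed)
c = 96          # channel width (assumed)
m = 7           # window size (assumed)

global_cost = global_attention_cost(h, w, c)
window_cost = window_attention_cost(h, w, c, m)
print(f"global attention: {global_cost:,}")
print(f"window attention: {window_cost:,}")
print(f"ratio: {global_cost / window_cost:.1f}x")
```

The second term is what changes: it is quadratic in the number of tokens for global attention but linear for window attention, so the gap widens as resolution increases.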

**Shifted Windows for Cross-Region Interaction:** Window-based attention alone cannot relate patches that fall in different windows. To address this limitation, alternating layers shift the window grid by half the window size, so each shifted window straddles the boundaries of several windows from the previous layer. This enables cross-window communication and improves the model's ability to capture global context.
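The effect of the shift can be seen on a toy grid. The sketch below labels every position of an 8×8 map with the id of its regular 4×4 window, rolls the map by half a window with `np.roll`, and partitions it again. The helper `window_partition` and the grid sizes are assumptions for this example, and the attention computation itself is left out.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W) map into (num_windows, window_size, window_size)."""
    h, w = x.shape
    x = x.reshape(h // window_size, window_size, w // window_size, window_size)
    return x.transpose(0, 2, 1, 3).reshape(-1, window_size, window_size)

size, window_size = 8, 4
shift = window_size // 2

# Label every position with the id of the regular window that contains it
rows, cols = np.indices((size, size))
window_ids = (rows // window_size) * (size // window_size) + cols // window_size
regular_windows = window_partition(window_ids, window_size)

# Cyclic shift: roll the map up and left by half a window, then partition again
shifted = np.roll(window_ids, shift=(-shift, -shift), axis=(0, 1))
shifted_windows = window_partition(shifted, window_size)

for i, win in enumerate(shifted_windows):
    members = sorted(set(win.ravel().tolist()))
    print(f"shifted window {i} contains positions from regular windows {members}")
```

Each shifted window now holds positions from several regular windows, which is what lets information cross window borders. Because the roll wraps around the edges, some positions grouped together are not true spatial neighbors; the reference implementation masks attention between such positions, a detail this sketch does not model.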

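**Hierarchical Design:** The Swin Transformer processes the image in stages. Between stages, a patch-merging layer combines each 2×2 group of neighboring patches, which halves the spatial resolution and increases the channel dimension, so later stages see progressively larger regions of the image.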

Combining local self-attention within windows with hierarchical processing makes the model scalable to high-resolution images without excessive computation. It can be used for tasks such as image classification, object detection and segmentation.
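The hierarchical stages are connected by a patch-merging step that concatenates each 2×2 group of neighboring patches and projects the result to twice the original channel width. The NumPy sketch below traces the feature-map shapes this produces; the starting shape (56×56×96) and the random projection weights are assumptions standing in for a real feature map and learned parameters, and the normalization applied before the projection in the actual layer is omitted.

```python
import numpy as np

def patch_merging(x, weight):
    """Merge each 2x2 group of patches: (H, W, C) -> (H/2, W/2, 2C)."""
    # Gather the four members of every 2x2 neighborhood
    top_left = x[0::2, 0::2]
    bottom_left = x[1::2, 0::2]
    top_right = x[0::2, 1::2]
    bottom_right = x[1::2, 1::2]
    # Concatenate along channels (4C), then project down to 2C
    merged = np.concatenate([top_left, bottom_left, top_right, bottom_right], axis=-1)
    return merged @ weight

rng = np.random.default_rng(0)
x = rng.standard_normal((56, 56, 96))  # assumed first-stage feature map
shapes = [x.shape]
for _ in range(3):
    c = x.shape[-1]
    weight = rng.standard_normal((4 * c, 2 * c)) * 0.02  # random stand-in for learned weights
    x = patch_merging(x, weight)
    shapes.append(x.shape)

print(shapes)  # [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

Each merge quarters the number of tokens while doubling the channel width, which gives the pyramid of feature maps that detection and segmentation heads typically rely on.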

Implementation of Swin Transformer

Let's implement image classification with a pre-trained Swin Transformer step by step.

Step 1. Setup Environment

Install the necessary libraries:

```python
!pip install transformers datasets torch torchvision
```

Step 2. Import Libraries

Import the following libraries:

```python
from transformers import AutoImageProcessor, SwinForImageClassification
from datasets import load_dataset
import torch
```

Step 3. Load Pre-Trained Model

Define the model name and load the pre-trained Swin Transformer model along with its image processor:

```python
model_name = "microsoft/swin-tiny-patch4-window7-224"
image_processor = AutoImageProcessor.from_pretrained(model_name)
model = SwinForImageClassification.from_pretrained(model_name)
```

**Output:** [Figure: Loading Pre-trained Model]

Step 4. Load Dataset

Load the CIFAR-10 dataset, focusing on a subset for testing:

```python
dataset = load_dataset("cifar10", split="test[:8]")
```

**Output:** [Figure: Loading Dataset]

Step 5. Extract Images and Labels

Extract the images and corresponding true labels:

```python
images = [item["img"] for item in dataset]
labels = [item["label"] for item in dataset]
```

Step 6. Preprocess Images

Preprocess the images using the AutoImageProcessor to prepare them as tensors:

```python
inputs = image_processor(images, return_tensors="pt").to(model.device)
```
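To make the preprocessing step less opaque, here is a rough NumPy sketch of the kind of work an image processor does: resize, rescale to the range 0 to 1, normalize each channel, and move channels first. The mean and standard deviation below are the commonly used ImageNet statistics and are assumed here rather than read from the checkpoint's configuration; nearest-neighbor upsampling stands in for the interpolation a real processor uses, and the function name `preprocess` is hypothetical.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])  # assumed, not read from the checkpoint
IMAGENET_STD = np.array([0.229, 0.224, 0.225])   # assumed, not read from the checkpoint

def preprocess(image_uint8, target_size=224):
    """Turn an (H, W, 3) uint8 image into a (3, target_size, target_size) float array."""
    h, w, _ = image_uint8.shape
    assert target_size % h == 0 and target_size % w == 0
    # Nearest-neighbor upsampling by an integer factor (stand-in for real resizing)
    resized = np.repeat(image_uint8, target_size // h, axis=0)
    resized = np.repeat(resized, target_size // w, axis=1)
    scaled = resized.astype(np.float32) / 255.0          # rescale to [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD  # per-channel normalization
    return normalized.transpose(2, 0, 1).astype(np.float32)  # channels first

rng = np.random.default_rng(0)
cifar_like = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # random 32 x 32 image
pixel_values = preprocess(cifar_like)
print(pixel_values.shape, pixel_values.dtype)  # (3, 224, 224) float32
```

CIFAR-10 images are 32×32, so they are enlarged by a factor of 7 to reach the 224×224 input size this checkpoint expects.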

Step 7. Classify Images

Set the model to evaluation mode and classify the images:

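```python
model.eval()  # switch off training-only behavior such as dropout
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
    logits = outputs.logits
```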

Step 8. Process Predictions

Get the predicted labels from the model’s output logits:

```python
predicted_labels = logits.argmax(dim=-1).cpu().numpy()
```
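Taking the argmax discards how confident the model was. As a hedged illustration, the sketch below converts logits to probabilities with a softmax and lists the two most likely classes per image. The logits are made-up numbers for 2 images and 5 classes, not real model output, and the helper `softmax` is written out only for this example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits for 2 images and 5 classes (not real model output)
logits = np.array([
    [2.0, 0.5, -1.0, 0.1, 1.5],
    [0.0, 3.0, 0.2, -0.5, 1.0],
])

probs = softmax(logits)
top2 = np.argsort(-probs, axis=-1)[:, :2]  # indices of the two most likely classes

for i, idx in enumerate(top2):
    print(f"image {i}: classes {idx.tolist()}, probabilities {probs[i, idx].round(3).tolist()}")
```

The top-1 index from the softmax is always the same as the argmax of the raw logits, so this changes what is reported, not which class is predicted.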

Step 9. Handle Label Mismatches

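The checkpoint used here predicts ImageNet-1k classes, so its label space (1,000 classes) does not match CIFAR-10's 10 labels. The code below detects the mismatch and folds the predicted indices into the range 0 to 9 with a modulo. This is only a placeholder that keeps later indexing valid; it does not align class meanings:

```python
num_classes = len(model.config.id2label)
# Compare against the dataset's full label set, not only the labels present in this small subset
num_dataset_classes = dataset.features["label"].num_classes
if num_classes != num_dataset_classes:
    print("Warning: Model label space does not match CIFAR-10 labels. Mapping may be required.")
    # Placeholder mapping: keeps indices in range but is not semantically meaningful
    class_mapping = {i: i % num_dataset_classes for i in range(num_classes)}
    predicted_labels = [class_mapping[int(label)] for label in predicted_labels]
```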
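A modulo mapping only forces indices into a valid range. One possible alternative, when fine-tuning is not an option, is to match on label names instead. The sketch below is a minimal illustration of that idea: the keyword sets, the function `to_cifar10` and the `id2label` excerpt (including its indices) are all hypothetical and are not taken from the checkpoint.

```python
import re

# Hypothetical keyword sets; a real mapping would need to cover far more label names
CIFAR10_KEYWORDS = {
    "airplane": {"airliner", "warplane"},
    "automobile": {"car", "convertible", "limousine"},
    "cat": {"cat"},
    "dog": {"retriever", "terrier", "beagle"},
    "frog": {"frog", "bullfrog"},
    "ship": {"ship", "liner"},
}

def to_cifar10(label_name):
    """Return the CIFAR-10 class whose keywords appear in label_name, else None."""
    # Match whole words so that "catamaran" does not count as "cat"
    words = set(re.findall(r"[a-z]+", label_name.lower()))
    for cifar_class, keywords in CIFAR10_KEYWORDS.items():
        if words & keywords:
            return cifar_class
    return None

# Hypothetical excerpt in the style of model.config.id2label (indices are placeholders)
id2label = {
    0: "tabby, tabby cat",
    1: "sports car, sport car",
    2: "container ship",
    3: "tree frog",
    4: "catamaran",
    5: "golden retriever",
}

for idx, name in id2label.items():
    print(f"{idx}: {name} -> {to_cifar10(name)}")
```

Labels that match no keyword return `None`, which makes unmapped predictions visible instead of silently assigning them to an unrelated class.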

Step 10. Map Predictions to Class Names

Map the predicted and true label indices to their human-readable class names:

```python
class_names = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck"
]
predicted_class_names = [class_names[label] for label in predicted_labels]
true_class_names = [class_names[label] for label in labels]
```

Step 11. Print Results

Display the results by comparing true and predicted class names for each image:

```python
for i, (true_label, predicted_label) in enumerate(zip(true_class_names, predicted_class_names)):
    print(f"Image {i + 1}: True Label = {true_label}, Predicted Label = {predicted_label}")
```

**Output:**

Image 1: True Label = cat, Predicted Label = cat
Image 2: True Label = ship, Predicted Label = ship
Image 3: True Label = ship, Predicted Label = ship
Image 4: True Label = airplane, Predicted Label = bird
Image 5: True Label = frog, Predicted Label = frog
Image 6: True Label = frog, Predicted Label = ship
Image 7: True Label = automobile, Predicted Label = automobile
Image 8: True Label = frog, Predicted Label = frog

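This shows the Swin Transformer model used for image classification on CIFAR-10 images without fine-tuning. In this run the predicted labels matched the true labels for "cat", "ship", "frog" and "automobile", with two errors: "airplane" was predicted as "bird" and one "frog" as "ship". Because the checkpoint was trained on ImageNet-1k and the modulo mapping in Step 9 is only a placeholder, these matches should not be read as a measure of accuracy; fine-tuning on CIFAR-10 is needed for meaningful predictions.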
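The comparison can be summarized as a single number. The sketch below computes accuracy and lists the mismatches, with the two lists copied from the output printed in Step 11; with only eight images, the figure is an illustration rather than a reliable estimate.

```python
# Values copied from the output printed in Step 11
true_class_names = ["cat", "ship", "ship", "airplane", "frog", "frog", "automobile", "frog"]
predicted_class_names = ["cat", "ship", "ship", "bird", "frog", "ship", "automobile", "frog"]

pairs = list(zip(true_class_names, predicted_class_names))
correct = sum(t == p for t, p in pairs)
accuracy = correct / len(pairs)
print(f"{correct}/{len(pairs)} correct, accuracy = {accuracy:.2f}")  # 6/8 correct, accuracy = 0.75

# Image number (1-based), true label, predicted label for each error
mistakes = [(i + 1, t, p) for i, (t, p) in enumerate(pairs) if t != p]
print(mistakes)  # [(4, 'airplane', 'bird'), (6, 'frog', 'ship')]
```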

Applications of Swin Transformer

Advantages

Limitations