Zero-Shot Learning for Novel Class Recognition using CLIP Model

Last Updated : 7 Aug, 2025

Zero-shot learning (ZSL) allows a model to classify objects from classes it never saw during training by relying on semantic information, such as text descriptions of those classes. The Contrastive Language-Image Pretraining (CLIP) model is a significant advance in zero-shot learning. Unlike traditional deep learning classifiers, which are limited to a fixed set of output classes, CLIP can generalize to new tasks and classes by comparing image embeddings with text embeddings. It aligns image and text representations in a shared embedding space via contrastive learning: during training, matching image-text pairs are pulled together and mismatched pairs are pushed apart.
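To make the contrastive objective more concrete, here is a minimal sketch in plain Python. It uses made-up 2-D embeddings rather than real CLIP outputs, and the function name `contrastive_loss` and the temperature value are assumptions chosen for illustration, not CLIP's actual training code. It only shows the idea: the loss is low when image i is most similar to text i, and high when the pairs are mismatched.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one list of logits against the index of the correct entry."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss where image i is paired with text i."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j] compares image i with text j, sharpened by the temperature.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    # Image-to-text direction: each row should peak on the diagonal.
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    loss_t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch (values are made up): aligned pairs point in similar directions.
aligned_images = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.2], [0.2, 0.9]]
# The same texts in swapped order, so every pair is mismatched.
shuffled_texts = [aligned_texts[1], aligned_texts[0]]

loss_aligned = contrastive_loss(aligned_images, aligned_texts)
loss_shuffled = contrastive_loss(aligned_images, shuffled_texts)
print(f"aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}")
```

Minimizing a loss of this shape is what places an image and its description close together in the shared space, which zero-shot classification later relies on.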

Step-by-Step Implementation

Let's walk through the step-by-step implementation of zero-shot learning for novel class recognition using the CLIP model.

Step 1: Install and Import Libraries

We install and import the required libraries:

!pip install transformers torch pillow

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
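If an import fails, it can help to confirm which packages are actually installed. The following is a small optional sketch using only the standard library; the helper name `installed_versions` is hypothetical and is not part of any of the libraries above.

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(["transformers", "torch", "pillow"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```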

Step 2: Load the Pre-trained CLIP Model

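Load a pre-trained CLIP checkpoint, openai/clip-vit-base-patch32, which OpenAI publishes on the Hugging Face Hub. The weights are downloaded on first use and cached locally. The model was trained on a large collection of images paired with text descriptions, which is what lets it generalize from text to unseen visual content.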

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

**Output:**

[Figure: Loading the Pre-Trained Model]

Step 3: Load the Image

The sample image used in this example can be downloaded from here.

Load the image to be classified. Use a direct file path when working locally, or a file-upload feature such as Colab's when working in a notebook environment.

image = Image.open("/content/dog.jpg")


Step 4: Define Class Labels and Preprocess Data for the Model

Define a list of text descriptions for the classes we want to distinguish. CLIP compares each of these labels against the image. The processor tokenizes the text, resizes and crops the image, and returns everything as PyTorch tensors.

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

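Because the labels are ordinary strings, the class list can be generated from bare class names. The sketch below assumes a hypothetical helper, `build_prompts`, and reuses the "a photo of a ..." wording from the labels above. Other templates may work better for other kinds of images.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn bare class names into CLIP-style text prompts."""
    return [template.format(name) for name in class_names]

class_names = ["dog", "cat", "car"]
labels = build_prompts(class_names)
print(labels)

# Adding a novel class only requires a new name, with no retraining.
labels_extended = build_prompts(class_names + ["bicycle"])
print(labels_extended[-1])
```

The resulting list can be passed to the processor in the same way as the hand-written labels. Note that this simple template does not adjust the article for names that start with a vowel, such as "airplane".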

Step 5: Perform Zero-Shot Classification

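Pass the processed inputs to the CLIP model. The output field logits_per_image holds a similarity score between the image and each label, and a softmax over the labels turns those scores into probabilities.

with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # probabilities over the labels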
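To see what this scoring step does, here is a toy version in plain Python: cosine similarity between one image embedding and several text embeddings, multiplied by a scale factor and passed through a softmax. The 3-D vectors and the scale of 100.0 are made-up values for illustration. Real CLIP embeddings have hundreds of dimensions, and the model learns its own scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 3-D embeddings standing in for real CLIP outputs.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.2, 0.9, 0.1],
    "a photo of a car": [0.1, 0.1, 0.9],
}
logit_scale = 100.0  # assumed value for illustration only

toy_logits = [logit_scale * cosine_similarity(image_embedding, t)
              for t in text_embeddings.values()]
toy_probs = softmax(toy_logits)
for label, p in zip(text_embeddings, toy_probs):
    print(f"{label}: {p:.4f}")
```

The large scale factor makes the softmax sharp, so a small difference in cosine similarity becomes a large difference in probability.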

Step 6: Get the Predicted Class

Determine the predicted class by finding the label with the highest probability. This step concludes the classification process, providing a zero-shot learning-based prediction.

predicted_class = labels[probs.argmax(dim=1).item()]  # index of the most probable label
print(f"Predicted class: {predicted_class}")

**Output:**

Predicted class: a photo of a dog
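Taking the argmax always returns one of the supplied labels, even when the image matches none of them. One possible refinement, sketched below, ranks all labels and falls back to "unknown" when the best probability is below a threshold. The helper `rank_predictions`, the threshold of 0.5, and the probability values are all illustrative assumptions. Because the softmax is computed only over the supplied labels, such a threshold is a rough heuristic rather than a calibrated confidence.

```python
def rank_predictions(labels, probabilities, top_k=3, min_confidence=0.5):
    """Return (decision, top_k ranked (label, probability) pairs)."""
    ranked = sorted(zip(labels, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    best_label, best_prob = ranked[0]
    # Fall back to "unknown" when no label is a clear winner.
    decision = best_label if best_prob >= min_confidence else "unknown"
    return decision, ranked[:top_k]

example_labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
confident = [0.97, 0.02, 0.01]  # made-up probabilities
uncertain = [0.40, 0.35, 0.25]  # made-up probabilities

decision, ranking = rank_predictions(example_labels, confident)
print(decision, ranking)
print(rank_predictions(example_labels, uncertain)[0])
```

With the variables from the steps above, the call would look like rank_predictions(labels, probs[0].tolist()).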

Application of CLIP in Novel Class Recognition

Zero-shot learning with CLIP can be applied across many industries, particularly where new categories appear faster than labeled data can be collected.

Strengths of CLIP for Zero-Shot Learning

  1. **Generalizes to Unseen Classes:** Can recognize new categories or objects without retraining, by leveraging descriptive text prompts.
  2. **No Task-Specific Data Needed:** Eliminates the need for additional labeled datasets or fine-tuning for each new task or class; supplying new textual descriptions is enough.
  3. **Cross-Domain Flexibility:** Handles a wide variety of tasks (classification, style recognition, etc.) across domains by relating images to language.
  4. **Efficient with Minimal Data:** Performs well even in few-shot or zero-shot scenarios, making it effective where data is limited or rapidly changing.

Limitations