CLIP (Contrastive Language-Image Pretraining)

Last Updated : 23 Jul, 2025

CLIP, or Contrastive Language-Image Pretraining, is an AI model developed by OpenAI. It learns to understand and relate textual descriptions and images. It is trained by contrasting matching and non-matching pairs of images and text, which makes it a highly useful tool for various real-world applications. In this article, we’ll see the fundamentals of how CLIP works, its innovative approach and its other core concepts.

What Makes CLIP Different?

CLIP is designed to understand the relationship between images and text. Unlike traditional models, it doesn't generate captions for images. Instead, it scores how well a given text description fits a particular image. In simpler terms, it matches images with relevant descriptions and tells us whether a description is a good fit for an image or not.

Before CLIP, state-of-the-art (SOTA) image classification models were limited to the specific categories they were trained on. If we wanted to classify an image into a new category, we needed to fine-tune the model, which required both computational resources and high-quality datasets. One of CLIP’s most important innovations is its ability to perform zero-shot learning.

This means that once trained, it can classify images into categories it was never explicitly trained on. Because it learns from both images and text during training, a new category only needs to be described in natural language: CLIP compares the image with each candidate description and picks the one that fits best.
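The zero-shot idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not CLIP itself: the three-dimensional embeddings are made-up toy values standing in for real encoder outputs, the prompt wording is only an example, and `zero_shot_classify` and the `scale` value of 100 are assumptions chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, prompt_embs, scale=100.0):
    """Turn image-to-prompt similarities into a probability per prompt.

    `scale` sharpens the softmax; 100.0 is an assumed value for this sketch.
    """
    labels = list(prompt_embs)
    logits = [scale * cosine(image_emb, prompt_embs[label]) for label in labels]
    return dict(zip(labels, softmax(logits)))

# Toy embeddings (hypothetical): a real system would get these from
# the text encoder and the image encoder.
prompt_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

probs = zero_shot_classify(image_emb, prompt_embs)
best = max(probs, key=probs.get)
print(best)
```

Adding a new category here only means adding another text prompt and its embedding; nothing is retrained.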

How CLIP Works

Let us understand the architectural details of CLIP. Below is the architecture of the CLIP neural network:

Figure: CLIP (Contrastive Language-Image Pretraining) architecture

**1. Text Encoder**

CLIP uses a Transformer-based model (similar to the model in the "Attention Is All You Need" paper). This model converts text into embeddings: dense vectors that capture the meaning of the text. Its text encoder is a 63M-parameter model with 12 layers and 8 attention heads.
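As a rough sanity check on the 63M figure, the parameter count of a standard Transformer encoder can be estimated by hand. The sketch below assumes a model width of 512, a vocabulary of 49,152 tokens and a context length of 76; these values come from the CLIP paper rather than from this article, and the function name `text_encoder_param_count` is made up for the example.

```python
def text_encoder_param_count(width=512, layers=12, vocab_size=49152, context_length=76):
    """Estimate the parameters of a standard Transformer text encoder.

    The number of attention heads does not change the count: heads only
    split the same projection matrices into smaller slices.
    """
    d = width
    # Query, key, value and output projections, each d x d plus a bias.
    attention = 4 * (d * d + d)
    # Feed-forward block: d -> 4d -> d, with biases.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d)
    per_layer = attention + mlp + layer_norms

    token_embedding = vocab_size * d
    position_embedding = context_length * d
    final_layer_norm = 2 * d
    return layers * per_layer + token_embedding + position_embedding + final_layer_norm

total = text_encoder_param_count()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands at about 63 million, with roughly 38 million in the Transformer blocks and 25 million in the token embedding table.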

**2. Image Encoder**

For the image encoder, the authors experimented with both ResNet and Vision Transformer (ViT) architectures. In the end, a ViT variant was chosen due to its superior performance in processing images. This encoder transforms images into embeddings that capture the image’s key features.

**3. Dataset**

CLIP was trained on a massive dataset of 400 million image-text pairs sourced from the web. To cover a broad range of concepts, the team searched for pairs whose text contained one of 500,000 queries. The base of this query list was every word that appeared at least 100 times in the English Wikipedia. This dataset, called **WebImageText (WIT)**, is important to CLIP’s ability to generalize to various visual and textual concepts.
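The word-frequency filtering described above can be illustrated with a toy sketch. This is not OpenAI's data pipeline: the corpus, the image-text pairs, the file names and the helper names `build_query_list` and `filter_pairs` are all hypothetical, and the threshold is lowered from 100 to 2 so that the tiny example produces a result.

```python
from collections import Counter

def build_query_list(corpus_docs, min_count):
    """Collect every word that occurs at least `min_count` times in the corpus."""
    counts = Counter(word for doc in corpus_docs for word in doc.lower().split())
    return {word for word, count in counts.items() if count >= min_count}

def filter_pairs(pairs, queries):
    """Keep (image, text) pairs whose text contains at least one query word."""
    kept = []
    for image_id, caption in pairs:
        if any(word in queries for word in caption.lower().split()):
            kept.append((image_id, caption))
    return kept

# Toy reference corpus standing in for Wikipedia.
corpus = ["the dog chased the ball", "a dog and a cat", "the cat sat"]
queries = build_query_list(corpus, min_count=2)

# Toy image-text pairs standing in for web data.
pairs = [
    ("img1.jpg", "A dog on a beach"),
    ("img2.jpg", "Sunset over mountains"),
    ("img3.jpg", "cat sleeping"),
]
kept = filter_pairs(pairs, queries)
print(kept)
```

Only the pairs that mention a frequent word survive, which is how a query list steers a web-scale crawl toward widely used concepts.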

**4. Training Objective**

CLIP's goal is to align text and image embeddings in a shared space, so that within each training batch the correct image-text pairs have high similarity while the incorrect pairings have low similarity.
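One common way to express this objective is a symmetric cross-entropy loss over the batch's image-text similarity matrix. The sketch below is a plain-Python illustration under stated assumptions: the embeddings are toy values, the temperature is fixed at 0.07 for simplicity (in practice it can be a learned parameter), and the function name `clip_loss` is made up for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair i of images matches pair i of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = similarity of image i and text j, sharpened by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should put its mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same check on each column.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of three pairs (hypothetical embeddings).
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
text_embs = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned_loss = clip_loss(image_embs, text_embs)
# Rotate the texts so that no image is paired with its own caption.
shuffled_loss = clip_loss(image_embs, text_embs[1:] + text_embs[:1])
print(aligned_loss, shuffled_loss)
```

The loss is small when each image is most similar to its own caption and grows when the pairing is scrambled, which is the signal that pulls matching embeddings together during training.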

CLIP’s Unique Features

CLIP stands out from traditional models because of these key features:

Real-World Applications of CLIP

CLIP has become a key part of many advanced AI models. Some of its most important uses include:

Limitations of CLIP

By integrating text and image data, CLIP opens up new possibilities for AI models, allowing them to tackle a wide range of tasks with minimal task-specific training and greater versatility.