ALIGN: A Large-scale ImaGe and Noisy-text Model

Last Updated : 21 Aug, 2025

ALIGN (A Large-scale ImaGe and Noisy-text model) is a vision-language model that aligns images with their associated textual descriptions, even when those texts are noisy or taken directly from the web. It is trained on a very large, minimally filtered dataset of image-text pairs and embeds both images and texts in a shared semantic space, which enables robust cross-modal understanding and retrieval. ALIGN adopts a dual-encoder structure: one encoder maps images to embeddings, a second encoder maps texts to embeddings, and matching pairs are placed close together in the shared space.
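To make the idea of a shared embedding space concrete, here is a minimal sketch of cross-modal retrieval using hand-written toy vectors. The 3-dimensional embeddings, the captions, and the helper names `cosine` and `retrieve` are illustrative assumptions, not ALIGN outputs; a real model produces much higher-dimensional embeddings from its two encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, candidates):
    """Return the index of the candidate embedding most similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (assumed values): in a dual encoder, image and text vectors
# come from different networks but live in the same space.
image_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
text_embeddings = [[0.2, 0.9, 0.1], [1.0, 0.0, 0.1]]
captions = ["a dog playing with a ball", "a cat sitting on a mat"]

for i, img in enumerate(image_embeddings):
    best = retrieve(img, text_embeddings)
    print(f"image {i} -> {captions[best]}")
```

Because both modalities share one space, the same `retrieve` helper could just as well take a text embedding as the query and rank image embeddings.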

Step-by-Step Implementation

Let's walk through using a pretrained ALIGN model step by step.

Step 1: Import Libraries

Let's import all the required libraries:

```python
from io import BytesIO

import requests
import torch
from PIL import Image
from torch.nn.functional import cosine_similarity
from transformers import AlignProcessor, AlignModel
```

Step 2: Model and Processor Initialization

Let's load the pretrained processor and model:

```python
processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")
```

Step 3: Text and Image Inputs

Define the texts and the image URLs. The text at index i describes the image at index i:

```python
texts = ["A cat sitting on a mat", "A dog playing with a ball"]
image_urls = [
    "https://i.postimg.cc/7hJL9KK4/fluffy-siberian-cat-sitting-on-600nw-2150187551.png",
    "https://i.postimg.cc/4xgzKc3y/images.jpg",
]
```

Step 4: Download and Prepare Images

The script downloads each image from its URL and converts it to RGB for further processing:

```python
images = []
for url in image_urls:
    resp = requests.get(url, timeout=30)
    # Stop on a failed download so images and texts stay paired by index
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    images.append(img)
```
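Step 7 compares image i with text i, so the two lists must keep the same length and order. If skipping an unreachable image is preferable to stopping, the matching text has to be dropped as well. The following is a small sketch of that bookkeeping; `load_pairs` and `fake_fetch` are hypothetical names, and `fetch` stands in for whatever download logic is used, returning `None` on failure.

```python
def load_pairs(texts, urls, fetch):
    """Keep only the (text, image) pairs whose image could be fetched.

    `fetch` is a hypothetical callable: it takes a URL and returns an image
    object, or None when the download fails.
    """
    kept_texts, kept_images = [], []
    for text, url in zip(texts, urls):
        image = fetch(url)
        if image is None:
            continue  # drop the text too, so indices keep matching
        kept_texts.append(text)
        kept_images.append(image)
    return kept_texts, kept_images

# Stand-in fetcher for illustration: pretends one URL is unreachable.
def fake_fetch(url):
    return None if "broken" in url else f"<image from {url}>"

texts_ok, images_ok = load_pairs(
    ["caption a", "caption b", "caption c"],
    [
        "http://example.com/a.png",
        "http://example.com/broken.png",
        "http://example.com/c.png",
    ],
    fake_fetch,
)
print(texts_ok)
print(len(images_ok))
```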

Step 5: Preprocess Images and Texts along with Device Setup

The processor resizes and normalizes the images and tokenizes the texts, padding or truncating each text to a fixed length, because the model expects uniformly shaped, correctly typed batch tensors. The model and the inputs are then moved to the GPU if one is available, otherwise to the CPU.

```python
inputs = processor(
    images=images,
    text=texts,
    return_tensors="pt",
    padding="max_length",
    max_length=128,
    truncation=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```
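To illustrate what padding to `max_length` with truncation does on the text side, the sketch below pads or truncates lists of token IDs to a fixed length and builds the matching attention mask. The token IDs and the pad ID of 0 are made-up values, and `pad_or_truncate` is a hypothetical helper; the real processor uses the tokenizer's own vocabulary and takes care of special tokens when truncating, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: cut anything too long
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # how many pad slots are needed
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Assumed token IDs for a short and a long sentence.
batch = [[101, 7, 8, 102], [101, 5, 6, 9, 4, 3, 102]]
for ids in batch:
    input_ids, attention_mask = pad_or_truncate(ids, max_length=6)
    print(input_ids, attention_mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor.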

Step 6: Compute Model Outputs and Extract Embeddings

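Run a forward pass and read the image and text embeddings from the output. Gradients are not needed for inference, so the call is wrapped in `torch.no_grad()`.

```python
with torch.no_grad():
    outputs = model(**inputs)

image_embeds = outputs.image_embeds
text_embeds = outputs.text_embeds
```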

Step 7: Calculate Cosine Similarity and Display Result

Compute the cosine similarity between each image and the text at the same index, then display the results.

```python
similarity = cosine_similarity(image_embeds, text_embeds)

for i, score in enumerate(similarity):
    print(f"Similarity score for pair {i+1}: {score.item():.4f}")
```

**Output:**

Similarity score for pair 1: 0.2410

Similarity score for pair 2: 0.2758
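Raw cosine scores can be small in absolute terms, as in the output above, so they are often easier to read relative to one another. One common way to compare several candidate texts for a single image is to divide the scores by a temperature and apply a softmax. The sketch below shows the arithmetic; the three scores and the temperature of 0.05 are arbitrary assumptions for illustration, not values taken from the pretrained model.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed cosine scores of one image against three candidate captions.
scores = [0.28, 0.21, 0.12]
probs = softmax_with_temperature(scores, temperature=0.05)
for score, prob in zip(scores, probs):
    print(f"cosine={score:.2f} -> probability={prob:.3f}")
```

A lower temperature sharpens the distribution around the best match, while a higher one flattens it.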

Application of ALIGN

Advantages

Limitations