CLIP (original) (raw)

Overview

The CLIP model was proposed in Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger and Ilya Sutskever. CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.

The abstract from the paper is the following:

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.

This model was contributed by valhalla. The original code can be found here.

Usage tips and example

CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. CLIP uses a ViT-like Transformer to get visual features and a causal language model to get the text features. Both the text and visual features are then projected to a latent space with identical dimensions. The dot product between the projected image and text features is then used as a similarity score.
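CLIPModel computes these scores internally; the minimal sketch below reproduces the projection-and-dot-product step by hand with the openai/clip-vit-base-patch32 checkpoint (CLIPModel additionally multiplies by a learned logit scale):

import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embeds = model.get_text_features(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# L2-normalize, then take the dot product; CLIPModel additionally scales by a learned temperature
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = image_embeds @ text_embeds.T  # shape (num_images, num_texts)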

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as the representation of the entire image. The authors also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. The CLIPImageProcessor can be used to resize (or rescale) and normalize images for the model.
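As a rough illustration (assuming the openai/clip-vit-base-patch32 checkpoint, whose processor resizes and center-crops to 224x224), the processed pixel_values can be inspected directly; with a patch size of 32 this corresponds to a 7x7 grid of patches plus the [CLS] token, i.e. 50 positions for the vision encoder:

from PIL import Image
import requests
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Resize, center-crop, rescale and normalize the image for the model
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])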

The CLIPTokenizer is used to encode the text. The CLIPProcessor wraps CLIPImageProcessor and CLIPTokenizer into a single instance to both encode the text and prepare the images. The following example shows how to get the image-text similarity scores using CLIPProcessor and CLIPModel.

from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

Combining CLIP and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2.

pip install -U flash-attn --no-build-isolation

Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Make sure as well to load your model in half-precision (e.g. torch.float16).

For small batch sizes, you might notice a slowdown in your model when using flash attention. Refer to the section Expected speedups with Flash Attention and SDPA below and select an appropriate attention implementation.
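As a rough heuristic (a sketch, not an official recommendation), you could pick the attention backend at load time based on the batch size you expect to serve:

import torch
from transformers import CLIPModel

batch_size = 32  # your typical inference batch size (the threshold below is hypothetical)
attn_impl = "flash_attention_2" if batch_size >= 16 else "sdpa"

model = CLIPModel.from_pretrained(
    "openai/clip-vit-base-patch32",
    attn_implementation=attn_impl,
    torch_dtype=torch.float16,
    device_map="cuda",
)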

To load and run a model using Flash Attention 2, refer to the snippet below:

import torch
import requests
from PIL import Image

from transformers import CLIPProcessor, CLIPModel

device = "cuda"
torch_dtype = torch.float16

model = CLIPModel.from_pretrained(
    "openai/clip-vit-base-patch32",
    attn_implementation="flash_attention_2",
    device_map=device,
    torch_dtype=torch_dtype,
)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
inputs.to(device)

with torch.no_grad():
    with torch.autocast(device):
        outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
# tensor([[0.9946, 0.0052]], device='cuda:0', dtype=torch.float16)

Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.

import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.float16, attn_implementation="sdpa")

For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or torch.bfloat16).

Expected speedups with Flash Attention and SDPA

On a local benchmark (NVIDIA A10G, PyTorch 2.3.1+cu121) with float16, we saw the following speedups during inference for the "openai/clip-vit-large-patch14" checkpoint (code):

CLIPTextModel

| Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|---|---|---|---|---|---|
| 4 | 0.009 | 0.012 | 0.737 | 0.007 | 1.269 |
| 16 | 0.009 | 0.014 | 0.659 | 0.008 | 1.187 |
| 32 | 0.018 | 0.021 | 0.862 | 0.016 | 1.142 |
| 64 | 0.034 | 0.034 | 1.001 | 0.03 | 1.163 |
| 128 | 0.063 | 0.058 | 1.09 | 0.054 | 1.174 |


CLIPVisionModel

| Image batch size | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|---|---|---|---|---|---|
| 1 | 0.016 | 0.013 | 1.247 | 0.012 | 1.318 |
| 4 | 0.025 | 0.021 | 1.198 | 0.021 | 1.202 |
| 16 | 0.093 | 0.075 | 1.234 | 0.075 | 1.24 |
| 32 | 0.181 | 0.147 | 1.237 | 0.146 | 1.241 |


CLIPModel

| Image batch size | Num text labels | Eager (s/iter) | FA2 (s/iter) | FA2 speedup | SDPA (s/iter) | SDPA speedup |
|---|---|---|---|---|---|---|
| 1 | 4 | 0.025 | 0.026 | 0.954 | 0.02 | 1.217 |
| 1 | 16 | 0.026 | 0.028 | 0.918 | 0.02 | 1.287 |
| 1 | 64 | 0.042 | 0.046 | 0.906 | 0.036 | 1.167 |
| 4 | 4 | 0.028 | 0.033 | 0.849 | 0.024 | 1.189 |
| 4 | 16 | 0.034 | 0.035 | 0.955 | 0.029 | 1.169 |
| 4 | 64 | 0.059 | 0.055 | 1.072 | 0.05 | 1.179 |
| 16 | 4 | 0.096 | 0.088 | 1.091 | 0.078 | 1.234 |
| 16 | 16 | 0.102 | 0.09 | 1.129 | 0.083 | 1.224 |
| 16 | 64 | 0.127 | 0.11 | 1.157 | 0.105 | 1.218 |
| 32 | 4 | 0.185 | 0.159 | 1.157 | 0.149 | 1.238 |
| 32 | 16 | 0.19 | 0.162 | 1.177 | 0.154 | 1.233 |
| 32 | 64 | 0.216 | 0.181 | 1.19 | 0.176 | 1.228 |

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP.

Image retrieval

Explainability

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.

CLIPConfig

class transformers.CLIPConfig

< source >

( text_config = None vision_config = None projection_dim = 512 logit_scale_init_value = 2.6592 **kwargs )

Parameters

CLIPConfig is the configuration class to store the configuration of a CLIPModel. It is used to instantiate a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import CLIPConfig, CLIPModel

# Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
configuration = CLIPConfig()

# Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
model = CLIPModel(configuration)

# Accessing the model configuration
configuration = model.config

# A CLIPConfig can also be initialized from a CLIPTextConfig and a CLIPVisionConfig
from transformers import CLIPTextConfig, CLIPVisionConfig

config_text = CLIPTextConfig()
config_vision = CLIPVisionConfig()

config = CLIPConfig.from_text_vision_configs(config_text, config_vision)

from_text_vision_configs

< source >

( text_config: CLIPTextConfig vision_config: CLIPVisionConfig **kwargs ) β†’ CLIPConfig

An instance of a configuration object

Instantiate a CLIPConfig (or a derived class) from a CLIP text model configuration and a CLIP vision model configuration.

CLIPTextConfig

class transformers.CLIPTextConfig

< source >

( vocab_size = 49408 hidden_size = 512 intermediate_size = 2048 projection_dim = 512 num_hidden_layers = 12 num_attention_heads = 8 max_position_embeddings = 77 hidden_act = 'quick_gelu' layer_norm_eps = 1e-05 attention_dropout = 0.0 initializer_range = 0.02 initializer_factor = 1.0 pad_token_id = 1 bos_token_id = 49406 eos_token_id = 49407 **kwargs )

Parameters

This is the configuration class to store the configuration of a CLIPTextModel. It is used to instantiate a CLIP text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the text encoder of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import CLIPTextConfig, CLIPTextModel

configuration = CLIPTextConfig()

model = CLIPTextModel(configuration)

configuration = model.config

CLIPVisionConfig

class transformers.CLIPVisionConfig

< source >

( hidden_size = 768 intermediate_size = 3072 projection_dim = 512 num_hidden_layers = 12 num_attention_heads = 12 num_channels = 3 image_size = 224 patch_size = 32 hidden_act = 'quick_gelu' layer_norm_eps = 1e-05 attention_dropout = 0.0 initializer_range = 0.02 initializer_factor = 1.0 **kwargs )

Parameters

This is the configuration class to store the configuration of a CLIPVisionModel. It is used to instantiate a CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import CLIPVisionConfig, CLIPVisionModel

configuration = CLIPVisionConfig()

model = CLIPVisionModel(configuration)

configuration = model.config

CLIPTokenizer

class transformers.CLIPTokenizer

< source >

( vocab_file merges_file errors = 'replace' unk_token = '<|endoftext|>' bos_token = '<|startoftext|>' eos_token = '<|endoftext|>' pad_token = '<|endoftext|>' **kwargs )

Parameters

Construct a CLIP tokenizer. Based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

build_inputs_with_special_tokens

< source >

( token_ids_0: List token_ids_1: Optional = None ) β†’ List[int]

Parameters

List of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A CLIP sequence has the following format:

- single sequence: <|startoftext|> X <|endoftext|>

Pairs of sequences are not the expected use case, but they will be handled without a separator.
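A small illustrative check (using the openai/clip-vit-base-patch32 tokenizer) shows the bos/eos special tokens wrapping a single sequence:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

input_ids = tokenizer("a photo of a cat")["input_ids"]
# The sequence is wrapped in <|startoftext|> ... <|endoftext|>
print(tokenizer.convert_ids_to_tokens(input_ids))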

get_special_tokens_mask

< source >

( token_ids_0: List token_ids_1: Optional = None already_has_special_tokens: bool = False ) β†’ List[int]

Parameters

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

create_token_type_ids_from_sequences

< source >

( token_ids_0: List token_ids_1: Optional = None ) β†’ List[int]

Parameters

List of zeros.

Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of zeros is returned.

save_vocabulary

< source >

( save_directory: str filename_prefix: Optional = None )

CLIPTokenizerFast

class transformers.CLIPTokenizerFast

< source >

( vocab_file = None merges_file = None tokenizer_file = None unk_token = '<|endoftext|>' bos_token = '<|startoftext|>' eos_token = '<|endoftext|>' pad_token = '<|endoftext|>' **kwargs )

Parameters

Construct a β€œfast” CLIP tokenizer (backed by HuggingFace’s tokenizers library). Based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

build_inputs_with_special_tokens

< source >

( token_ids_0: List token_ids_1: Optional = None ) β†’ List[int]

Parameters

List of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A CLIP sequence has the following format:

- single sequence: <|startoftext|> X <|endoftext|>

Pairs of sequences are not the expected use case, but they will be handled without a separator.

create_token_type_ids_from_sequences

< source >

( token_ids_0: List token_ids_1: Optional = None ) β†’ List[int]

Parameters

List of zeros.

Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of zeros is returned.

CLIPImageProcessor

class transformers.CLIPImageProcessor

< source >

( do_resize: bool = True size: Dict = None resample: Resampling = <Resampling.BICUBIC: 3> do_center_crop: bool = True crop_size: Dict = None do_rescale: bool = True rescale_factor: Union = 0.00392156862745098 do_normalize: bool = True image_mean: Union = None image_std: Union = None do_convert_rgb: bool = True **kwargs )

Parameters

Constructs a CLIP image processor.

preprocess

< source >

( images: Union do_resize: bool = None size: Dict = None resample: Resampling = None do_center_crop: bool = None crop_size: int = None do_rescale: bool = None rescale_factor: float = None do_normalize: bool = None image_mean: Union = None image_std: Union = None do_convert_rgb: bool = None return_tensors: Union = None data_format: Optional = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None **kwargs )

Parameters

Preprocess an image or batch of images.
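For example (a minimal sketch), preprocess can be called directly, and per-call arguments override the processor's stored defaults:

from PIL import Image
import numpy as np
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A dummy RGB image, just for illustration
image = Image.fromarray(np.random.randint(0, 256, (300, 400, 3), dtype=np.uint8))

# Override the stored defaults for this call only: no center crop, shorter edge resized to 288
batch = image_processor.preprocess(images=image, do_center_crop=False, size={"shortest_edge": 288}, return_tensors="pt")
print(batch["pixel_values"].shape)  # e.g. torch.Size([1, 3, 288, 384]) for a 300x400 input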

CLIPFeatureExtractor

CLIPProcessor

class transformers.CLIPProcessor

< source >

( image_processor = None tokenizer = None **kwargs )

Parameters

Constructs a CLIP processor which wraps a CLIP image processor and a CLIP tokenizer into a single processor.

CLIPProcessor offers all the functionalities of CLIPImageProcessor and CLIPTokenizerFast. See the __call__() and decode() for more information.

batch_decode

This method forwards all its arguments to CLIPTokenizerFast’s batch_decode(). Please refer to the docstring of this method for more information.

decode

This method forwards all its arguments to CLIPTokenizerFast’s decode(). Please refer to the docstring of this method for more information.
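A short usage sketch (with the openai/clip-vit-base-patch32 checkpoint) showing that the processor prepares both modalities at once and that decode() is forwarded to the tokenizer:

from PIL import Image
import numpy as np
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A dummy RGB image, just for illustration
image = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))

inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)
print(list(inputs.keys()))  # input_ids, attention_mask and pixel_values

# decode() forwards to the underlying tokenizer
print(processor.decode(inputs["input_ids"][0], skip_special_tokens=True))  # roughly "a photo of a cat"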

CLIPModel

class transformers.CLIPModel

< source >

( config: CLIPConfig )

Parameters

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( input_ids: Optional = None pixel_values: Optional = None attention_mask: Optional = None position_ids: Optional = None return_loss: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.models.clip.modeling_clip.CLIPOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.clip.modeling_clip.CLIPOutput or tuple(torch.FloatTensor)

A transformers.models.clip.modeling_clip.CLIPOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPConfig'>) and inputs.

The CLIPModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

get_text_features

< source >

( input_ids: Optional = None attention_mask: Optional = None position_ids: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ text_features (torch.FloatTensor of shape (batch_size, output_dim)

Parameters

Returns

text_features (torch.FloatTensor of shape (batch_size, output_dim)

The text embeddings obtained by applying the projection layer to the pooled output of CLIPTextModel.

The CLIPModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from transformers import AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
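The relation described above can be checked with a small sketch: applying CLIPModel's text_projection layer to the pooled output of its text encoder reproduces get_text_features (attribute names as in the current modeling code):

import torch
from transformers import AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat"], return_tensors="pt")

with torch.no_grad():
    pooled_output = model.text_model(**inputs).pooler_output  # pooled output of the text encoder
    projected = model.text_projection(pooled_output)          # apply the projection layer
    text_features = model.get_text_features(**inputs)

print(torch.allclose(projected, text_features, atol=1e-5))  # True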

get_image_features

< source >

( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ image_features (torch.FloatTensor of shape (batch_size, output_dim)

Parameters

Returns

image_features (torch.FloatTensor of shape (batch_size, output_dim)

The image embeddings obtained by applying the projection layer to the pooled output of CLIPVisionModel.

The CLIPModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

image_features = model.get_image_features(**inputs)

CLIPTextModel

class transformers.CLIPTextModel

< source >

( config: CLIPTextConfig )

Parameters

The text model from CLIP without any head or projection on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( input_ids: Optional = None attention_mask: Optional = None position_ids: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPTextConfig'>) and inputs.

The CLIPTextModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from transformers import AutoTokenizer, CLIPTextModel

model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output

CLIPTextModelWithProjection

class transformers.CLIPTextModelWithProjection

< source >

( config: CLIPTextConfig )

Parameters

CLIP Text Model with a projection layer on top (a linear layer on top of the pooled output).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( input_ids: Optional = None attention_mask: Optional = None position_ids: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.models.clip.modeling_clip.CLIPTextModelOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.clip.modeling_clip.CLIPTextModelOutput or tuple(torch.FloatTensor)

A transformers.models.clip.modeling_clip.CLIPTextModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPTextConfig'>) and inputs.

The CLIPTextModelWithProjection forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from transformers import AutoTokenizer, CLIPTextModelWithProjection

model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

outputs = model(**inputs)
text_embeds = outputs.text_embeds

CLIPVisionModelWithProjection

class transformers.CLIPVisionModelWithProjection

< source >

( config: CLIPVisionConfig )

Parameters

CLIP Vision Model with a projection layer on top (a linear layer on top of the pooled output).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.models.clip.modeling_clip.CLIPVisionModelOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.clip.modeling_clip.CLIPVisionModelOutput or tuple(torch.FloatTensor)

A transformers.models.clip.modeling_clip.CLIPVisionModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPVisionConfig'>) and inputs.

The CLIPVisionModelWithProjection forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModelWithProjection

model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
image_embeds = outputs.image_embeds

CLIPVisionModel

class transformers.CLIPVisionModel

< source >

( config: CLIPVisionConfig )

Parameters

The vision model from CLIP without any head or projection on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( pixel_values: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPVisionConfig'>) and inputs.

The CLIPVisionModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output

CLIPForImageClassification

class transformers.CLIPForImageClassification

< source >

( config: CLIPConfig )

Parameters

CLIP vision encoder with an image classification head on top (a linear layer on top of the pooled final hidden states of the patch tokens) e.g. for ImageNet.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >

( pixel_values: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.ImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CLIPConfig) and inputs.

The CLIPForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoImageProcessor, CLIPForImageClassification
import torch
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPForImageClassification.from_pretrained("openai/clip-vit-base-patch32")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
# LABEL_0

TFCLIPModel

class transformers.TFCLIPModel

< source >

( config: CLIPConfig *inputs **kwargs )

Parameters

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.

The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should β€œjust work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

Note that when creating models and layers with subclassing, you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
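As a small illustration (assuming the openai/clip-vit-base-patch32 checkpoint), the same text inputs can be passed either as keyword arguments or packed into a dict in the first positional argument:

from transformers import AutoTokenizer, TFCLIPModel

model = TFCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a photo of a cat"], padding=True, return_tensors="tf")

# Format 1: all inputs as keyword arguments
text_features = model.get_text_features(input_ids=tokens["input_ids"], attention_mask=tokens["attention_mask"])

# Format 2: all inputs as a dict in the first positional argument
text_features = model.get_text_features(tokens)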

call

< source >

( input_ids: TFModelInputType | None = None pixel_values: TFModelInputType | None = None attention_mask: np.ndarray | tf.Tensor | None = None position_ids: np.ndarray | tf.Tensor | None = None return_loss: Optional[bool] = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) β†’ transformers.models.clip.modeling_tf_clip.TFCLIPOutput or tuple(tf.Tensor)

Parameters

Returns

transformers.models.clip.modeling_tf_clip.TFCLIPOutput or tuple(tf.Tensor)

A transformers.models.clip.modeling_tf_clip.TFCLIPOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPConfig'>) and inputs.

The TFCLIPModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

import tensorflow as tf
from PIL import Image
import requests
from transformers import AutoProcessor, TFCLIPModel

model = TFCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="tf", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = tf.nn.softmax(logits_per_image, axis=1)

get_text_features

< source >

( input_ids: TFModelInputType | None = None attention_mask: np.ndarray | tf.Tensor | None = None position_ids: np.ndarray | tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) β†’ text_features (tf.Tensor of shape (batch_size, output_dim)

Parameters

Returns

text_features (tf.Tensor of shape (batch_size, output_dim)

The text embeddings obtained by applying the projection layer to the pooled output of TFCLIPTextModel.

The TFCLIPModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from transformers import AutoTokenizer, TFCLIPModel

model = TFCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="tf")
text_features = model.get_text_features(**inputs)

get_image_features

< source >

( pixel_values: TFModelInputType | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) β†’ image_features (tf.Tensor of shape (batch_size, output_dim)

Parameters

Returns

image_features (tf.Tensor of shape (batch_size, output_dim)

The image embeddings obtained by applying the projection layer to the pooled output of TFCLIPVisionModel.

The TFCLIPModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from PIL import Image
import requests
from transformers import AutoProcessor, TFCLIPModel

model = TFCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="tf")

image_features = model.get_image_features(**inputs)

TFCLIPTextModel

class transformers.TFCLIPTextModel

< source >

( config: CLIPTextConfig *inputs **kwargs )

call

< source >

( input_ids: TFModelInputType | None = None attention_mask: np.ndarray | tf.Tensor | None = None position_ids: np.ndarray | tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: Optional[bool] = False ) β†’ transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)

Parameters

A transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPTextConfig'>) and inputs.

The TFCLIPTextModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from transformers import AutoTokenizer, TFCLIPTextModel

model = TFCLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="tf")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output

TFCLIPVisionModel

class transformers.TFCLIPVisionModel

< source >

( config: CLIPVisionConfig *inputs **kwargs )

call

< source >

( pixel_values: TFModelInputType | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: Optional[bool] = False ) β†’ transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)

Parameters

A transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPVisionConfig'>) and inputs.

The TFCLIPVisionModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

from PIL import Image
import requests
from transformers import AutoProcessor, TFCLIPVisionModel

model = TFCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="tf")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output

FlaxCLIPModel

class transformers.FlaxCLIPModel

< source >

( config: CLIPConfig input_shape: Optional = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

Parameters

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading, saving and converting weights from PyTorch models).

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
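For instance (a minimal sketch, assuming the openai/clip-vit-base-patch32 checkpoint), text feature extraction can be wrapped in jax.jit so repeated calls with the same input shapes reuse the compiled function:

import jax
from transformers import AutoTokenizer, FlaxCLIPModel

model = FlaxCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="np")

# JIT-compile the text feature extraction
get_text_features = jax.jit(model.get_text_features)
text_features = get_text_features(inputs["input_ids"], inputs["attention_mask"])
print(text_features.shape)  # (2, 512) for the base checkpoint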

__call__

< source >

( input_ids pixel_values attention_mask = None position_ids = None params: dict = None dropout_rng: PRNGKey = None train: bool = False output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.models.clip.modeling_flax_clip.FlaxCLIPOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.clip.modeling_flax_clip.FlaxCLIPOutput or tuple(torch.FloatTensor)

A transformers.models.clip.modeling_flax_clip.FlaxCLIPOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPConfig'>) and inputs.

The FlaxCLIPPreTrainedModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

import jax
from PIL import Image
import requests
from transformers import AutoProcessor, FlaxCLIPModel

model = FlaxCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="np", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = jax.nn.softmax(logits_per_image, axis=1)

get_text_features

< source >

( input_ids attention_mask = None position_ids = None params: dict = None dropout_rng: PRNGKey = None train = False ) β†’ text_features (jnp.ndarray of shape (batch_size, output_dim)

Examples:

from transformers import AutoTokenizer, FlaxCLIPModel

model = FlaxCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="np")
text_features = model.get_text_features(**inputs)

get_image_features

< source >

( pixel_values params: dict = None dropout_rng: PRNGKey = None train = False ) β†’ image_features (jnp.ndarray of shape (batch_size, output_dim)

Parameters

Returns

image_features (jnp.ndarray of shape (batch_size, output_dim)

The image embeddings obtained by applying the projection layer to the pooled output of FlaxCLIPVisionModel

Examples:

from PIL import Image
import requests
from transformers import AutoProcessor, FlaxCLIPModel

model = FlaxCLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="np")

image_features = model.get_image_features(**inputs)

FlaxCLIPTextModel

class transformers.FlaxCLIPTextModel

< source >

( config: CLIPTextConfig input_shape = (1, 1) seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

__call__

< source >

( input_ids attention_mask = None position_ids = None params: dict = None dropout_rng: PRNGKey = None train: bool = False output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPTextConfig'>) and inputs.

The FlaxCLIPTextPreTrainedModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, FlaxCLIPTextModel

model = FlaxCLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="np")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output

FlaxCLIPTextModelWithProjection

class transformers.FlaxCLIPTextModelWithProjection

< source >

( config: CLIPTextConfig input_shape = (1, 1) seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

__call__

< source >

( input_ids attention_mask = None position_ids = None params: dict = None dropout_rng: PRNGKey = None train: bool = False output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.models.clip.modeling_flax_clip.FlaxCLIPTextModelOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.clip.modeling_flax_clip.FlaxCLIPTextModelOutput or tuple(torch.FloatTensor)

A transformers.models.clip.modeling_flax_clip.FlaxCLIPTextModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPTextConfig'>) and inputs.

The FlaxCLIPTextPreTrainedModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, FlaxCLIPTextModelWithProjection

model = FlaxCLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="np")

outputs = model(**inputs)
text_embeds = outputs.text_embeds

FlaxCLIPVisionModel

class transformers.FlaxCLIPVisionModel

< source >

( config: CLIPVisionConfig input_shape: Optional = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

__call__

< source >

( pixel_values params: dict = None dropout_rng: PRNGKey = None train: bool = False output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) β†’ transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (<class 'transformers.models.clip.configuration_clip.CLIPVisionConfig'>) and inputs.

The FlaxCLIPVisionPreTrainedModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from PIL import Image
import requests
from transformers import AutoProcessor, FlaxCLIPVisionModel

model = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="np")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output
