ALIGN

PyTorch

Overview

The ALIGN model was proposed in Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with EfficientNet as its vision encoder and BERT as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe.

The abstract from the paper is the following:

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.

This model was contributed by Alara Dirik. The original code is not released; this implementation is based on the Kakao Brain implementation of the original paper.

Usage example

ALIGN uses EfficientNet to get visual features and BERT to get the text features. Both the text and visual features are then projected into a latent space with the same dimension. The dot product between the projected image and text features is then used as a similarity score.
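The snippet below is an illustrative sketch of that scoring step, not the model's actual forward code: it uses random tensors in place of the projected embeddings (640 matches the default projection_dim of AlignConfig) and shows how L2-normalization turns the dot product into a cosine similarity.

```python
import torch

# Stand-ins for the projected embeddings the two encoders would produce;
# 640 matches AlignConfig's default projection_dim.
image_embeds = torch.randn(4, 640)  # a batch of 4 images
text_embeds = torch.randn(2, 640)   # a batch of 2 captions

# L2-normalize so the dot product is a cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# One similarity score per (image, text) pair
similarity = image_embeds @ text_embeds.T  # shape: (4, 2)
print(similarity.shape)
```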

AlignProcessor wraps EfficientNetImageProcessor and BertTokenizer into a single instance to both encode the text and preprocess the images. The following example shows how to get the image-text similarity scores using AlignProcessor and AlignModel.

```python
import requests
import torch
from PIL import Image
from transformers import AlignProcessor, AlignModel

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["an image of a cat", "an image of a dog"]

inputs = processor(images=image, text=candidate_labels, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# this is the image-text similarity score
logits_per_image = outputs.logits_per_image

# we can take the softmax to get the label probabilities
probs = logits_per_image.softmax(dim=1)
print(probs)
```
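As a follow-up, the probabilities can be paired with the candidate labels to read off the zero-shot prediction; this short sketch simply continues the example above.

```python
# Pair each probability with its candidate label
for label, prob in zip(candidate_labels, probs[0]):
    print(f"{label}: {prob.item():.4f}")
```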

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALIGN.

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.

AlignConfig

class transformers.AlignConfig


( text_config = None vision_config = None projection_dim = 640 temperature_init_value = 1.0 initializer_range = 0.02 **kwargs )


AlignConfig is the configuration class to store the configuration of an AlignModel. It is used to instantiate an ALIGN model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a configuration similar to that of the ALIGN kakaobrain/align-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

```python
from transformers import AlignConfig, AlignModel

# Initializing an AlignConfig with kakaobrain/align-base style configuration
configuration = AlignConfig()

# Initializing an AlignModel (with random weights) from the kakaobrain/align-base style configuration
model = AlignModel(configuration)

# Accessing the model configuration
configuration = model.config

# We can also initialize an AlignConfig from an AlignTextConfig and an AlignVisionConfig
from transformers import AlignTextConfig, AlignVisionConfig

# Initializing ALIGN text and vision configurations
config_text = AlignTextConfig()
config_vision = AlignVisionConfig()

config = AlignConfig.from_text_vision_configs(config_text, config_vision)
```

from_text_vision_configs


( text_config: AlignTextConfig vision_config: AlignVisionConfig **kwargs ) → AlignConfig

Returns

AlignConfig

An instance of a configuration object

Instantiate an AlignConfig (or a derived class) from an ALIGN text model configuration and an ALIGN vision model configuration.

AlignTextConfig

class transformers.AlignTextConfig


( vocab_size = 30522 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 512 type_vocab_size = 2 initializer_range = 0.02 layer_norm_eps = 1e-12 pad_token_id = 0 position_embedding_type = 'absolute' use_cache = True **kwargs )


This is the configuration class to store the configuration of an AlignTextModel. It is used to instantiate an ALIGN text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the text encoder of the ALIGN kakaobrain/align-base architecture. The default values here are copied from BERT.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

```python
from transformers import AlignTextConfig, AlignTextModel

# Initializing an AlignTextConfig with kakaobrain/align-base style configuration
configuration = AlignTextConfig()

# Initializing an AlignTextModel (with random weights) from the kakaobrain/align-base style configuration
model = AlignTextModel(configuration)

# Accessing the model configuration
configuration = model.config
```

AlignVisionConfig

class transformers.AlignVisionConfig


( num_channels: int = 3 image_size: int = 600 width_coefficient: float = 2.0 depth_coefficient: float = 3.1 depth_divisor: int = 8 kernel_sizes: typing.List[int] = [3, 3, 5, 3, 5, 5, 3] in_channels: typing.List[int] = [32, 16, 24, 40, 80, 112, 192] out_channels: typing.List[int] = [16, 24, 40, 80, 112, 192, 320] depthwise_padding: typing.List[int] = [] strides: typing.List[int] = [1, 2, 2, 2, 1, 2, 1] num_block_repeats: typing.List[int] = [1, 2, 2, 3, 3, 4, 1] expand_ratios: typing.List[int] = [1, 6, 6, 6, 6, 6, 6] squeeze_expansion_ratio: float = 0.25 hidden_act: str = 'swish' hidden_dim: int = 2560 pooling_type: str = 'mean' initializer_range: float = 0.02 batch_norm_eps: float = 0.001 batch_norm_momentum: float = 0.99 drop_connect_rate: float = 0.2 **kwargs )


This is the configuration class to store the configuration of an AlignVisionModel. It is used to instantiate an ALIGN vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the vision encoder of the ALIGN kakaobrain/align-base architecture. The default values are copied from EfficientNet (efficientnet-b7).

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

```python
from transformers import AlignVisionConfig, AlignVisionModel

# Initializing an AlignVisionConfig with kakaobrain/align-base style configuration
configuration = AlignVisionConfig()

# Initializing an AlignVisionModel (with random weights) from the kakaobrain/align-base style configuration
model = AlignVisionModel(configuration)

# Accessing the model configuration
configuration = model.config
```
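Since every field in the signature above can be overridden, a custom vision backbone can also be configured directly. A minimal sketch, with arbitrary illustrative values rather than a recommended setup:

```python
from transformers import AlignVisionConfig

# Overriding a few of the efficientnet-b7 defaults shown in the signature above;
# these particular values are arbitrary and only for illustration.
custom_configuration = AlignVisionConfig(
    image_size=380,         # default: 600
    width_coefficient=1.0,  # default: 2.0
    depth_coefficient=1.0,  # default: 3.1
)
print(custom_configuration.image_size)
```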

AlignProcessor

class transformers.AlignProcessor


( image_processor tokenizer )


Constructs an ALIGN processor which wraps EfficientNetImageProcessor and BertTokenizer/BertTokenizerFast into a single processor that inherits both the image processor and tokenizer functionalities. See __call__() and decode() for more information. The preferred way of passing kwargs is as a dictionary per modality; see the usage example below.

```python
from transformers import AlignProcessor
from PIL import Image

model_id = "kakaobrain/align-base"
processor = AlignProcessor.from_pretrained(model_id)

processor(
    images=your_pil_image,
    text=["What is that?"],
    images_kwargs={"crop_size": {"height": 224, "width": 224}},
    text_kwargs={"padding": "do_not_pad"},
    common_kwargs={"return_tensors": "pt"},
)
```

batch_decode

This method forwards all its arguments to BertTokenizerFast’s batch_decode(). Please refer to the docstring of this method for more information.

decode

This method forwards all its arguments to BertTokenizerFast’s decode(). Please refer to the docstring of this method for more information.
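A minimal sketch of that round trip, assuming text-only input to the processor is allowed (images omitted): tokenize a caption, then recover the string with batch_decode().

```python
from transformers import AlignProcessor

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")

# Tokenize text only, then round-trip the token ids back to strings
encoding = processor(text=["a photo of a cat"], return_tensors="pt")
decoded = processor.batch_decode(encoding["input_ids"], skip_special_tokens=True)
print(decoded)  # expected: ["a photo of a cat"]
```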

AlignModel

class transformers.AlignModel


( config: AlignConfig )


The bare ALIGN model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None inputs_embeds: typing.Optional[torch.Tensor] = None return_loss: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.align.modeling_align.AlignOutput or tuple(torch.FloatTensor)


Returns

transformers.models.align.modeling_align.AlignOutput or tuple(torch.FloatTensor)

A transformers.models.align.modeling_align.AlignOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AlignConfig) and inputs.

The AlignModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AlignModel

model = AlignModel.from_pretrained("kakaobrain/align-base")
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    images=image, text=["a photo of a cat", "a photo of a dog"], return_tensors="pt", padding=True
)

outputs = model(**inputs)

# this is the image-text similarity score
logits_per_image = outputs.logits_per_image

# we can take the softmax to get the label probabilities
probs = logits_per_image.softmax(dim=1)
```

get_text_features


( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None inputs_embeds: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → text_features (torch.FloatTensor of shape (batch_size, output_dim))


Returns

text_features (torch.FloatTensor of shape (batch_size, output_dim))

The text embeddings obtained by applying the projection layer to the pooled output of AlignTextModel.

Examples:

```python
from transformers import AutoTokenizer, AlignModel

model = AlignModel.from_pretrained("kakaobrain/align-base")
tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
```

get_image_features


( pixel_values: typing.Optional[torch.FloatTensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → image_features (torch.FloatTensor of shape (batch_size, output_dim))


Returns

image_features (torch.FloatTensor of shape (batch_size, output_dim))

The image embeddings obtained by applying the projection layer to the pooled output of AlignVisionModel.

Examples:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AlignModel

model = AlignModel.from_pretrained("kakaobrain/align-base")
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

image_features = model.get_image_features(**inputs)
```
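Combining get_image_features() and get_text_features() makes it possible to score image-text pairs manually. The sketch below L2-normalizes both embeddings and takes a dot product; note this is our own recomputation, and the model's logits_per_image additionally involve a learned temperature (see AlignConfig's temperature_init_value), so the raw numbers will differ by that scaling.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, AlignModel

model = AlignModel.from_pretrained("kakaobrain/align-base")
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_inputs = processor(images=image, return_tensors="pt")
text_inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# L2-normalize, then dot product = cosine similarity between modalities.
# Note: this is a manual recomputation; the model's logits_per_image also
# apply a learned temperature, so values differ by that scale factor.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T  # shape: (1, 2)
print(similarity)
```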

AlignTextModel

class transformers.AlignTextModel


( config: AlignTextConfig add_pooling_layer: bool = True )


The text model from ALIGN without any head or projection on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None token_type_ids: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None inputs_embeds: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor)

Returns

transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AlignConfig) and inputs.

The AlignTextModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
from transformers import AutoTokenizer, AlignTextModel

model = AlignTextModel.from_pretrained("kakaobrain/align-base")
tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")

inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output
```

AlignVisionModel

class transformers.AlignVisionModel


( config: AlignVisionConfig )


The vision model from ALIGN without any head or projection on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( pixel_values: typing.Optional[torch.FloatTensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or tuple(torch.FloatTensor)


Returns

transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPoolingAndNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (AlignConfig) and inputs.

The AlignVisionModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AlignVisionModel

model = AlignVisionModel.from_pretrained("kakaobrain/align-base")
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output
```
