VisualBERT (original) (raw)

PyTorch

Overview

The VisualBERT model was proposed in VisualBERT: A Simple and Performant Baseline for Vision and Language by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. VisualBERT is a neural network trained on a variety of (image, text) pairs.

The abstract from the paper is the following:

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2, and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.

This model was contributed by gchhablani. The original code can be found here.

Usage tips

  1. Most of the checkpoints provided work with the VisualBertForPreTraining configuration. Other checkpoints provided are the fine-tuned checkpoints for down-stream tasks - VQA (‘visualbert-vqa’), VCR (‘visualbert-vcr’), NLVR2 (‘visualbert-nlvr2’). Hence, if you are not working on these downstream tasks, it is recommended that you use the pretrained checkpoints.
  2. For the VCR task, the authors use a fine-tuned detector for generating visual embeddings, for all the checkpoints. We do not provide the detector and its weights as a part of the package, but it will be available in the research projects, and the states can be loaded directly into the detector provided.

VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice, visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare embeddings for image-text pairs. Both the text and visual features are then projected to a latent space with identical dimension.

To feed images to the model, each image is passed through a pre-trained object detector and the regions and the bounding boxes are extracted. The authors use the features generated after passing these regions through a pre-trained CNN like ResNet as visual embeddings. They also add absolute position embeddings, and feed the resulting sequence of vectors to a standard BERT model. The text input is concatenated in the front of the visual embeddings in the embedding layer, and is expected to be bound by [CLS] and a [SEP] tokens, as in BERT. The segment IDs must also be set appropriately for the textual and visual parts.

The BertTokenizer is used to encode the text. A custom detector/image processor must be used to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:

The following example shows how to get the last hidden state using VisualBertModel:

import torch from transformers import BertTokenizer, VisualBertModel

model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre") tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

inputs = tokenizer("What is the man eating?", return_tensors="pt")

visual_embeds = get_visual_embeddings(image_path)

visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float) inputs.update( ... { ... "visual_embeds": visual_embeds, ... "visual_token_type_ids": visual_token_type_ids, ... "visual_attention_mask": visual_attention_mask, ... } ... ) outputs = model(**inputs) last_hidden_state = outputs.last_hidden_state

VisualBertConfig

class transformers.VisualBertConfig

< source >

( vocab_size = 30522 hidden_size = 768 visual_embedding_dim = 512 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 512 type_vocab_size = 2 initializer_range = 0.02 layer_norm_eps = 1e-12 bypass_transformer = False special_visual_initialize = True pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 **kwargs )

Parameters

This is the configuration class to store the configuration of a VisualBertModel. It is used to instantiate an VisualBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VisualBERTuclanlp/visualbert-vqa-coco-pre architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import VisualBertConfig, VisualBertModel

configuration = VisualBertConfig.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

model = VisualBertModel(configuration)

configuration = model.config

VisualBertModel

class transformers.VisualBertModel

< source >

( config add_pooling_layer = True )

Parameters

The model can behave as an encoder (with only self-attention) following the architecture described in Attention is all you need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None visual_embeds: typing.Optional[torch.FloatTensor] = None visual_attention_mask: typing.Optional[torch.LongTensor] = None visual_token_type_ids: typing.Optional[torch.LongTensor] = None image_text_alignment: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple oftorch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

The VisualBertModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, VisualBertModel import torch

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("The capital of France is Paris.", return_tensors="pt") visual_embeds = get_visual_embeddings(image).unsqueeze(0) visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update( { "visual_embeds": visual_embeds, "visual_token_type_ids": visual_token_type_ids, "visual_attention_mask": visual_attention_mask, } )

outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

VisualBertForPreTraining

class transformers.VisualBertForPreTraining

< source >

( config )

Parameters

VisualBert Model with two heads on top as done during the pretraining: a masked language modeling head and asentence-image prediction (classification) head.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None visual_embeds: typing.Optional[torch.FloatTensor] = None visual_attention_mask: typing.Optional[torch.LongTensor] = None visual_token_type_ids: typing.Optional[torch.LongTensor] = None image_text_alignment: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None labels: typing.Optional[torch.LongTensor] = None sentence_image_labels: typing.Optional[torch.LongTensor] = None ) → transformers.models.visual_bert.modeling_visual_bert.VisualBertForPreTrainingOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.visual_bert.modeling_visual_bert.VisualBertForPreTrainingOutput or tuple(torch.FloatTensor)

A transformers.models.visual_bert.modeling_visual_bert.VisualBertForPreTrainingOutput or a tuple oftorch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

The VisualBertForPreTraining forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, VisualBertForPreTraining

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt") visual_embeds = get_visual_embeddings(image).unsqueeze(0) visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update( { "visual_embeds": visual_embeds, "visual_token_type_ids": visual_token_type_ids, "visual_attention_mask": visual_attention_mask, } ) max_length = inputs["input_ids"].shape[-1] + visual_embeds.shape[-2] labels = tokenizer( "The capital of France is Paris.", return_tensors="pt", padding="max_length", max_length=max_length )["input_ids"] sentence_image_labels = torch.tensor(1).unsqueeze(0)

outputs = model(**inputs, labels=labels, sentence_image_labels=sentence_image_labels) loss = outputs.loss prediction_logits = outputs.prediction_logits seq_relationship_logits = outputs.seq_relationship_logits

VisualBertForQuestionAnswering

class transformers.VisualBertForQuestionAnswering

< source >

( config )

Parameters

VisualBert Model with a classification/regression head on top (a dropout and a linear layer on top of the pooled output) for VQA.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None visual_embeds: typing.Optional[torch.FloatTensor] = None visual_attention_mask: typing.Optional[torch.LongTensor] = None visual_token_type_ids: typing.Optional[torch.LongTensor] = None image_text_alignment: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None labels: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple oftorch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

The VisualBertForQuestionAnswering forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, VisualBertForQuestionAnswering import torch

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa")

text = "Who is eating the apple?" inputs = tokenizer(text, return_tensors="pt") visual_embeds = get_visual_embeddings(image).unsqueeze(0) visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update( { "visual_embeds": visual_embeds, "visual_token_type_ids": visual_token_type_ids, "visual_attention_mask": visual_attention_mask, } )

labels = torch.tensor([[0.0, 1.0]]).unsqueeze(0)

outputs = model(**inputs, labels=labels) loss = outputs.loss scores = outputs.logits

VisualBertForMultipleChoice

class transformers.VisualBertForMultipleChoice

< source >

( config )

Parameters

The Visual Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None visual_embeds: typing.Optional[torch.FloatTensor] = None visual_attention_mask: typing.Optional[torch.LongTensor] = None visual_token_type_ids: typing.Optional[torch.LongTensor] = None image_text_alignment: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None labels: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.MultipleChoiceModelOutput or a tuple oftorch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

The VisualBertForMultipleChoice forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, VisualBertForMultipleChoice import torch

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced." choice0 = "It is eaten with a fork and a knife." choice1 = "It is eaten while held in the hand."

visual_embeds = get_visual_embeddings(image)

visual_embeds = visual_embeds.expand(1, 2, *visual_embeds.shape) visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

labels = torch.tensor(0).unsqueeze(0)

encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors="pt", padding=True)

inputs_dict = {k: v.unsqueeze(0) for k, v in encoding.items()} inputs_dict.update( { "visual_embeds": visual_embeds, "visual_attention_mask": visual_attention_mask, "visual_token_type_ids": visual_token_type_ids, "labels": labels, } ) outputs = model(**inputs_dict)

loss = outputs.loss logits = outputs.logits

VisualBertForVisualReasoning

class transformers.VisualBertForVisualReasoning

< source >

( config )

Parameters

VisualBert Model with a sequence classification head on top (a dropout and a linear layer on top of the pooled output) for Visual Reasoning e.g. for NLVR task.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None visual_embeds: typing.Optional[torch.FloatTensor] = None visual_attention_mask: typing.Optional[torch.LongTensor] = None visual_token_type_ids: typing.Optional[torch.LongTensor] = None image_text_alignment: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None labels: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple oftorch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

The VisualBertForVisualReasoning forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, VisualBertForVisualReasoning import torch

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") model = VisualBertForVisualReasoning.from_pretrained("uclanlp/visualbert-nlvr2")

text = "Who is eating the apple?" inputs = tokenizer(text, return_tensors="pt") visual_embeds = get_visual_embeddings(image).unsqueeze(0) visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update( { "visual_embeds": visual_embeds, "visual_token_type_ids": visual_token_type_ids, "visual_attention_mask": visual_attention_mask, } )

labels = torch.tensor(1).unsqueeze(0)

outputs = model(**inputs, labels=labels) loss = outputs.loss scores = outputs.logits

VisualBertForRegionToPhraseAlignment

class transformers.VisualBertForRegionToPhraseAlignment

< source >

( config )

Parameters

VisualBert Model with a Masked Language Modeling head and an attention layer on top for Region-to-Phrase Alignment e.g. for Flickr30 Entities task.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None token_type_ids: typing.Optional[torch.LongTensor] = None position_ids: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.LongTensor] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None visual_embeds: typing.Optional[torch.FloatTensor] = None visual_attention_mask: typing.Optional[torch.LongTensor] = None visual_token_type_ids: typing.Optional[torch.LongTensor] = None image_text_alignment: typing.Optional[torch.LongTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None region_to_phrase_position: typing.Optional[torch.LongTensor] = None labels: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.SequenceClassifierOutput or a tuple oftorch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VisualBertConfig) and inputs.

The VisualBertForRegionToPhraseAlignment forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, VisualBertForRegionToPhraseAlignment import torch

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") model = VisualBertForRegionToPhraseAlignment.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

text = "Who is eating the apple?" inputs = tokenizer(text, return_tensors="pt") visual_embeds = get_visual_embeddings(image).unsqueeze(0) visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float) region_to_phrase_position = torch.ones((1, inputs["input_ids"].shape[-1] + visual_embeds.shape[-2]))

inputs.update( { "region_to_phrase_position": region_to_phrase_position, "visual_embeds": visual_embeds, "visual_token_type_ids": visual_token_type_ids, "visual_attention_mask": visual_attention_mask, } )

labels = torch.ones( (1, inputs["input_ids"].shape[-1] + visual_embeds.shape[-2], visual_embeds.shape[-2]) )

outputs = model(**inputs, labels=labels) loss = outputs.loss scores = outputs.logits

< > Update on GitHub