Perceiver

PyTorch

Overview

The Perceiver IO model was proposed in Perceiver IO: A General Architecture for Structured Inputs & Outputs by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.

Perceiver IO is a generalization of Perceiver to handle arbitrary outputs in addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio. This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example, Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.

The abstract from the paper is the following:

The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves strong results on tasks with highly structured output spaces, such as natural language and visual understanding, StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation.

Here’s a TLDR explaining how Perceiver works:

The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512 tokens. Perceiver aims to solve this issue by performing self-attention not on the inputs themselves, but on a set of latent variables, and only using the inputs for cross-attention. In this way, the time and memory requirements no longer depend on the length of the inputs, as one uses a fixed number of latent variables, like 256 or 512. These are randomly initialized, after which they are trained end-to-end using backpropagation.
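To make this concrete, here is a minimal PyTorch sketch (not the Transformers implementation; all module choices and dimensions are illustrative) of the core idea: a fixed set of latents queries the inputs via cross-attention, and self-attention is then applied only among the latents, so its cost does not grow with the input length.

import torch
import torch.nn as nn

batch_size, input_len, d_model, num_latents, d_latents = 2, 2048, 768, 256, 1280

inputs = torch.randn(batch_size, input_len, d_model)         # embedded inputs (e.g. byte embeddings)
latents = nn.Parameter(torch.randn(num_latents, d_latents))  # randomly initialized, trained end-to-end

# cross-attention: the latents act as queries, the inputs provide keys and values
cross_attn = nn.MultiheadAttention(embed_dim=d_latents, kdim=d_model, vdim=d_model, num_heads=8, batch_first=True)
# self-attention: applied only among the latents, so its cost is independent of input_len
self_attn = nn.MultiheadAttention(embed_dim=d_latents, num_heads=8, batch_first=True)

queries = latents.unsqueeze(0).expand(batch_size, -1, -1)
hidden, _ = cross_attn(queries, inputs, inputs)  # (batch_size, num_latents, d_latents)
hidden, _ = self_attn(hidden, hidden, hidden)    # (batch_size, num_latents, d_latents)
print(hidden.shape)  # torch.Size([2, 256, 1280])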

Internally, PerceiverModel will create the latents, which is a tensor of shape (batch_size, num_latents, d_latents). One must provide inputs (which could be text, images, audio, you name it!) to the model, which it will use to perform cross-attention with the latents. The output of the Perceiver encoder is a tensor of the same shape. One can then, similarly to BERT, convert the last hidden states of the latents to classification logits by averaging along the sequence dimension and placing a linear layer on top to project the d_latents to num_labels.
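A hedged illustration of that pooling step (shapes only; in the library this is wrapped in PerceiverClassificationDecoder or a task-specific head):

import torch
import torch.nn as nn

batch_size, num_latents, d_latents, num_labels = 2, 256, 1280, 2

# last hidden states of the latents, as produced by the Perceiver encoder
last_hidden_state = torch.randn(batch_size, num_latents, d_latents)

# average over the latent (sequence) dimension, then project d_latents -> num_labels
classifier = nn.Linear(d_latents, num_labels)
logits = classifier(last_hidden_state.mean(dim=1))
print(logits.shape)  # torch.Size([2, 2])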

This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up work, Perceiver IO, the authors generalized it to let the model also produce outputs of arbitrary size. How, you might ask? The idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention with the last hidden states of the latents, using the outputs as queries and the latents as keys and values.
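The following sketch shows this decoding step in plain PyTorch (again illustrative, not the library code): the output queries have whatever size and dimensionality the task requires, and they attend to the latents.

import torch
import torch.nn as nn

batch_size, num_latents, d_latents, num_outputs, d_out = 2, 256, 1280, 2048, 768

latents = torch.randn(batch_size, num_latents, d_latents)     # final hidden states of the latents
output_queries = torch.randn(batch_size, num_outputs, d_out)  # task-defined decoder queries of arbitrary size

# decoder cross-attention: output queries attend to the latents (keys and values)
decoder_attn = nn.MultiheadAttention(embed_dim=d_out, kdim=d_latents, vdim=d_latents, num_heads=8, batch_first=True)
decoded, _ = decoder_attn(output_queries, latents, latents)
print(decoded.shape)  # torch.Size([2, 2048, 768])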

So let’s say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver’s input length does not impact the computation time of the self-attention layers, one can provide raw bytes, giving inputs of length 2048 to the model. If one now masks out some of these 2048 tokens, one can define the outputs as being of shape (batch_size, 2048, 768). Next, one performs cross-attention with the final hidden states of the latents to update the outputs tensor. After cross-attention, one still has a tensor of shape (batch_size, 2048, 768). One can then place a regular language modeling head on top to project the last dimension to the vocabulary size of the model, i.e. creating logits of shape (batch_size, 2048, 262) (as Perceiver uses a vocabulary size of 262 byte IDs).
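The final projection is then just a linear layer over the last dimension, as in this small sketch (dimensions taken from the paragraph above):

import torch
import torch.nn as nn

# decoder output for 2048 byte positions
decoded = torch.randn(1, 2048, 768)

# language modeling head: project to the 262 byte IDs of the Perceiver vocabulary
lm_head = nn.Linear(768, 262)
logits = lm_head(decoded)
print(logits.shape)  # torch.Size([1, 2048, 262])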

Perceiver IO architecture. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.

Perceiver does not work with torch.nn.DataParallel due to a bug in PyTorch, see issue #36035

Resources

Perceiver specific outputs

class transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput

< source >

( logits: typing.Optional[torch.FloatTensor] = None last_hidden_state: typing.Optional[torch.FloatTensor] = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

Base class for Perceiver base model’s outputs, with potential hidden states, attentions and cross-attentions.

class transformers.models.perceiver.modeling_perceiver.PerceiverDecoderOutput

< source >

( logits: typing.Optional[torch.FloatTensor] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

Base class for Perceiver decoder outputs, with potential cross-attentions.

class transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput

< source >

( loss: typing.Optional[torch.FloatTensor] = None logits: typing.Optional[torch.FloatTensor] = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

Base class for Perceiver’s masked language model outputs.

class transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput

< source >

( loss: typing.Optional[torch.FloatTensor] = None logits: typing.Optional[torch.FloatTensor] = None hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None cross_attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

Base class for Perceiver’s outputs of sequence/image classification models, optical flow and multimodal autoencoding.

PerceiverConfig

class transformers.PerceiverConfig

< source >

( num_latents = 256 d_latents = 1280 d_model = 768 num_blocks = 1 num_self_attends_per_block = 26 num_self_attention_heads = 8 num_cross_attention_heads = 8 qk_channels = None v_channels = None cross_attention_shape_for_attention = 'kv' self_attention_widening_factor = 1 cross_attention_widening_factor = 1 hidden_act = 'gelu' attention_probs_dropout_prob = 0.1 initializer_range = 0.02 layer_norm_eps = 1e-12 use_query_residual = True vocab_size = 262 max_position_embeddings = 2048 image_size = 56 train_size = [368, 496] num_frames = 16 audio_samples_per_frame = 1920 samples_per_patch = 16 output_shape = [1, 16, 224, 224] output_num_channels = 512 _label_trainable_num_channels = 1024 **kwargs )

Parameters

This is the configuration class to store the configuration of a PerceiverModel. It is used to instantiate a Perceiver model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Perceiver deepmind/language-perceiver architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import PerceiverModel, PerceiverConfig

# Initializing a Perceiver deepmind/language-perceiver style configuration
configuration = PerceiverConfig()

# Initializing a model from the deepmind/language-perceiver style configuration
model = PerceiverModel(configuration)

# Accessing the model configuration
configuration = model.config
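To work with the exact configuration of a released checkpoint rather than the defaults, the configuration can also be loaded from the Hub (shown here for deepmind/language-perceiver):

from transformers import PerceiverConfig

# Loading the configuration used by the deepmind/language-perceiver checkpoint
configuration = PerceiverConfig.from_pretrained("deepmind/language-perceiver")
print(configuration.num_latents, configuration.d_latents)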

PerceiverTokenizer

class transformers.PerceiverTokenizer

< source >

( pad_token = '[PAD]' bos_token = '[BOS]' eos_token = '[EOS]' mask_token = '[MASK]' cls_token = '[CLS]' sep_token = '[SEP]' model_max_length = 2048 **kwargs )

Parameters

Construct a Perceiver tokenizer. The Perceiver simply uses raw bytes (UTF-8 encoding).

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

__call__

< source >

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs ) → BatchEncoding

Parameters

A BatchEncoding with the encoded sequence(s) and additional fields depending on the arguments passed.

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
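A small usage sketch, showing that the tokenizer works directly on UTF-8 bytes (exact IDs depend on the tokenizer's special-token offset, so treat the printed values as illustrative):

from transformers import PerceiverTokenizer

tokenizer = PerceiverTokenizer()
encoding = tokenizer("héllo world", return_tensors="pt")
# roughly one ID per UTF-8 byte plus special tokens; "é" contributes two bytes
print(encoding.input_ids.shape)
print(tokenizer.decode(encoding.input_ids[0]))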

PerceiverFeatureExtractor

Preprocess an image or a batch of images.

PerceiverImageProcessor

class transformers.PerceiverImageProcessor

< source >

( do_center_crop: bool = True crop_size: typing.Dict[str, int] = None do_resize: bool = True size: typing.Dict[str, int] = None resample: Resampling = <Resampling.BICUBIC: 3> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None **kwargs )

Parameters

Constructs a Perceiver image processor.

preprocess

< source >

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_center_crop: typing.Optional[bool] = None crop_size: typing.Optional[typing.Dict[str, int]] = None do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

Parameters

Preprocess an image or batch of images.
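A short sketch of the preprocessing pipeline; the default output resolution is assumed to be 224 x 224, as used by the pretrained vision checkpoints:

import numpy as np
from PIL import Image
from transformers import PerceiverImageProcessor

image_processor = PerceiverImageProcessor()
# a dummy 300 x 400 RGB image; any PIL image, NumPy array or torch tensor works
image = Image.fromarray(np.random.randint(0, 256, (300, 400, 3), dtype=np.uint8))
inputs = image_processor(image, return_tensors="pt")
print(inputs.pixel_values.shape)  # expected: torch.Size([1, 3, 224, 224]) with the default settings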

PerceiverTextPreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor

< source >

( config: PerceiverConfig )

Text preprocessing for Perceiver Encoder. Can be used to embed inputs and add positional encodings.

The dimensionality of the embeddings is determined by the d_model attribute of the configuration.

PerceiverImagePreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor

< source >

( config prep_type = 'conv' spatial_downsample: int = 4 temporal_downsample: int = 1 position_encoding_type: str = 'fourier' in_channels: int = 3 out_channels: int = 64 conv_after_patching: bool = False conv_after_patching_in_channels: int = 54 conv2d_use_batchnorm: bool = True concat_or_add_pos: str = 'concat' project_pos_dim: int = -1 **position_encoding_kwargs )

Parameters

Image preprocessing for Perceiver Encoder.

Note: the out_channels argument refers to the output channels of a convolutional layer, if prep_type is set to “conv1x1” or “conv”. If one adds absolute position embeddings, one must make sure the num_channels of the position encoding kwargs are set equal to the out_channels.
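Mirroring the PerceiverModel example further down this page, a sketch of an instantiation that respects this constraint (conv1x1 preprocessing with a trainable position encoding whose num_channels equals out_channels):

from transformers import PerceiverConfig
from transformers.models.perceiver.modeling_perceiver import PerceiverImagePreprocessor

config = PerceiverConfig(image_size=224)
preprocessor = PerceiverImagePreprocessor(
    config,
    prep_type="conv1x1",
    spatial_downsample=1,
    out_channels=256,
    position_encoding_type="trainable",
    concat_or_add_pos="concat",
    project_pos_dim=256,
    # num_channels of the position encoding matches out_channels (both 256), per the note above
    trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size**2),
)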

PerceiverOneHotPreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor

< source >

( config: PerceiverConfig )

One-hot preprocessor for Perceiver Encoder. Can be used to add a dummy index dimension to the input.

PerceiverAudioPreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor

< source >

( config prep_type: str = 'patches' samples_per_patch: int = 96 position_encoding_type: str = 'fourier' concat_or_add_pos: str = 'concat' out_channels = 64 project_pos_dim = -1 **position_encoding_kwargs )

Parameters

Audio preprocessing for Perceiver Encoder.

PerceiverMultimodalPreprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor

< source >

( modalities: typing.Mapping[str, typing.Callable[..., typing.Tuple[torch.Tensor, typing.Optional[torch.Tensor], torch.Tensor]]] mask_probs: typing.Optional[typing.Mapping[str, float]] = None min_padding_size: int = 2 )

Parameters

Multimodal preprocessing for Perceiver Encoder.

Inputs for each modality are preprocessed, then padded with trainable position embeddings to have the same number of channels.

PerceiverProjectionDecoder

class transformers.models.perceiver.modeling_perceiver.PerceiverProjectionDecoder

< source >

( config )

Baseline projection decoder (no cross-attention).

PerceiverBasicDecoder

class transformers.models.perceiver.modeling_perceiver.PerceiverBasicDecoder

< source >

( config: PerceiverConfig output_num_channels: int position_encoding_type: typing.Optional[str] = 'trainable' output_index_dims: typing.Optional[int] = None num_channels: typing.Optional[int] = 128 subsampled_index_dims: typing.Optional[int] = None qk_channels: typing.Optional[int] = None v_channels: typing.Optional[int] = None num_heads: typing.Optional[int] = 1 widening_factor: typing.Optional[int] = 1 use_query_residual: typing.Optional[bool] = False concat_preprocessed_input: typing.Optional[bool] = False final_project: typing.Optional[bool] = True position_encoding_only: typing.Optional[bool] = False **position_encoding_kwargs )

Parameters

Cross-attention-based decoder. This class can be used to decode the final hidden states of the latents using a cross-attention operation, in which the latents produce keys and values.

The shape of the output of this class depends on how one defines the output queries (also called decoder queries).

PerceiverClassificationDecoder

class transformers.models.perceiver.modeling_perceiver.PerceiverClassificationDecoder

< source >

( config **decoder_kwargs )

Cross-attention based classification decoder. Light-weight wrapper of PerceiverBasicDecoder for logit output. Will turn the output of the Perceiver encoder which is of shape (batch_size, num_latents, d_latents) to a tensor of shape (batch_size, num_labels). The queries are of shape (batch_size, 1, num_labels).

PerceiverOpticalFlowDecoder

class transformers.models.perceiver.modeling_perceiver.PerceiverOpticalFlowDecoder

< source >

( config output_image_shape output_num_channels = 2 rescale_factor = 100.0 **decoder_kwargs )

Cross-attention based optical flow decoder.

PerceiverBasicVideoAutoencodingDecoder

class transformers.models.perceiver.modeling_perceiver.PerceiverBasicVideoAutoencodingDecoder

< source >

( config: PerceiverConfig output_shape: typing.List[int] position_encoding_type: str **decoder_kwargs )

Parameters

Cross-attention based video-autoencoding decoder. Light-weight wrapper of PerceiverBasicDecoder with video reshaping logic.

PerceiverMultimodalDecoder

class transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalDecoder

< source >

( config: PerceiverConfig modalities: typing.Dict[str, transformers.models.perceiver.modeling_perceiver.PerceiverAbstractDecoder] num_outputs: int output_num_channels: int min_padding_size: typing.Optional[int] = 2 subsampled_index_dims: typing.Optional[typing.Dict[str, transformers.models.perceiver.modeling_perceiver.PerceiverAbstractDecoder]] = None **decoder_kwargs )

Parameters

Multimodal decoding by composing uni-modal decoders. The modalities argument of the constructor is a dictionary mapping modality name to the decoder of that modality. That decoder will be used to construct queries for that modality. Modality-specific queries are padded with trainable modality-specific parameters, after which they are concatenated along the time dimension.

Next, there is a shared cross-attention operation across all modalities.

PerceiverProjectionPostprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor

< source >

( in_channels: int out_channels: int )

Parameters

Projection postprocessing for Perceiver. Can be used to project the channels of the decoder output to a lower dimension.

PerceiverAudioPostprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor

< source >

( config: PerceiverConfig in_channels: int postproc_type: str = 'patches' )

Parameters

Audio postprocessing for Perceiver. Can be used to convert the decoder output to audio features.

PerceiverClassificationPostprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor

< source >

( config: PerceiverConfig in_channels: int )

Parameters

Classification postprocessing for Perceiver. Can be used to convert the decoder output to classification logits.

PerceiverMultimodalPostprocessor

class transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor

< source >

( modalities: typing.Mapping[str, typing.Callable[..., typing.Any]] input_is_dict: bool = False )

Parameters

Multimodal postprocessing for Perceiver. Can be used to combine modality-specific postprocessors into a single postprocessor.

PerceiverModel

class transformers.PerceiverModel

< source >

( config decoder = None input_preprocessor: typing.Callable[..., typing.Tuple[torch.Tensor, typing.Optional[torch.Tensor], torch.Tensor]] = None output_postprocessor: typing.Callable[..., typing.Any] = None )

Parameters

The Perceiver: a scalable, fully attentional architecture.

Note that it’s possible to fine-tune Perceiver on higher resolution images than the ones it has been trained on, by setting interpolate_pos_encoding to True in the forward of the model. This will interpolate the pre-trained position embeddings to the higher resolution.
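A hedged sketch of that scenario, assuming the learned-position-embedding checkpoint and 384 x 384 inputs (shapes are illustrative, not a tested fine-tuning recipe):

import torch
from transformers import PerceiverForImageClassificationLearned

model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

# images at a higher resolution than the 224 x 224 used during pre-training
pixel_values = torch.randn(1, 3, 384, 384)
with torch.no_grad():
    outputs = model(inputs=pixel_values, interpolate_pos_encoding=True)
print(outputs.logits.shape)  # torch.Size([1, 1000])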

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( inputs: FloatTensor attention_mask: typing.Optional[torch.FloatTensor] = None subsampled_output_points: typing.Optional[typing.Dict[str, torch.Tensor]] = None head_mask: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: bool = False return_dict: typing.Optional[bool] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput or tuple(torch.FloatTensor)

Parameters

A transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

The PerceiverModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import PerceiverConfig, PerceiverTokenizer, PerceiverImageProcessor, PerceiverModel
from transformers.models.perceiver.modeling_perceiver import (
    PerceiverTextPreprocessor,
    PerceiverImagePreprocessor,
    PerceiverClassificationDecoder,
)
import torch
import requests
from PIL import Image

# EXAMPLE 1: using the Perceiver to classify texts
# - we define a TextPreprocessor, which embeds the input byte IDs
# - we define a ClassificationDecoder, which decodes the final hidden states of the latents to classification logits
config = PerceiverConfig()
preprocessor = PerceiverTextPreprocessor(config)
decoder = PerceiverClassificationDecoder(
    config,
    num_channels=config.d_latents,
    trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
    use_query_residual=True,
)
model = PerceiverModel(config, input_preprocessor=preprocessor, decoder=decoder)

# you can then do a forward pass as follows:
tokenizer = PerceiverTokenizer()
text = "hello world"
inputs = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    outputs = model(inputs=inputs)
logits = outputs.logits
list(logits.shape)
# [1, 2]

# to train, one can train the model using standard cross-entropy:
criterion = torch.nn.CrossEntropyLoss()

labels = torch.tensor([1])
loss = criterion(logits, labels)

# EXAMPLE 2: using the Perceiver to classify images
# - we define an ImagePreprocessor, which embeds the input images
config = PerceiverConfig(image_size=224)
preprocessor = PerceiverImagePreprocessor(
    config,
    prep_type="conv1x1",
    spatial_downsample=1,
    out_channels=256,
    position_encoding_type="trainable",
    concat_or_add_pos="concat",
    project_pos_dim=256,
    trainable_position_encoding_kwargs=dict(
        num_channels=256,
        index_dims=config.image_size**2,
    ),
)

model = PerceiverModel(
    config,
    input_preprocessor=preprocessor,
    decoder=PerceiverClassificationDecoder(
        config,
        num_channels=config.d_latents,
        trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
        use_query_residual=True,
    ),
)

# you can then do a forward pass as follows:
image_processor = PerceiverImageProcessor()
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(inputs=inputs)
logits = outputs.logits
list(logits.shape)
# [1, 2]

# to train, one can train the model using standard cross-entropy:
criterion = torch.nn.CrossEntropyLoss()

labels = torch.tensor([1])
loss = criterion(logits, labels)

PerceiverForMaskedLM

class transformers.PerceiverForMaskedLM

< source >

( config: PerceiverConfig )

Parameters

Example use of Perceiver for masked language modeling. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( inputs: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None input_ids: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput or tuple(torch.FloatTensor)

Parameters

A transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

The PerceiverForMaskedLM forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoTokenizer, PerceiverForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

# training
text = "This is an incomplete sentence where some words are missing."
inputs = tokenizer(text, padding="max_length", return_tensors="pt")

# mask the bytes corresponding to " missing."
inputs["input_ids"][0, 52:61] = tokenizer.mask_token_id
labels = tokenizer(text, padding="max_length", return_tensors="pt").input_ids

outputs = model(**inputs, labels=labels)
loss = outputs.loss
round(loss.item(), 2)
# 19.87

logits = outputs.logits
list(logits.shape)
# [1, 2048, 262]

# inference
text = "This is an incomplete sentence where some words are missing."
encoding = tokenizer(text, padding="max_length", return_tensors="pt")

# mask the bytes corresponding to " missing."
encoding["input_ids"][0, 52:61] = tokenizer.mask_token_id

# forward pass
with torch.no_grad():
    outputs = model(**encoding)
logits = outputs.logits
list(logits.shape)
# [1, 2048, 262]

masked_tokens_predictions = logits[0, 52:61].argmax(dim=-1).tolist()
tokenizer.decode(masked_tokens_predictions)
# ' missing.'

PerceiverForSequenceClassification

class transformers.PerceiverForSequenceClassification

< source >

( config )

Parameters

Example use of Perceiver for text classification. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( inputs: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None input_ids: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

The PerceiverForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoTokenizer, PerceiverForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForSequenceClassification.from_pretrained("deepmind/language-perceiver")

text = "hello world"
inputs = tokenizer(text, return_tensors="pt").input_ids
outputs = model(inputs=inputs)
logits = outputs.logits
list(logits.shape)
# [1, 2]

PerceiverForImageClassificationLearned

class transformers.PerceiverForImageClassificationLearned

< source >

( config )

Parameters

Example use of Perceiver for image classification, for tasks such as ImageNet.

This model uses learned position embeddings. In other words, this model is not given any privileged information about the structure of images. As shown in the paper, this model can achieve a top-1 accuracy of 72.7 on ImageNet.

PerceiverForImageClassificationLearned uses PerceiverImagePreprocessor (with prep_type="conv1x1") to preprocess the input images, and PerceiverClassificationDecoder to decode the latent representation of PerceiverModel into classification logits.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( inputs: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None interpolate_pos_encoding: bool = False return_dict: typing.Optional[bool] = None pixel_values: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

The PerceiverForImageClassificationLearned forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoImageProcessor, PerceiverForImageClassificationLearned
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("deepmind/vision-perceiver-learned")
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

inputs = image_processor(images=image, return_tensors="pt").pixel_values
outputs = model(inputs=inputs)
logits = outputs.logits
list(logits.shape)
# [1, 1000]

predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
# Predicted class: tabby, tabby cat

PerceiverForImageClassificationFourier

class transformers.PerceiverForImageClassificationFourier

< source >

( config )

Parameters

Example use of Perceiver for image classification, for tasks such as ImageNet.

This model uses fixed 2D Fourier position embeddings. As shown in the paper, this model can achieve a top-1 accuracy of 79.0 on ImageNet, and 84.5 when pre-trained on a large-scale dataset (i.e. JFT).

PerceiverForImageClassificationFourier uses PerceiverImagePreprocessor (with prep_type="pixels") to preprocess the input images, and PerceiverClassificationDecoder to decode the latent representation of PerceiverModel into classification logits.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( inputs: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None pixel_values: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

The PerceiverForImageClassificationFourier forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoImageProcessor, PerceiverForImageClassificationFourier
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("deepmind/vision-perceiver-fourier")
model = PerceiverForImageClassificationFourier.from_pretrained("deepmind/vision-perceiver-fourier")

inputs = image_processor(images=image, return_tensors="pt").pixel_values
outputs = model(inputs=inputs)
logits = outputs.logits
list(logits.shape)
# [1, 1000]

predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
# Predicted class: tabby, tabby cat

PerceiverForImageClassificationConvProcessing

class transformers.PerceiverForImageClassificationConvProcessing

< source >

( config )

Parameters

Example use of Perceiver for image classification, for tasks such as ImageNet.

This model uses a 2D conv+maxpool preprocessing network. As shown in the paper, this model can achieve a top-1 accuracy of 82.1 on ImageNet.

PerceiverForImageClassificationConvProcessing uses PerceiverImagePreprocessor (with prep_type="conv") to preprocess the input images, and PerceiverClassificationDecoder to decode the latent representation of PerceiverModel into classification logits.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( inputs: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None pixel_values: typing.Optional[torch.Tensor] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

The PerceiverForImageClassificationConvProcessing forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoImageProcessor, PerceiverForImageClassificationConvProcessing
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("deepmind/vision-perceiver-conv")
model = PerceiverForImageClassificationConvProcessing.from_pretrained("deepmind/vision-perceiver-conv")

inputs = image_processor(images=image, return_tensors="pt").pixel_values
outputs = model(inputs=inputs)
logits = outputs.logits
list(logits.shape)
# [1, 1000]

predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
# Predicted class: tabby, tabby cat

PerceiverForOpticalFlow

class transformers.PerceiverForOpticalFlow

< source >

( config )

Parameters

Example use of Perceiver for optical flow, for tasks such as Sintel and KITTI. PerceiverForOpticalFlow uses PerceiverImagePreprocessor (with prep_type="patches") to preprocess the input images, and PerceiverOpticalFlowDecoder to decode the latent representation of PerceiverModel.

As input, one concatenates 2 subsequent frames along the channel dimension and extracts a 3 x 3 patch around each pixel (leading to 3 x 3 x 3 x 2 = 54 values for each pixel). Fixed Fourier position encodings are used to encode the position of each pixel in the patch. Next, one applies the Perceiver encoder. To decode, one queries the latent representation using the same encoding used for the input.
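A hedged sketch of how such an input tensor could be assembled from two raw frames with torch.nn.functional.unfold; it reproduces the (batch, 2, 27, height, width) layout used in the forward example below, but the exact per-pixel channel ordering expected by the pretrained checkpoint is handled by the preprocessor and is not guaranteed to match this sketch:

import torch
import torch.nn.functional as F

def extract_3x3_patches(frame: torch.Tensor) -> torch.Tensor:
    # frame: (batch, 3, height, width) -> (batch, 27, height, width), one 3 x 3 neighbourhood per pixel
    batch, channels, height, width = frame.shape
    patches = F.unfold(frame, kernel_size=3, padding=1)  # (batch, 3 * 9, height * width)
    return patches.reshape(batch, channels * 9, height, width)

frame1 = torch.randn(1, 3, 368, 496)  # two consecutive video frames
frame2 = torch.randn(1, 3, 368, 496)
patches = torch.stack([extract_3x3_patches(frame1), extract_3x3_patches(frame2)], dim=1)
print(patches.shape)  # torch.Size([1, 2, 27, 368, 496])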

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( inputs: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

The PerceiverForOpticalFlow forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import PerceiverForOpticalFlow
import torch

model = PerceiverForOpticalFlow.from_pretrained("deepmind/optical-flow-perceiver")

patches = torch.randn(1, 2, 27, 368, 496)
outputs = model(inputs=patches)
logits = outputs.logits
list(logits.shape)
# [1, 368, 496, 2]

PerceiverForMultimodalAutoencoding

class transformers.PerceiverForMultimodalAutoencoding

< source >

( config: PerceiverConfig )

Parameters

Example use of Perceiver for multimodal (video) autoencoding, for tasks such as Kinetics-700.

PerceiverForMultimodalAutoencoding uses PerceiverMultimodalPreprocessor to preprocess the 3 modalities: images, audio and class labels. This preprocessor uses modality-specific preprocessors to preprocess every modality separately, after which they are concatenated. Trainable position embeddings are used to pad each modality to the same number of channels to make concatenation along the time dimension possible. Next, one applies the Perceiver encoder.

PerceiverMultimodalDecoder is used to decode the latent representation ofPerceiverModel. This decoder uses each modality-specific decoder to construct queries. The decoder queries are created based on the inputs after preprocessing. However, autoencoding an entire video in a single forward pass is computationally infeasible, hence one only uses parts of the decoder queries to do cross-attention with the latent representation. This is determined by the subsampled indices for each modality, which can be provided as additional input to the forward pass of PerceiverForMultimodalAutoencoding.

PerceiverMultimodalDecoder also pads the decoder queries of the different modalities to the same number of channels, in order to concatenate them along the time dimension. Next, cross-attention is performed with the latent representation of PerceiverModel.

Finally, PerceiverMultimodalPostprocessor is used to turn this tensor into an actual video. It first splits up the output into the different modalities, and then applies the respective postprocessor for each modality.

Note that, by masking the classification label during evaluation (i.e. simply providing a tensor of zeros for the “label” modality), this auto-encoding model becomes a Kinetics 700 video classifier.

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( inputs: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None subsampled_output_points: typing.Optional[typing.Dict[str, torch.Tensor]] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None ) → transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or tuple(torch.FloatTensor)

Parameters

A transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PerceiverConfig) and inputs.

The PerceiverForMultimodalAutoencoding forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import PerceiverForMultimodalAutoencoding
import torch
import numpy as np

# create multimodal inputs
images = torch.randn((1, 16, 3, 224, 224))
audio = torch.randn((1, 30720, 1))
inputs = dict(image=images, audio=audio, label=torch.zeros((images.shape[0], 700)))

model = PerceiverForMultimodalAutoencoding.from_pretrained("deepmind/multimodal-perceiver")

# in the Perceiver IO paper, videos are auto-encoded in chunks; each chunk subsamples
# different index dimensions of the image and audio modality decoder queries
nchunks = 128
image_chunk_size = np.prod((16, 224, 224)) // nchunks
audio_chunk_size = audio.shape[1] // model.config.samples_per_patch // nchunks

# process the first chunk
chunk_idx = 0
subsampling = {
    "image": torch.arange(image_chunk_size * chunk_idx, image_chunk_size * (chunk_idx + 1)),
    "audio": torch.arange(audio_chunk_size * chunk_idx, audio_chunk_size * (chunk_idx + 1)),
    "label": None,
}

outputs = model(inputs=inputs, subsampled_output_points=subsampling)
logits = outputs.logits
list(logits["audio"].shape)
# [1, 240]

list(logits["image"].shape)
# [1, 6272, 3]

list(logits["label"].shape)
# [1, 700]
