Convolutional Vision Transformer (CvT)

PyTorch TensorFlow

Overview

The CvT model was proposed in CvT: Introducing Convolutions to Vision Transformers by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan and Lei Zhang. The Convolutional vision Transformer (CvT) improves the Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.

The abstract from the paper is the following:

We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e. shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g. ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks.
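To make the first of the two modifications concrete, below is a minimal, illustrative PyTorch sketch of a convolutional token embedding in the spirit of the paper: a strided convolution stands in for ViT's linear patch projection, producing a token sequence while downsampling the feature map. The class name, defaults, and layer choices here are assumptions for illustration, not the library's implementation.

import torch
from torch import nn

class ConvTokenEmbedding(nn.Module):
    """Illustrative sketch of a CvT-style convolutional token embedding."""

    def __init__(self, in_channels=3, embed_dim=64, patch_size=7, stride=4, padding=2):
        super().__init__()
        # A strided convolution replaces ViT's linear patch projection
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, pixel_values):
        x = self.proj(pixel_values)           # (B, embed_dim, H', W')
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)      # (B, H'*W', embed_dim) token sequence
        return self.norm(x), (h, w)

tokens, (h, w) = ConvTokenEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 64]) for a 224x224 input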

This model was contributed by anugunj. The original code can be found here.

Usage tips

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CvT.

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

CvtConfig

class transformers.CvtConfig


( num_channels = 3 patch_sizes = [7, 3, 3] patch_stride = [4, 2, 2] patch_padding = [2, 1, 1] embed_dim = [64, 192, 384] num_heads = [1, 3, 6] depth = [1, 2, 10] mlp_ratio = [4.0, 4.0, 4.0] attention_drop_rate = [0.0, 0.0, 0.0] drop_rate = [0.0, 0.0, 0.0] drop_path_rate = [0.0, 0.0, 0.1] qkv_bias = [True, True, True] cls_token = [False, False, True] qkv_projection_method = ['dw_bn', 'dw_bn', 'dw_bn'] kernel_qkv = [3, 3, 3] padding_kv = [1, 1, 1] stride_kv = [2, 2, 2] padding_q = [1, 1, 1] stride_q = [1, 1, 1] initializer_range = 0.02 layer_norm_eps = 1e-12 **kwargs )

Parameters

This is the configuration class to store the configuration of a CvtModel. It is used to instantiate a CvT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the CvT microsoft/cvt-13 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import CvtConfig, CvtModel

# Initializing a CvT configuration (microsoft/cvt-13 style defaults)
configuration = CvtConfig()

# Initializing a model (with random weights) from that configuration
model = CvtModel(configuration)

# Accessing the model configuration
configuration = model.config
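Because the stage-level arguments (embed_dim, num_heads, depth, and so on) are per-stage lists, a custom variant can be defined by overriding those lists. A hedged sketch follows; the values are illustrative and do not correspond to an official checkpoint:

from transformers import CvtConfig, CvtModel

custom_config = CvtConfig(
    embed_dim=[64, 192, 384],
    num_heads=[1, 3, 6],
    depth=[1, 4, 16],               # a deeper final stage than the default
    drop_path_rate=[0.0, 0.1, 0.1],
)
model = CvtModel(custom_config)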

CvtModel

class transformers.CvtModel


( config add_pooling_layer = True )

Parameters

The bare Cvt Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( pixel_values: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) β†’ transformers.models.cvt.modeling_cvt.BaseModelOutputWithCLSToken or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.cvt.modeling_cvt.BaseModelOutputWithCLSToken or tuple(torch.FloatTensor)

A transformers.models.cvt.modeling_cvt.BaseModelOutputWithCLSToken or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CvtConfig) and inputs.

The CvtModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
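A minimal inference sketch for the bare model, assuming the microsoft/cvt-13 checkpoint and network access for the example image; the output field names follow BaseModelOutputWithCLSToken:

import torch
from transformers import AutoImageProcessor, CvtModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtModel.from_pretrained("microsoft/cvt-13")

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state  # final-stage feature map
cls_token = outputs.cls_token_value            # classification token of the final stage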

CvtForImageClassification

class transformers.CvtForImageClassification


( config )

Parameters

Cvt Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( pixel_values: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) β†’ transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)

Parameters

Returns

transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)

A transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CvtConfig) and inputs.

The CvtForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

from transformers import AutoImageProcessor, CvtForImageClassification
import torch
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# the model predicts one of the 1000 ImageNet classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
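Since the forward method also accepts labels, a loss can be obtained directly for fine-tuning. A hedged sketch with a dummy batch and a hypothetical two-class head:

import torch
from transformers import CvtForImageClassification

model = CvtForImageClassification.from_pretrained(
    "microsoft/cvt-13",
    num_labels=2,                  # hypothetical binary task
    ignore_mismatched_sizes=True,  # replace the 1000-class ImageNet head
)

pixel_values = torch.randn(4, 3, 224, 224)  # dummy batch
labels = torch.tensor([0, 1, 1, 0])

# passing labels makes the forward pass return a cross-entropy loss
outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()
print(outputs.loss.item())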

TFCvtModel

class transformers.TFCvtModel


( config: CvtConfig *inputs **kwargs )

Parameters

The bare Cvt Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TF 2.0 models accept two formats as inputs:

having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.

This second option is useful when using the keras.Model.fit method, which currently requires having all the tensors in the first argument of the model call function: model(inputs). A sketch of both call styles follows.
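A short sketch of the two call styles, assuming the microsoft/cvt-13 checkpoint and a dummy channels-first tensor in place of real pixel values:

import tensorflow as tf
from transformers import TFCvtModel

model = TFCvtModel.from_pretrained("microsoft/cvt-13")
pixel_values = tf.random.uniform((1, 3, 224, 224))  # dummy channels-first batch

outputs = model(pixel_values=pixel_values)       # format 1: keyword arguments
outputs = model({"pixel_values": pixel_values})  # format 2: a single dict as the first argument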

call


( pixel_values: tf.Tensor | None = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: Optional[bool] = False ) β†’ transformers.models.cvt.modeling_tf_cvt.TFBaseModelOutputWithCLSToken or tuple(tf.Tensor)

Parameters

Returns

transformers.models.cvt.modeling_tf_cvt.TFBaseModelOutputWithCLSToken or tuple(tf.Tensor)

A transformers.models.cvt.modeling_tf_cvt.TFBaseModelOutputWithCLSToken or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CvtConfig) and inputs.

The TFCvtModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoImageProcessor, TFCvtModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = TFCvtModel.from_pretrained("microsoft/cvt-13")

inputs = image_processor(images=image, return_tensors="tf")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
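Continuing the example above, passing output_hidden_states=True also returns the intermediate feature maps; printing their shapes is an easy sanity check (one feature map per stage is an assumption based on the default three-stage configuration):

outputs = model(**inputs, output_hidden_states=True, training=False)
for stage, hidden_state in enumerate(outputs.hidden_states):
    print(stage, hidden_state.shape)  # one feature map per stage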

TFCvtForImageClassification

class transformers.TFCvtForImageClassification


( config: CvtConfig *inputs **kwargs )

Parameters

Cvt Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TF 2.0 models accept two formats as inputs:

having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.

This second option is useful when using the keras.Model.fit method, which currently requires having all the tensors in the first argument of the model call function: model(inputs).

call


( pixel_values: tf.Tensor | None = None labels: tf.Tensor | None = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: Optional[bool] = False ) β†’ transformers.modeling_tf_outputs.TFImageClassifierOutputWithNoAttention or tuple(tf.Tensor)

Parameters

Returns

transformers.modeling_tf_outputs.TFImageClassifierOutputWithNoAttention or tuple(tf.Tensor)

A transformers.modeling_tf_outputs.TFImageClassifierOutputWithNoAttention or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (CvtConfig) and inputs.

The TFCvtForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

from transformers import AutoImageProcessor, TFCvtForImageClassification
import tensorflow as tf
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = TFCvtForImageClassification.from_pretrained("microsoft/cvt-13")

inputs = image_processor(images=image, return_tensors="tf")
outputs = model(**inputs)
logits = outputs.logits

# the model predicts one of the 1000 ImageNet classes
predicted_class_idx = tf.math.argmax(logits, axis=-1)[0]
print("Predicted class:", model.config.id2label[int(predicted_class_idx)])
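Because the class is a keras.Model subclass, it can also be fine-tuned with the standard Keras workflow. A hedged sketch with a hypothetical two-class task; train_dataset is assumed to yield (pixel_values, labels) batches:

import tensorflow as tf
from transformers import TFCvtForImageClassification

model = TFCvtForImageClassification.from_pretrained(
    "microsoft/cvt-13",
    num_labels=2,                  # hypothetical binary task
    ignore_mismatched_sizes=True,  # replace the 1000-class ImageNet head
)
# with no loss passed, Transformers TF models fall back to their internal loss
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
# model.fit(train_dataset, epochs=3)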
