SegFormer


The SegFormer model was proposed in SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. The model consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head to achieve great results on image segmentation benchmarks such as ADE20K and Cityscapes.

The abstract from the paper is the following:

We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perception (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, and thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C.

The figure below illustrates the architecture of SegFormer. Taken from the original paper.

This model was contributed by nielsr. The TensorFlow version of the model was contributed by sayakpaul. The original code can be found here.

Usage tips

| Model variant | Depths | Hidden sizes | Decoder hidden size | Params (M) | ImageNet-1k Top 1 |
|---|---|---|---|---|---|
| MiT-b0 | [2, 2, 2, 2] | [32, 64, 160, 256] | 256 | 3.7 | 70.5 |
| MiT-b1 | [2, 2, 2, 2] | [64, 128, 320, 512] | 256 | 14.0 | 78.7 |
| MiT-b2 | [3, 4, 6, 3] | [64, 128, 320, 512] | 768 | 25.4 | 81.6 |
| MiT-b3 | [3, 4, 18, 3] | [64, 128, 320, 512] | 768 | 45.2 | 83.1 |
| MiT-b4 | [3, 8, 27, 3] | [64, 128, 320, 512] | 768 | 62.6 | 83.6 |
| MiT-b5 | [3, 6, 40, 3] | [64, 128, 320, 512] | 768 | 82.0 | 83.8 |

Note that MiT in the above table refers to the Mix Transformer encoder backbone introduced in SegFormer. For SegFormer’s results on the segmentation datasets like ADE20k, refer to the paper.
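The table can be turned into configuration arguments. The sketch below copies the depths, hidden sizes and decoder hidden size from the table into a dictionary; `MIT_VARIANTS` and `config_kwargs` are hypothetical names introduced for illustration and are not part of the library.

```python
# Encoder settings per Mix Transformer variant, copied from the table above.
# MIT_VARIANTS and config_kwargs are illustrative helpers, not library APIs.
MIT_VARIANTS = {
    "mit-b0": {"depths": [2, 2, 2, 2], "hidden_sizes": [32, 64, 160, 256], "decoder_hidden_size": 256},
    "mit-b1": {"depths": [2, 2, 2, 2], "hidden_sizes": [64, 128, 320, 512], "decoder_hidden_size": 256},
    "mit-b2": {"depths": [3, 4, 6, 3], "hidden_sizes": [64, 128, 320, 512], "decoder_hidden_size": 768},
    "mit-b3": {"depths": [3, 4, 18, 3], "hidden_sizes": [64, 128, 320, 512], "decoder_hidden_size": 768},
    "mit-b4": {"depths": [3, 8, 27, 3], "hidden_sizes": [64, 128, 320, 512], "decoder_hidden_size": 768},
    "mit-b5": {"depths": [3, 6, 40, 3], "hidden_sizes": [64, 128, 320, 512], "decoder_hidden_size": 768},
}


def config_kwargs(variant: str) -> dict:
    """Return a fresh copy of the table row for `variant` (case-insensitive)."""
    try:
        row = MIT_VARIANTS[variant.lower()]
    except KeyError:
        raise ValueError(f"Unknown variant: {variant!r}") from None
    # Copy the lists so callers cannot mutate the shared table.
    return {key: list(value) if isinstance(value, list) else value for key, value in row.items()}


if __name__ == "__main__":
    print(config_kwargs("MiT-b2"))
```

The returned dictionary could then be passed as `SegformerConfig(**config_kwargs("MiT-b2"))`, assuming the remaining arguments keep their defaults; whether those defaults match a particular pretrained checkpoint should be checked against that checkpoint's own configuration.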


A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SegFormer.

Semantic segmentation:

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.


class transformers.SegformerConfig


( num_channels = 3 num_encoder_blocks = 4 depths = [2, 2, 2, 2] sr_ratios = [8, 4, 2, 1] hidden_sizes = [32, 64, 160, 256] patch_sizes = [7, 3, 3, 3] strides = [4, 2, 2, 2] num_attention_heads = [1, 2, 5, 8] mlp_ratios = [4, 4, 4, 4] hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 classifier_dropout_prob = 0.1 initializer_range = 0.02 drop_path_rate = 0.1 layer_norm_eps = 1e-06 decoder_hidden_size = 256 semantic_loss_ignore_index = 255 **kwargs )


This is the configuration class to store the configuration of a SegformerModel. It is used to instantiate a SegFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the SegFormer nvidia/segformer-b0-finetuned-ade-512-512 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.


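from transformers import SegformerModel, SegformerConfig

# Initialize a configuration with the default (nvidia/segformer-b0-finetuned-ade-512-512 style) values
configuration = SegformerConfig()

# Initialize a model (with random weights) from that configuration
model = SegformerModel(configuration)

# Access the model configuration
configuration = model.config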
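To see how the default `patch_sizes`, `strides`, `hidden_sizes` and `sr_ratios` shape the hierarchical encoder, the following sketch computes the feature-map size of each stage. It rests on two assumptions that are not stated in this page: each stage downsamples with a convolution padded by `patch_size // 2`, and the attention keys and values are shortened by a convolution whose kernel and stride both equal the stage's `sr_ratio`. The function names are illustrative only.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_shapes(height, width,
                 patch_sizes=(7, 3, 3, 3),
                 strides=(4, 2, 2, 2),
                 hidden_sizes=(32, 64, 160, 256),
                 sr_ratios=(8, 4, 2, 1)):
    """Return per-stage channel count, resolution and attention sequence lengths."""
    stages = []
    for kernel, stride, channels, sr in zip(patch_sizes, strides, hidden_sizes, sr_ratios):
        # Assumption: overlapping patch embedding with padding = kernel // 2.
        height = conv_out(height, kernel, stride, kernel // 2)
        width = conv_out(width, kernel, stride, kernel // 2)
        # Assumption: keys/values are reduced by a conv with kernel = stride = sr, no padding.
        kv_height = conv_out(height, sr, sr, 0) if sr > 1 else height
        kv_width = conv_out(width, sr, sr, 0) if sr > 1 else width
        stages.append({
            "channels": channels,
            "height": height,
            "width": width,
            "query_tokens": height * width,
            "kv_tokens": kv_height * kv_width,
        })
    return stages


if __name__ == "__main__":
    for index, stage in enumerate(stage_shapes(512, 512), start=1):
        print(f"stage {index}: {stage}")
```

Under these assumptions a 512 x 512 input yields stages at 1/4, 1/8, 1/16 and 1/32 of the input resolution, and the last stage has 256 channels on a 16 x 16 grid.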


( images segmentation_maps = None **kwargs )

Preprocesses a batch of images and optionally segmentation maps.

Overrides the __call__ method of the Preprocessor class so that both images and segmentation maps can be passed in as positional arguments.

( outputs target_sizes: List = None ) → semantic_segmentation


List[torch.Tensor] of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of SegformerForSemanticSegmentation into semantic segmentation maps. Only supports PyTorch.


class transformers.SegformerImageProcessor


( do_resize: bool = True size: Dict = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: Union = 0.00392156862745098 do_normalize: bool = True image_mean: Union = None image_std: Union = None do_reduce_labels: bool = False **kwargs )


Constructs a Segformer image processor.
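The `do_reduce_labels` flag in the signature above is not described on this page. The sketch below shows one common interpretation, assumed here rather than confirmed by the source: label 0 is treated as background and mapped to the ignore value 255, and every other label is shifted down by one. `reduce_labels` is a hypothetical helper operating on nested lists, not the library's implementation.

```python
def reduce_labels(segmentation_map, ignore_index=255):
    """Map label 0 to `ignore_index` and shift all other labels down by one.

    Assumption: this mirrors what `do_reduce_labels=True` is meant to do for
    datasets where 0 marks background; pixels already equal to `ignore_index`
    are left unchanged.
    """
    reduced = []
    for row in segmentation_map:
        new_row = []
        for label in row:
            if label == 0 or label == ignore_index:
                new_row.append(ignore_index)
            else:
                new_row.append(label - 1)
        reduced.append(new_row)
    return reduced


if __name__ == "__main__":
    print(reduce_labels([[0, 1, 2], [150, 255, 3]]))
```

The value 255 matches the default `semantic_loss_ignore_index` in SegformerConfig, so pixels mapped to it would not contribute to the loss.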



( images: Union segmentation_maps: Union = None do_resize: Optional = None size: Optional = None resample: Resampling = None do_rescale: Optional = None rescale_factor: Optional = None do_normalize: Optional = None image_mean: Union = None image_std: Union = None do_reduce_labels: Optional = None return_tensors: Union = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None )


Preprocess an image or batch of images.
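The rescale and normalize steps can be illustrated on a single pixel. The default `rescale_factor` in the signature, 0.00392156862745098, is 1/255. The mean and standard deviation used below are the widely used ImageNet statistics; that they are what the processor falls back to when `image_mean` and `image_std` are None is an assumption, and `preprocess_pixel` is a hypothetical helper.

```python
RESCALE_FACTOR = 0.00392156862745098  # default from the signature, i.e. 1 / 255

# Assumption: ImageNet channel statistics; pass your own if the checkpoint differs.
IMAGE_MEAN = (0.485, 0.456, 0.406)
IMAGE_STD = (0.229, 0.224, 0.225)


def preprocess_pixel(rgb, rescale_factor=RESCALE_FACTOR, mean=IMAGE_MEAN, std=IMAGE_STD):
    """Rescale 0-255 channel values to 0-1, then standardize each channel."""
    return tuple((value * rescale_factor - m) / s for value, m, s in zip(rgb, mean, std))


if __name__ == "__main__":
    print(preprocess_pixel((255, 255, 255)))  # a white pixel
    print(preprocess_pixel((0, 0, 0)))        # a black pixel
```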



( outputs target_sizes: List = None ) → semantic_segmentation




List[torch.Tensor] of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of SegformerForSemanticSegmentation into semantic segmentation maps. Only supports PyTorch.
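Conceptually, this post-processing resizes the per-class logits to the requested target size and then picks the highest-scoring class for each pixel. The following dependency-free sketch shows that idea on nested lists. It uses nearest-neighbour resizing purely for brevity, which is a substitution and not necessarily the interpolation the library applies; both function names are hypothetical.

```python
def nearest_resize(channel, target_height, target_width):
    """Resize a 2D list with nearest-neighbour sampling (stand-in for real interpolation)."""
    src_height, src_width = len(channel), len(channel[0])
    return [
        [channel[y * src_height // target_height][x * src_width // target_width]
         for x in range(target_width)]
        for y in range(target_height)
    ]


def logits_to_segmentation_map(logits, target_size=None):
    """Turn logits shaped [num_labels][height][width] into a [height][width] map of class ids."""
    if target_size is not None:
        logits = [nearest_resize(channel, *target_size) for channel in logits]
    num_labels = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    # For every pixel, keep the index of the class with the largest logit.
    return [
        [max(range(num_labels), key=lambda k: logits[k][y][x]) for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    toy_logits = [
        [[0.9, 0.1], [0.2, 0.8]],  # class 0
        [[0.1, 0.9], [0.8, 0.2]],  # class 1
    ]
    print(logits_to_segmentation_map(toy_logits))
    print(logits_to_segmentation_map(toy_logits, target_size=(4, 4)))
```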


class transformers.SegformerModel


( config )


The bare SegFormer encoder (Mix-Transformer) outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.



( pixel_values: FloatTensor output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)


A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SegformerConfig) and inputs.

The SegformerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.


from transformers import AutoImageProcessor, SegformerModel
import torch
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("nvidia/mit-b0")
model = SegformerModel.from_pretrained("nvidia/mit-b0")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
print(list(last_hidden_states.shape))  # [1, 256, 16, 16]


class transformers.SegformerDecodeHead


( config )



( encoder_hidden_states: FloatTensor )
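The decode head takes the multiscale encoder feature maps and, as the abstract describes, aggregates them with a lightweight all-MLP design. The sketch below only tracks tensor shapes through one plausible version of that pipeline (project every stage to `decoder_hidden_size`, upsample to the first stage's resolution, concatenate, fuse, classify). The exact sequence of steps is an assumption, and `decode_head_shapes` is a hypothetical helper.

```python
def decode_head_shapes(encoder_shapes, decoder_hidden_size, num_labels):
    """Trace (channels, height, width) shapes through an all-MLP decode head.

    `encoder_shapes` lists one (channels, height, width) tuple per encoder stage,
    ordered from the highest-resolution stage to the lowest.
    """
    _, out_height, out_width = encoder_shapes[0]
    num_stages = len(encoder_shapes)
    return {
        # Step 1: a per-stage linear layer unifies the channel dimension.
        "projected": [(decoder_hidden_size, h, w) for _, h, w in encoder_shapes],
        # Step 2: every map is upsampled to the first stage's resolution.
        "upsampled": [(decoder_hidden_size, out_height, out_width)] * num_stages,
        # Step 3: the maps are concatenated along the channel dimension.
        "concatenated": (decoder_hidden_size * num_stages, out_height, out_width),
        # Step 4: a fusion layer brings the channels back to decoder_hidden_size.
        "fused": (decoder_hidden_size, out_height, out_width),
        # Step 5: a classifier predicts one score per label and pixel.
        "logits": (num_labels, out_height, out_width),
    }


if __name__ == "__main__":
    b0_stages = [(32, 128, 128), (64, 64, 64), (160, 32, 32), (256, 16, 16)]
    for step, shape in decode_head_shapes(b0_stages, decoder_hidden_size=256, num_labels=150).items():
        print(step, shape)
```

With the default hidden sizes, a 512 x 512 input and 150 labels, the traced logits come out at 150 x 128 x 128, a quarter of the input resolution.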


class transformers.SegformerForImageClassification


( config )


SegFormer Model transformer with an image classification head on top (a linear layer on top of the final hidden states) e.g. for ImageNet.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.



( pixel_values: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.models.segformer.modeling_segformer.SegFormerImageClassifierOutput or tuple(torch.FloatTensor)




A transformers.models.segformer.modeling_segformer.SegFormerImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SegformerConfig) and inputs.

The SegformerForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.


from transformers import AutoImageProcessor, SegformerForImageClassification
import torch
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("nvidia/mit-b0")
model = SegformerForImageClassification.from_pretrained("nvidia/mit-b0")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# The model predicts one of the ImageNet classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])  # tabby, tabby cat
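The example above reports only the single best class. To inspect several candidates with probabilities, the logits can be passed through a softmax and ranked. The sketch below does this in plain Python on a list of logits; `softmax`, `top_k` and the three-entry label map are hypothetical and stand in for `model.config.id2label` and the real logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = softmax(logits)
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return [(id2label[i], probabilities[i]) for i in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical label map and logits, for illustration only.
    toy_id2label = {0: "tabby, tabby cat", 1: "tiger cat", 2: "Egyptian cat"}
    toy_logits = [2.0, 1.0, 0.1]
    for label, probability in top_k(toy_logits, toy_id2label, k=2):
        print(f"{label}: {probability:.3f}")
```

With a real model output, `logits[0].tolist()` would supply the list for a single image.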


class transformers.SegformerForSemanticSegmentation


( config )


SegFormer Model transformer with an all-MLP decode head on top, e.g. for ADE20k and Cityscapes. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.



( pixel_values: FloatTensor labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.SemanticSegmenterOutput or tuple(torch.FloatTensor)


A transformers.modeling_outputs.SemanticSegmenterOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SegformerConfig) and inputs.

The SegformerForSemanticSegmentation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.


from transformers import AutoImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import requests

image_processor = AutoImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

# Set url to the address of an input image (left empty here)
url = ""
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits  # shape (batch_size, num_labels, height/4, width/4)
print(list(logits.shape))  # [1, 150, 128, 128]
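The abstract reports results as mIoU. Once logits have been converted to a map of class ids, that metric can be computed against a ground-truth map. The following is a minimal sketch on nested lists that skips pixels labelled with the ignore value 255 (the default `semantic_loss_ignore_index`); `mean_iou` is a hypothetical helper and not the evaluation code used in the paper.

```python
def mean_iou(prediction, target, num_labels, ignore_index=255):
    """Mean intersection-over-union across classes that occur in prediction or target."""
    intersection = [0] * num_labels
    union = [0] * num_labels
    for predicted_row, target_row in zip(prediction, target):
        for predicted, actual in zip(predicted_row, target_row):
            if actual == ignore_index:
                continue  # ignored pixels do not count for any class
            if predicted == actual:
                intersection[actual] += 1
                union[actual] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[predicted] += 1
                union[actual] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    predicted_map = [[0, 0], [1, 1]]
    ground_truth = [[0, 1], [1, 255]]
    print(mean_iou(predicted_map, ground_truth, num_labels=2))
```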


class transformers.TFSegformerDecodeHead


( config: SegformerConfig **kwargs )



( encoder_hidden_states: tf.Tensor training: bool = False )


class transformers.TFSegformerModel


( config: SegformerConfig *inputs **kwargs )


The bare SegFormer encoder (Mix-Transformer) outputting raw hidden-states without any specific head on top. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.



( pixel_values: tf.Tensor output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutput or tuple(tf.Tensor)


A transformers.modeling_tf_outputs.TFBaseModelOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SegformerConfig) and inputs.

The TFSegformerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.


from transformers import AutoImageProcessor, TFSegformerModel
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("nvidia/mit-b0")
model = TFSegformerModel.from_pretrained("nvidia/mit-b0")

inputs = image_processor(image, return_tensors="tf")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
print(list(last_hidden_states.shape))  # [1, 256, 16, 16]


class transformers.TFSegformerForImageClassification


( config: SegformerConfig *inputs **kwargs )


SegFormer Model transformer with an image classification head on top (a linear layer on top of the final hidden states) e.g. for ImageNet.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.



( pixel_values: tf.Tensor | None = None labels: tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None ) → transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)


A transformers.modeling_tf_outputs.TFSequenceClassifierOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SegformerConfig) and inputs.

The TFSegformerForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.


from transformers import AutoImageProcessor, TFSegformerForImageClassification
import tensorflow as tf
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
image = dataset["test"]["image"][0]

image_processor = AutoImageProcessor.from_pretrained("nvidia/mit-b0")
model = TFSegformerForImageClassification.from_pretrained("nvidia/mit-b0")

inputs = image_processor(image, return_tensors="tf")
logits = model(**inputs).logits

# The model predicts one of the ImageNet classes
predicted_label = int(tf.math.argmax(logits, axis=-1))
print(model.config.id2label[predicted_label])  # tabby, tabby cat


class transformers.TFSegformerForSemanticSegmentation


( config: SegformerConfig **kwargs )


SegFormer Model transformer with an all-MLP decode head on top, e.g. for ADE20k and Cityscapes. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.



( pixel_values: tf.Tensor labels: tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None return_dict: Optional[bool] = None ) → transformers.modeling_tf_outputs.TFSemanticSegmenterOutput or tuple(tf.Tensor)




A transformers.modeling_tf_outputs.TFSemanticSegmenterOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SegformerConfig) and inputs.

The TFSegformerForSemanticSegmentation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.


from transformers import AutoImageProcessor, TFSegformerForSemanticSegmentation
from PIL import Image
import requests

# Set url to the address of an input image (left empty here)
url = ""
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
model = TFSegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

inputs = image_processor(images=image, return_tensors="tf")
outputs = model(**inputs, training=False)

logits = outputs.logits
print(list(logits.shape))  # [1, 150, 128, 128]
