Mask2Former

PyTorch

Overview

The Mask2Former model was proposed in Masked-attention Mask Transformer for Universal Image Segmentation by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. Mask2Former is a unified framework for panoptic, instance and semantic segmentation and features significant performance and efficiency improvements over MaskFormer.

The abstract from the paper is the following:

Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).

Mask2Former architecture. Taken from the original paper.

This model was contributed by Shivalika Singh and Alara Dirik. The original code can be found here.

Usage tips

Mask2Former uses the same preprocessing and postprocessing steps as MaskFormer. Use Mask2FormerImageProcessor or AutoImageProcessor to prepare images and optional targets for the model. To get the final segmentation, depending on the task, call post_process_semantic_segmentation(), post_process_instance_segmentation() or post_process_panoptic_segmentation(). All three tasks can be solved using Mask2FormerForUniversalSegmentation; panoptic segmentation accepts an optional label_ids_to_fuse argument to fuse instances of the target object(s) (e.g. sky) together.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Mask2Former.

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.

Mask2FormerConfig

class transformers.Mask2FormerConfig

( backbone_config: typing.Optional[typing.Dict] = None feature_size: int = 256 mask_feature_size: int = 256 hidden_dim: int = 256 encoder_feedforward_dim: int = 1024 activation_function: str = 'relu' encoder_layers: int = 6 decoder_layers: int = 10 num_attention_heads: int = 8 dropout: float = 0.0 dim_feedforward: int = 2048 pre_norm: bool = False enforce_input_projection: bool = False common_stride: int = 4 ignore_value: int = 255 num_queries: int = 100 no_object_weight: float = 0.1 class_weight: float = 2.0 mask_weight: float = 5.0 dice_weight: float = 5.0 train_num_points: int = 12544 oversample_ratio: float = 3.0 importance_sample_ratio: float = 0.75 init_std: float = 0.02 init_xavier_std: float = 1.0 use_auxiliary_loss: bool = True feature_strides: typing.List[int] = [4, 8, 16, 32] output_auxiliary_logits: typing.Optional[bool] = None backbone: typing.Optional[str] = None use_pretrained_backbone: bool = False use_timm_backbone: bool = False backbone_kwargs: typing.Optional[typing.Dict] = None **kwargs )

This is the configuration class to store the configuration of a Mask2FormerModel. It is used to instantiate a Mask2Former model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Mask2Former facebook/mask2former-swin-small-coco-instance architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Currently, Mask2Former only supports the Swin Transformer as backbone.

Examples:

```python
from transformers import Mask2FormerConfig, Mask2FormerModel

# Initializing a Mask2Former configuration
configuration = Mask2FormerConfig()

# Initializing a model (with random weights) from the configuration
model = Mask2FormerModel(configuration)

# Accessing the model configuration
configuration = model.config
```

from_backbone_config

Instantiate a Mask2FormerConfig (or a derived class) from a pre-trained backbone model configuration.
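For example, a minimal sketch (the default SwinConfig is used purely for illustration):

```python
from transformers import SwinConfig, Mask2FormerConfig

# A (randomly initialized) Swin backbone configuration, for illustration only
backbone_config = SwinConfig()

# Build a Mask2Former configuration around that backbone
config = Mask2FormerConfig.from_backbone_config(backbone_config)
```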

Mask2Former specific outputs

class transformers.models.mask2former.modeling_mask2former.Mask2FormerModelOutput

( encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None pixel_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None transformer_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None pixel_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None transformer_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None transformer_decoder_intermediate_states: typing.Tuple[torch.FloatTensor] = None masks_queries_logits: typing.Tuple[torch.FloatTensor] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Class for outputs of Mask2FormerModel. This class returns all the needed hidden states to compute the logits.

class transformers.models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput

( loss: typing.Optional[torch.FloatTensor] = None class_queries_logits: typing.Optional[torch.FloatTensor] = None masks_queries_logits: typing.Optional[torch.FloatTensor] = None auxiliary_logits: typing.Optional[typing.List[typing.Dict[str, torch.FloatTensor]]] = None encoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None pixel_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None transformer_decoder_last_hidden_state: typing.Optional[torch.FloatTensor] = None encoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None pixel_decoder_hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None transformer_decoder_hidden_states: typing.Optional[torch.FloatTensor] = None attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Class for outputs of Mask2FormerForUniversalSegmentation.

This output can be directly passed to post_process_semantic_segmentation(), post_process_instance_segmentation() or post_process_panoptic_segmentation() to compute final segmentation maps. Please see Mask2FormerImageProcessor for details regarding usage.

Mask2FormerModel

class transformers.Mask2FormerModel

( config: Mask2FormerConfig )

The bare Mask2Former Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: Tensor pixel_mask: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) β†’ transformers.models.mask2former.modeling_mask2former.Mask2FormerModelOutput or tuple(torch.FloatTensor)

Returns

A transformers.models.mask2former.modeling_mask2former.Mask2FormerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Mask2FormerConfig) and inputs.

The Mask2FormerModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
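Example: a minimal sketch of running the bare model (the checkpoint mirrors the segmentation examples below; the printed shape assumes the default configuration of 100 queries with hidden size 256):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerModel

# Load an image processor and the bare model from the same checkpoint
image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-coco-instance")
model = Mask2FormerModel.from_pretrained("facebook/mask2former-swin-small-coco-instance")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Final queries from the transformer decoder: (batch_size, num_queries, hidden_dim)
print(outputs.transformer_decoder_last_hidden_state.shape)
# torch.Size([1, 100, 256])
```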

Mask2FormerForUniversalSegmentation

class transformers.Mask2FormerForUniversalSegmentation

( config: Mask2FormerConfig )

The Mask2Former Model with heads on top for instance/semantic/panoptic segmentation.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: Tensor mask_labels: typing.Optional[typing.List[torch.Tensor]] = None class_labels: typing.Optional[typing.List[torch.Tensor]] = None pixel_mask: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None output_auxiliary_logits: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) β†’ transformers.models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput or tuple(torch.FloatTensor)

Returns

A transformers.models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Mask2FormerConfig) and inputs.

The Mask2FormerForUniversalSegmentation forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples:

Instance segmentation example:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Load a Mask2Former model fine-tuned for COCO instance segmentation
image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-coco-instance")
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-small-coco-instance"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Model predicts class_queries_logits of shape `(batch_size, num_queries)`
# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# Perform post-processing to get the instance segmentation map
pred_instance_map = image_processor.post_process_instance_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]["segmentation"]
print(pred_instance_map.shape)
# torch.Size([480, 640])
```

Semantic segmentation example:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Load a Mask2Former model fine-tuned for ADE20k semantic segmentation
image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-ade-semantic")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-small-ade-semantic")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Model predicts class_queries_logits of shape `(batch_size, num_queries)`
# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# Perform post-processing to get the semantic segmentation map
pred_semantic_map = image_processor.post_process_semantic_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]
print(pred_semantic_map.shape)
# torch.Size([512, 683])
```

Panoptic segmentation example:

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Load a Mask2Former model fine-tuned for Cityscapes panoptic segmentation
image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-cityscapes-panoptic")
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-small-cityscapes-panoptic"
)

url = "https://cdn-media.huggingface.co/Inference-API/Sample-results-on-the-Cityscapes-dataset-The-above-images-show-how-our-method-can-handle.png"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Model predicts class_queries_logits of shape `(batch_size, num_queries)`
# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

# Perform post-processing to get the panoptic segmentation map
pred_panoptic_map = image_processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]["segmentation"]
print(pred_panoptic_map.shape)
# torch.Size([338, 676])
```

Mask2FormerImageProcessor

class transformers.Mask2FormerImageProcessor

( do_resize: bool = True size: typing.Optional[typing.Dict[str, int]] = None size_divisor: int = 32 resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None ignore_index: typing.Optional[int] = None do_reduce_labels: bool = False num_labels: typing.Optional[int] = None **kwargs )

Constructs a Mask2Former image processor. The image processor can be used to prepare image(s) and optional targets for the model.

This image processor inherits from BaseImageProcessor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: typing.Optional[typing.Dict[int, int]] = None do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None size_divisor: typing.Optional[int] = None resample: Resampling = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None ignore_index: typing.Optional[int] = None do_reduce_labels: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

encode_inputs

( pixel_values_list: typing.List[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] = None instance_id_to_semantic_id: typing.Union[typing.List[typing.Dict[int, int]], typing.Dict[int, int], NoneType] = None ignore_index: typing.Optional[int] = None do_reduce_labels: bool = False return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) β†’ BatchFeature

Returns

BatchFeature

A BatchFeature with the following fields: pixel_values and pixel_mask, plus mask_labels and class_labels when segmentation_maps are provided.

Pad images up to the largest image in a batch and create a corresponding pixel_mask.

Mask2Former addresses semantic segmentation with a mask classification paradigm, so input segmentation maps are converted to lists of binary masks and their respective labels. For example, assuming segmentation_maps = [[2,6,7,9]], the output will contain mask_labels = [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]] (four binary masks) and class_labels = [2,6,7,9], the labels for each mask.
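A minimal sketch of this conversion (the toy image and segmentation map are invented for illustration; resizing is disabled so the shapes stay easy to read):

```python
import numpy as np
from transformers import Mask2FormerImageProcessor

processor = Mask2FormerImageProcessor(do_resize=False)

# A dummy channels-first 2x2 image and a segmentation map with two semantic classes
image = np.zeros((3, 2, 2), dtype=np.float32)
segmentation_map = np.array([[2, 2], [6, 6]], dtype=np.uint8)

inputs = processor(images=[image], segmentation_maps=[segmentation_map], return_tensors="pt")

print(inputs["pixel_values"].shape)    # torch.Size([1, 3, 2, 2])
print(inputs["mask_labels"][0].shape)  # torch.Size([2, 2, 2]): one binary mask per class
print(inputs["class_labels"][0])       # tensor([2, 6]): the label for each mask
```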

post_process_semantic_segmentation

( outputs target_sizes: typing.Optional[typing.List[typing.Tuple[int, int]]] = None ) β†’ List[torch.Tensor]

Returns

List[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of Mask2FormerForUniversalSegmentation into semantic segmentation maps. Only supports PyTorch.

post_process_instance_segmentation

( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 target_sizes: typing.Optional[typing.List[typing.Tuple[int, int]]] = None return_coco_annotation: typing.Optional[bool] = False return_binary_maps: typing.Optional[bool] = False ) β†’ List[Dict]

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys: segmentation (a tensor of shape (height, width) where each pixel represents a segment id; run-length encoded if return_coco_annotation=True, or a tensor of shape (num_instances, height, width) if return_binary_maps=True) and segments_info (a dictionary per segment with its id, label_id and score).

Converts the output of Mask2FormerForUniversalSegmentation into instance segmentation predictions. Only supports PyTorch. If instances could overlap, set either return_coco_annotation or return_binary_maps to True to get the correct segmentation result, as sketched below.
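A minimal sketch of the overlapping-instances case, reusing outputs, image_processor and image from the instance segmentation example above (the threshold is just the default made explicit):

```python
# Request one binary mask per detected instance instead of a single flat id map
results = image_processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,  # default score threshold, made explicit
    return_binary_maps=True,
    target_sizes=[(image.height, image.width)],
)[0]

print(results["segmentation"].shape)  # (num_instances, height, width)
for segment in results["segments_info"]:
    print(segment["id"], segment["label_id"], segment["score"])
```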

post_process_panoptic_segmentation

( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 label_ids_to_fuse: typing.Optional[typing.Set[int]] = None target_sizes: typing.Optional[typing.List[typing.Tuple[int, int]]] = None ) β†’ List[Dict]

Returns

List[Dict]

A list of dictionaries, one per image, each dictionary containing two keys: segmentation (a tensor of shape (height, width) where each pixel represents a segment id) and segments_info (a dictionary per segment with its id, label_id, was_fused flag and score).

Converts the output of Mask2FormerForUniversalSegmentation into panoptic segmentation predictions. Only supports PyTorch.
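A minimal sketch of label fusion, reusing outputs, image_processor and image from the panoptic segmentation example above; the ids in label_ids_to_fuse are hypothetical and should be looked up in the checkpoint's model.config.id2label mapping:

```python
# Fuse all instances of the listed ("stuff") classes into a single segment each;
# the ids {0, 1} are hypothetical placeholders
result = image_processor.post_process_panoptic_segmentation(
    outputs,
    label_ids_to_fuse={0, 1},
    target_sizes=[(image.height, image.width)],
)[0]

panoptic_map = result["segmentation"]  # (height, width) tensor of segment ids
for segment in result["segments_info"]:
    print(segment["id"], segment["label_id"], segment["was_fused"], segment["score"])
```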
