OneFormer

This model was released on 2022-11-10 and added to Hugging Face Transformers on 2023-01-19.

PyTorch

Overview

The OneFormer model was proposed in OneFormer: One Transformer to Rule Universal Image Segmentation by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. OneFormer is a universal image segmentation framework that can be trained on a single panoptic dataset to perform semantic, instance, and panoptic segmentation tasks. OneFormer uses a task token to condition the model on the task in focus, making the architecture task-guided for training, and task-dynamic for inference.

The abstract from the paper is the following:

Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, CityScapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible.
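The query-text contrastive loss mentioned in the abstract can be sketched as a symmetric InfoNCE-style objective between object-query and text-query embeddings. The sketch below is illustrative only and not the library's implementation; the default temperature of 0.07 mirrors the `contrastive_temperature` field of `OneFormerConfig`.

```python
import numpy as np

def contrastive_loss(obj_queries, text_queries, temperature=0.07):
    """Symmetric InfoNCE-style loss between L2-normalized object-query and
    text-query embeddings; the i-th text is the positive for the i-th query.
    Illustrative sketch only: the actual loss in the library may differ."""
    # L2-normalize both sets of embeddings
    q = obj_queries / np.linalg.norm(obj_queries, axis=-1, keepdims=True)
    t = text_queries / np.linalg.norm(text_queries, axis=-1, keepdims=True)
    # Pairwise cosine similarities, scaled by the temperature
    logits = q @ t.T / temperature          # (num_queries, num_queries)
    labels = np.arange(len(q))              # diagonal entries are positives

    def xent(l):
        # Row-wise cross-entropy with the diagonal as the target class
        l = l - l.max(axis=-1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the query-to-text and text-to-query directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(float(loss))
```

Perfectly aligned embeddings (each query identical to its text) drive the loss toward zero, which is what pushes the model to establish the inter-task and inter-class distinctions described above.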

The figure below illustrates the architecture of OneFormer. Taken from the original paper.

This model was contributed by Jitesh Jain. The original code can be found here.

Usage tips

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with OneFormer.

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource.

OneFormer specific outputs

class transformers.models.oneformer.modeling_oneformer.OneFormerModelOutput

( encoder_hidden_states: tuple[torch.FloatTensor] | None = None pixel_decoder_hidden_states: tuple[torch.FloatTensor] | None = None transformer_decoder_hidden_states: torch.FloatTensor | None = None transformer_decoder_object_queries: torch.FloatTensor | None = None transformer_decoder_contrastive_queries: torch.FloatTensor | None = None transformer_decoder_mask_predictions: torch.FloatTensor | None = None transformer_decoder_class_predictions: torch.FloatTensor | None = None transformer_decoder_auxiliary_predictions: tuple[dict[str, torch.FloatTensor]] | None = None text_queries: torch.FloatTensor | None = None task_token: torch.FloatTensor | None = None attentions: tuple[torch.FloatTensor] | None = None )

Parameters

Class for outputs of OneFormerModel. This class returns all the needed hidden states to compute the logits.

class transformers.models.oneformer.modeling_oneformer.OneFormerForUniversalSegmentationOutput

( loss: torch.FloatTensor | None = None class_queries_logits: torch.FloatTensor | None = None masks_queries_logits: torch.FloatTensor | None = None auxiliary_predictions: list = None encoder_hidden_states: tuple[torch.FloatTensor] | None = None pixel_decoder_hidden_states: list[torch.FloatTensor] | None = None transformer_decoder_hidden_states: torch.FloatTensor | None = None transformer_decoder_object_queries: torch.FloatTensor | None = None transformer_decoder_contrastive_queries: torch.FloatTensor | None = None transformer_decoder_mask_predictions: torch.FloatTensor | None = None transformer_decoder_class_predictions: torch.FloatTensor | None = None transformer_decoder_auxiliary_predictions: list[dict[str, torch.FloatTensor]] | None = None text_queries: torch.FloatTensor | None = None task_token: torch.FloatTensor | None = None attentions: tuple[tuple[torch.FloatTensor]] | None = None )

Parameters

Class for outputs of OneFormerForUniversalSegmentation.

This output can be directly passed to post_process_semantic_segmentation(), post_process_instance_segmentation(), or post_process_panoptic_segmentation(), depending on the task. See OneFormerImageProcessor for details regarding usage.

OneFormerConfig

class transformers.OneFormerConfig

( transformers_version: str | None = None architectures: list[str] | None = None output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None backbone_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None ignore_value: int = 255 num_queries: int = 150 no_object_weight: float = 0.1 class_weight: float = 2.0 mask_weight: float = 5.0 dice_weight: float = 5.0 contrastive_weight: float = 0.5 contrastive_temperature: float = 0.07 train_num_points: int = 12544 oversample_ratio: float = 3.0 importance_sample_ratio: float = 0.75 init_std: float = 0.02 init_xavier_std: float = 1.0 layer_norm_eps: float = 1e-05 is_training: bool = False use_auxiliary_loss: bool = True output_auxiliary_logits: bool = True strides: list[int] | tuple[int, ...] = (4, 8, 16, 32) task_seq_len: int = 77 text_encoder_width: int = 256 text_encoder_context_length: int = 77 text_encoder_num_layers: int = 6 text_encoder_vocab_size: int = 49408 text_encoder_proj_layers: int = 2 text_encoder_n_ctx: int = 16 conv_dim: int = 256 mask_dim: int = 256 hidden_dim: int = 256 encoder_feedforward_dim: int = 1024 norm: str = 'GN' encoder_layers: int = 6 decoder_layers: int = 10 use_task_norm: bool = True num_attention_heads: int = 8 dropout: float | int = 0.1 dim_feedforward: int = 2048 pre_norm: bool = False enforce_input_proj: bool = False query_dec_layers: int = 2 common_stride: int = 4 )

Parameters

This is the configuration class to store the configuration of a OneFormerModel. It is used to instantiate a OneFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of shi-labs/oneformer_ade20k_swin_tiny.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Examples:

```python
from transformers import OneFormerConfig, OneFormerModel

# Initializing a OneFormer configuration
configuration = OneFormerConfig()

# Initializing a model (with random weights) from the configuration
model = OneFormerModel(configuration)

# Accessing the model configuration
configuration = model.config
```
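For intuition, the loss-related configuration fields (class_weight, mask_weight, dice_weight, contrastive_weight) scale individual loss terms that are summed into the training objective. The sketch below is a simplified, hypothetical illustration of that weighted sum plus a basic dice loss; the real implementation additionally uses Hungarian matching and point sampling (train_num_points), which are omitted here.

```python
import numpy as np

def dice_loss(pred_probs, target, eps=1.0):
    """Dice loss between a predicted soft mask and a binary target mask
    (simplified: no point sampling, single mask)."""
    num = 2.0 * (pred_probs * target).sum()
    den = pred_probs.sum() + target.sum() + eps
    return 1.0 - (num + eps) / den

def total_loss(class_ce, mask_bce, mask_dice, contrastive,
               class_weight=2.0, mask_weight=5.0,
               dice_weight=5.0, contrastive_weight=0.5):
    """Weighted sum of the individual loss terms, using the
    OneFormerConfig defaults as weights."""
    return (class_weight * class_ce + mask_weight * mask_bce
            + dice_weight * mask_dice + contrastive_weight * contrastive)

# A confident but imperfect prediction against an all-foreground target
pred = np.full((4, 4), 0.9)
target = np.ones((4, 4))
print(round(float(dice_loss(pred, target)), 4))
```

A perfect prediction drives the dice term to zero, while the weights control how much each term contributes to the objective.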

OneFormerImageProcessor

class transformers.OneFormerImageProcessor

( **kwargs: typing_extensions.Unpack[transformers.models.oneformer.image_processing_oneformer.OneFormerImageProcessorKwargs] )

Parameters

Constructs a OneFormerImageProcessor image processor.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] task_inputs: list[str] | None = None segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: list[dict[int, int]] | dict[int, int] | None = None **kwargs: typing_extensions.Unpack[transformers.models.oneformer.image_processing_oneformer.OneFormerImageProcessorKwargs] ) → ~image_processing_base.BatchFeature

Parameters

Returns

~image_processing_base.BatchFeature

post_process_semantic_segmentation

( outputs target_sizes: list[tuple[int, int]] | None = None ) → List[torch.Tensor]

Parameters

Returns

List[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of OneFormerForUniversalSegmentation into semantic segmentation maps. Only supports PyTorch.
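Conceptually, semantic post-processing combines per-query class probabilities with per-query mask probabilities and takes a per-pixel argmax over classes. The NumPy sketch below mirrors this MaskFormer-style recipe on toy inputs; resizing to target_sizes and other details of the actual method are omitted.

```python
import numpy as np

def semantic_map(class_queries_logits, masks_queries_logits):
    """Simplified sketch: softmax over classes (dropping the trailing
    "no object" class), sigmoid over masks, then a per-pixel argmax."""
    # Softmax over the class dimension, minus the null class
    c = np.exp(class_queries_logits - class_queries_logits.max(axis=-1, keepdims=True))
    c = (c / c.sum(axis=-1, keepdims=True))[..., :-1]   # (B, Q, C)
    # Sigmoid over the mask logits
    m = 1.0 / (1.0 + np.exp(-masks_queries_logits))     # (B, Q, H, W)
    # Aggregate per-class score maps across queries
    seg = np.einsum("bqc,bqhw->bchw", c, m)             # (B, C, H, W)
    return seg.argmax(axis=1)                           # (B, H, W) class ids

B, Q, C, H, W = 1, 3, 4, 2, 2   # toy sizes; C includes the null class
rng = np.random.default_rng(0)
out = semantic_map(rng.normal(size=(B, Q, C)), rng.normal(size=(B, Q, H, W)))
print(out.shape)
```

The result has one class id per pixel, matching the (height, width) maps described above.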

post_process_instance_segmentation

( outputs task_type: str = 'instance' is_demo: bool = True threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 target_sizes: list[tuple[int, int]] | None = None return_coco_annotation: bool | None = False ) → List[Dict]

Parameters

A list of dictionaries, one per image, each dictionary containing two keys: segmentation and segments_info.

Converts the output of OneFormerForUniversalSegmentationOutput into image instance segmentation predictions. Only supports PyTorch.

post_process_panoptic_segmentation

( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 label_ids_to_fuse: set[int] | None = None target_sizes: list[tuple[int, int]] | None = None ) → list[Dict]

Parameters

A list of dictionaries, one per image, each dictionary containing two keys: segmentation and segments_info.

Converts the output of OneFormerForUniversalSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.
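At their core, instance and panoptic post-processing keep only the queries whose best non-null class score clears threshold and binarize their masks at mask_threshold. The sketch below illustrates just this filtering step on toy inputs; overlap filtering via overlap_mask_area_threshold and label fusion via label_ids_to_fuse are omitted.

```python
import numpy as np

def keep_and_binarize(class_probs, mask_probs,
                      threshold=0.5, mask_threshold=0.5):
    """Simplified sketch: the last class is treated as "no object";
    queries below `threshold` are dropped, kept masks are binarized."""
    scores = class_probs[:, :-1].max(axis=-1)   # best real-class score
    labels = class_probs[:, :-1].argmax(axis=-1)
    keep = scores > threshold                   # drop low-confidence queries
    masks = mask_probs[keep] >= mask_threshold  # binary masks for kept queries
    return labels[keep], scores[keep], masks

class_probs = np.array([[0.9, 0.05, 0.05],   # confident query, class 0
                        [0.2, 0.2, 0.6]])    # mostly "no object": dropped
mask_probs = np.array([[[0.8, 0.1], [0.9, 0.2]],
                       [[0.5, 0.5], [0.5, 0.5]]])
labels, scores, masks = keep_and_binarize(class_probs, mask_probs)
print(labels.tolist(), masks.astype(int).tolist())
```

Only the first query survives here; in the real method, the surviving binary masks are then resolved per pixel into the segmentation and segments_info outputs.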

OneFormerImageProcessorPil

class transformers.OneFormerImageProcessorPil

( **kwargs: typing_extensions.Unpack[transformers.models.oneformer.image_processing_pil_oneformer.OneFormerImageProcessorKwargs] )

Parameters

Constructs a OneFormerImageProcessorPil image processor.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] task_inputs: list[str] | None = None segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: list[dict[int, int]] | dict[int, int] | None = None **kwargs: typing_extensions.Unpack[transformers.models.oneformer.image_processing_pil_oneformer.OneFormerImageProcessorKwargs] ) → ~image_processing_base.BatchFeature

Parameters

Returns

~image_processing_base.BatchFeature

post_process_semantic_segmentation

( outputs target_sizes: list[tuple[int, int]] | None = None ) → List[torch.Tensor]

Parameters

Returns

List[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of OneFormerForUniversalSegmentation into semantic segmentation maps. Only supports PyTorch.

post_process_instance_segmentation

( outputs task_type: str = 'instance' is_demo: bool = True threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 target_sizes: list[tuple[int, int]] | None = None return_coco_annotation: bool | None = False ) → List[Dict]

Parameters

A list of dictionaries, one per image, each dictionary containing two keys: segmentation and segments_info.

Converts the output of OneFormerForUniversalSegmentationOutput into image instance segmentation predictions. Only supports PyTorch.

post_process_panoptic_segmentation

( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 label_ids_to_fuse: set[int] | None = None target_sizes: list[tuple[int, int]] | None = None ) → list[Dict]

Parameters

A list of dictionaries, one per image, each dictionary containing two keys: segmentation and segments_info.

Converts the output of OneFormerForUniversalSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.

OneFormerProcessor

class transformers.OneFormerProcessor

( image_processor = None tokenizer = None max_seq_length: int = 77 task_seq_length: int = 77 **kwargs )

Parameters

Constructs a OneFormerProcessor which wraps an image processor and a tokenizer into a single processor.

OneFormerProcessor offers all the functionalities of OneFormerImageProcessor and CLIPTokenizer. See OneFormerImageProcessor and CLIPTokenizer for more information.

__call__

( images = None task_inputs = None segmentation_maps = None **kwargs ) → BatchFeature

Parameters

A BatchFeature with the following fields:

OneFormerModel

class transformers.OneFormerModel

( config: OneFormerConfig )

Parameters

The bare OneFormer model, outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: Tensor task_inputs: Tensor text_inputs: torch.Tensor | None = None pixel_mask: torch.Tensor | None = None output_hidden_states: bool | None = None output_attentions: bool | None = None return_dict: bool | None = None **kwargs ) → OneFormerModelOutput or tuple(torch.FloatTensor)

Parameters

Returns

OneFormerModelOutput or tuple(torch.FloatTensor)

A OneFormerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (OneFormerConfig) and inputs.

The OneFormerModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

```python
import torch
from PIL import Image
import httpx
from io import BytesIO
from transformers import OneFormerProcessor, OneFormerModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
with httpx.stream("GET", url) as response:
    image = Image.open(BytesIO(response.read()))

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerModel.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
inputs = processor(image, ["semantic"], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

mask_predictions = outputs.transformer_decoder_mask_predictions
class_predictions = outputs.transformer_decoder_class_predictions

print(
    f"👉 Mask Predictions Shape: {list(mask_predictions.shape)}, "
    f"Class Predictions Shape: {list(class_predictions.shape)}"
)
# 👉 Mask Predictions Shape: [1, 150, 128, 171], Class Predictions Shape: [1, 150, 151]
```

OneFormerForUniversalSegmentation

class transformers.OneFormerForUniversalSegmentation

( config: OneFormerConfig )

Parameters

OneFormer Model for instance, semantic and panoptic image segmentation.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( pixel_values: Tensor task_inputs: Tensor text_inputs: torch.Tensor | None = None mask_labels: list[torch.Tensor] | None = None class_labels: list[torch.Tensor] | None = None pixel_mask: torch.Tensor | None = None output_auxiliary_logits: bool | None = None output_hidden_states: bool | None = None output_attentions: bool | None = None return_dict: bool | None = None **kwargs ) → OneFormerForUniversalSegmentationOutput or tuple(torch.FloatTensor)

Parameters

Returns

OneFormerForUniversalSegmentationOutput or tuple(torch.FloatTensor)

A OneFormerForUniversalSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (OneFormerConfig) and inputs.

The OneFormerForUniversalSegmentation forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

Universal segmentation example:

```python
import torch
from PIL import Image
import httpx
from io import BytesIO
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
with httpx.stream("GET", url) as response:
    image = Image.open(BytesIO(response.read()))

# Semantic segmentation
inputs = processor(image, ["semantic"], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

predicted_semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]
print(f"👉 Semantic Predictions Shape: {list(predicted_semantic_map.shape)}")
# 👉 Semantic Predictions Shape: [512, 683]
```

```python
# Instance segmentation
inputs = processor(image, ["instance"], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

predicted_instance_map = processor.post_process_instance_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]["segmentation"]
print(f"👉 Instance Predictions Shape: {list(predicted_instance_map.shape)}")
# 👉 Instance Predictions Shape: [512, 683]
```

```python
# Panoptic segmentation
inputs = processor(image, ["panoptic"], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

predicted_panoptic_map = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[(image.height, image.width)]
)[0]["segmentation"]
print(f"👉 Panoptic Predictions Shape: {list(predicted_panoptic_map.shape)}")
# 👉 Panoptic Predictions Shape: [512, 683]
```
