SAM (original) (raw)

Overview

SAM (Segment Anything Model) was proposed in Segment Anything by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.

The model can be used to predict segmentation masks of any object of interest given an input image.

example image

The abstract from the paper is the following:

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive β€” often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

Tips:

This model was contributed by ybelkada and ArthurZ. The original code can be found here.

Below is an example on how to run mask generation given an image and a 2D point:

import torch from PIL import Image import requests from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu" model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device) processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png" raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB") input_points = [[[450, 600]]]

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device) with torch.no_grad(): outputs = model(**inputs)

masks = processor.image_processor.post_process_masks( outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu() ) scores = outputs.iou_scores

You can also process your own masks alongside the input images in the processor to be passed to the model.

import torch from PIL import Image import requests from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu" model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device) processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")

img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png" raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB") mask_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png" segmentation_map = Image.open(requests.get(mask_url, stream=True).raw).convert("1") input_points = [[[450, 600]]]

inputs = processor(raw_image, input_points=input_points, segmentation_maps=segmentation_map, return_tensors="pt").to(device) with torch.no_grad(): outputs = model(**inputs)

masks = processor.image_processor.post_process_masks( outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu() ) scores = outputs.iou_scores

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SAM.

SlimSAM

SlimSAM, a pruned version of SAM, was proposed in 0.1% Data Makes Segment Anything Slim by Zigeng Chen et al. SlimSAM reduces the size of the SAM models considerably while maintaining the same performance.

Checkpoints can be found on the hub, and they can be used as a drop-in replacement of SAM.

Grounded SAM

One can combine Grounding DINO with SAM for text-based mask generation as introduced in Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. You can refer to this demo notebook 🌍 for details.

drawing Grounded SAM overview. Taken from the original repository.

SamConfig

class transformers.SamConfig

< source >

( vision_config = None prompt_encoder_config = None mask_decoder_config = None initializer_range = 0.02 **kwargs )

Parameters

SamConfig is the configuration class to store the configuration of a SamModel. It is used to instantiate a SAM model according to the specified arguments, defining the vision model, prompt-encoder model and mask decoder configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM-ViT-H facebook/sam-vit-huge architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import ( ... SamVisionConfig, ... SamPromptEncoderConfig, ... SamMaskDecoderConfig, ... SamModel, ... )

configuration = SamConfig()

model = SamModel(configuration)

configuration = model.config

vision_config = SamVisionConfig() prompt_encoder_config = SamPromptEncoderConfig() mask_decoder_config = SamMaskDecoderConfig()

config = SamConfig(vision_config, prompt_encoder_config, mask_decoder_config)

SamVisionConfig

class transformers.SamVisionConfig

< source >

( hidden_size = 768 output_channels = 256 num_hidden_layers = 12 num_attention_heads = 12 num_channels = 3 image_size = 1024 patch_size = 16 hidden_act = 'gelu' layer_norm_eps = 1e-06 attention_dropout = 0.0 initializer_range = 1e-10 qkv_bias = True mlp_ratio = 4.0 use_abs_pos = True use_rel_pos = True window_size = 14 global_attn_indexes = [2, 5, 8, 11] num_pos_feats = 128 mlp_dim = None **kwargs )

Parameters

This is the configuration class to store the configuration of a SamVisionModel. It is used to instantiate a SAM vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration defaults will yield a similar configuration to that of the SAM ViT-hfacebook/sam-vit-huge architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

SamMaskDecoderConfig

class transformers.SamMaskDecoderConfig

< source >

( hidden_size = 256 hidden_act = 'relu' mlp_dim = 2048 num_hidden_layers = 2 num_attention_heads = 8 attention_downsample_rate = 2 num_multimask_outputs = 3 iou_head_depth = 3 iou_head_hidden_dim = 256 layer_norm_eps = 1e-06 **kwargs )

Parameters

This is the configuration class to store the configuration of a SamMaskDecoder. It is used to instantiate a SAM mask decoder to the specified arguments, defining the model architecture. Instantiating a configuration defaults will yield a similar configuration to that of the SAM-vit-hfacebook/sam-vit-huge architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

SamPromptEncoderConfig

class transformers.SamPromptEncoderConfig

< source >

( hidden_size = 256 image_size = 1024 patch_size = 16 mask_input_channels = 16 num_point_embeddings = 4 hidden_act = 'gelu' layer_norm_eps = 1e-06 **kwargs )

Parameters

This is the configuration class to store the configuration of a SamPromptEncoder. The SamPromptEncodermodule is used to encode the input 2D points and bounding boxes. Instantiating a configuration defaults will yield a similar configuration to that of the SAM-vit-hfacebook/sam-vit-huge architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

SamProcessor

class transformers.SamProcessor

< source >

( image_processor )

Parameters

Constructs a SAM processor which wraps a SAM image processor and an 2D points & Bounding boxes processor into a single processor.

SamProcessor offers all the functionalities of SamImageProcessor. See the docstring ofcall() for more information.

SamImageProcessor

class transformers.SamImageProcessor

< source >

( do_resize: bool = True size: Dict = None mask_size: Dict = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: Union = 0.00392156862745098 do_normalize: bool = True image_mean: Union = None image_std: Union = None do_pad: bool = True pad_size: int = None mask_pad_size: int = None do_convert_rgb: bool = True **kwargs )

Parameters

Constructs a SAM image processor.

filter_masks

< source >

( masks iou_scores original_size cropped_box_image pred_iou_thresh = 0.88 stability_score_thresh = 0.95 mask_threshold = 0 stability_score_offset = 1 return_tensors = 'pt' )

Parameters

Filters the predicted masks by selecting only the ones that meets several criteria. The first criterion being that the iou scores needs to be greater than pred_iou_thresh. The second criterion is that the stability score needs to be greater than stability_score_thresh. The method also converts the predicted masks to bounding boxes and pad the predicted masks if necessary.

generate_crop_boxes

< source >

( image target_size crop_n_layers: int = 0 overlap_ratio: float = 0.3413333333333333 points_per_crop: Optional = 32 crop_n_points_downscale_factor: Optional = 1 device: Optional = None input_data_format: Union = None return_tensors: str = 'pt' )

Parameters

Generates a list of crop boxes of different sizes. Each layer has (2**i)**2 boxes for the ith layer.

pad_image

< source >

( image: ndarray pad_size: Dict data_format: Union = None input_data_format: Union = None **kwargs )

Parameters

Pad an image to (pad_size["height"], pad_size["width"]) with zeros to the right and bottom.

post_process_for_mask_generation

< source >

( all_masks all_scores all_boxes crops_nms_thresh return_tensors = 'pt' )

Parameters

Post processes mask that are generated by calling the Non Maximum Suppression algorithm on the predicted masks.

post_process_masks

< source >

( masks original_sizes reshaped_input_sizes mask_threshold = 0.0 binarize = True pad_size = None return_tensors = 'pt' ) β†’ (Union[torch.Tensor, tf.Tensor])

Parameters

Returns

(Union[torch.Tensor, tf.Tensor])

Batched masks in batch_size, num_channels, height, width) format, where (height, width) is given by original_size.

Remove padding and upscale masks to the original image size.

preprocess

< source >

( images: Union segmentation_maps: Union = None do_resize: Optional = None size: Optional = None mask_size: Optional = None resample: Optional = None do_rescale: Optional = None rescale_factor: Union = None do_normalize: Optional = None image_mean: Union = None image_std: Union = None do_pad: Optional = None pad_size: Optional = None mask_pad_size: Optional = None do_convert_rgb: Optional = None return_tensors: Union = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None )

Parameters

Preprocess an image or batch of images.

resize

< source >

( image: ndarray size: Dict resample: Resampling = <Resampling.BICUBIC: 3> data_format: Union = None input_data_format: Union = None **kwargs ) β†’ np.ndarray

Parameters

The resized image.

Resize an image to (size["height"], size["width"]).

SamModel

class transformers.SamModel

< source >

( config )

Parameters

Segment Anything Model (SAM) for generating segmentation masks, given an input image and optional 2D location and bounding boxes. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( pixel_values: Optional = None input_points: Optional = None input_labels: Optional = None input_boxes: Optional = None input_masks: Optional = None image_embeddings: Optional = None multimask_output: bool = True attention_similarity: Optional = None target_embedding: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None **kwargs )

Parameters

The SamModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

TFSamModel

class transformers.TFSamModel

< source >

( config **kwargs )

Parameters

Segment Anything Model (SAM) for generating segmentation masks, given an input image and optional 2D location and bounding boxes. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a TensorFlow keras.Modelsubclass. Use it as a regular TensorFlow Model and refer to the TensorFlow documentation for all matter related to general usage and behavior.

call

< source >

( pixel_values: TFModelInputType | None = None input_points: tf.Tensor | None = None input_labels: tf.Tensor | None = None input_boxes: tf.Tensor | None = None input_masks: tf.Tensor | None = None image_embeddings: tf.Tensor | None = None multimask_output: bool = True output_attentions: bool | None = None output_hidden_states: bool | None = None return_dict: bool | None = None training: bool = False **kwargs )

Parameters

The TFSamModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

< > Update on GitHub