Stable Cascade

This model is built upon the Würstchen architecture. Its main difference from other models such as Stable Diffusion is that it works in a much smaller latent space. Why is this important? The smaller the latent space, the faster inference runs and the cheaper training becomes. How small is the latent space? Stable Diffusion uses a compression factor of 8, so a 1024x1024 image is encoded to 128x128. Stable Cascade achieves a compression factor of 42, so a 1024x1024 image can be encoded to 24x24 while maintaining crisp reconstructions. The text-conditional model is then trained in this highly compressed latent space. Previous versions of this architecture achieved a 16x cost reduction over Stable Diffusion 1.5.
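The latent sizes quoted above follow from simple division. The sketch below is illustrative only: the helper name `latent_side` is hypothetical, rounding up is an assumption, and the factor 42.67 is taken from the `resolution_multiple` default in the StableCascadePriorPipeline signature on this page (the prose rounds it to 42).

```python
import math


def latent_side(image_side: int, compression_factor: float) -> int:
    """Side length of the latent grid for a square image (assumes rounding up)."""
    return math.ceil(image_side / compression_factor)


sd_side = latent_side(1024, 8)           # Stable Diffusion: factor 8
cascade_side = latent_side(1024, 42.67)  # Stable Cascade: factor of roughly 42

print(sd_side, cascade_side)

# Ratio of spatial positions the text-conditional model has to process.
position_ratio = (sd_side ** 2) / (cascade_side ** 2)
print(round(position_ratio, 2))
```

The ratio of spatial positions gives a rough sense of why working in the smaller latent space is cheaper; it is not a measurement of actual compute cost.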

Therefore, this kind of model is well suited for use cases where efficiency is important. Furthermore, all known extensions such as finetuning, LoRA, ControlNet, IP-Adapter, and LCM are possible with this method as well.

The original codebase can be found at Stability-AI/StableCascade.

Model Overview

Stable Cascade consists of three models: Stage A, Stage B and Stage C, representing a cascade to generate images, hence the name “Stable Cascade”.

Stages A and B are used to compress images, similar to the role of the VAE in Stable Diffusion. However, this setup achieves a much higher compression. While the Stable Diffusion models use a spatial compression factor of 8, encoding a 1024 x 1024 image to 128 x 128, Stable Cascade achieves a compression factor of 42, encoding a 1024 x 1024 image to 24 x 24 while still decoding it accurately. This makes training and inference considerably cheaper. Stage C is responsible for generating the small 24 x 24 latents from a text prompt.

The Stage C model operates on the small 24 x 24 latents and denoises them conditioned on text prompts. It is also the largest component in the Cascade pipeline and is meant to be used with the StableCascadePriorPipeline.

The Stage B and Stage A models are used with the StableCascadeDecoderPipeline and are responsible for generating the final image given the small 24 x 24 latents.
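The path from the 24 x 24 latents back to a full image can be sketched as two size changes. This is a hedged illustration, not the library's implementation: the helper `decoder_sizes` is hypothetical, the value 10.67 is the `latent_dim_scale` default from the StableCascadeDecoderPipeline signature on this page, truncation with `int()` is an assumption, and the final factor of 4 for Stage A is an assumption chosen so that the result returns to 1024.

```python
def decoder_sizes(prior_side: int,
                  latent_dim_scale: float = 10.67,
                  stage_a_factor: int = 4) -> tuple:
    """Return (Stage B latent side, final image side) for a square input."""
    # Stage B works on a larger latent grid derived from the Stage C latents.
    stage_b_side = int(prior_side * latent_dim_scale)
    # Stage A (assumed 4x upsampling) maps Stage B latents to pixels.
    image_side = stage_b_side * stage_a_factor
    return stage_b_side, image_side


sizes = decoder_sizes(24)
print(sizes)
```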

There are some restrictions on data types that can be used with the Stable Cascade models. The official checkpoints for the StableCascadePriorPipeline do not support the torch.float16 data type. Please use torch.bfloat16 instead.

In order to use the torch.bfloat16 data type with the StableCascadeDecoderPipeline you need to have PyTorch 2.2.0 or higher installed. This also means that using the StableCascadeCombinedPipeline with torch.bfloat16 requires PyTorch 2.2.0 or higher, since it calls the StableCascadeDecoderPipeline internally.

If it is not possible to install PyTorch 2.2.0 or higher in your environment, the StableCascadeDecoderPipeline can be used on its own with the torch.float16 data type. You can download the full precision or bf16 variant weights for the pipeline and cast the weights to torch.float16.
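The data type rules above can be summarised as a small decision helper. This is a minimal sketch under stated assumptions: `choose_dtypes` is a hypothetical name, it takes the PyTorch version as a plain string (for example the value of `torch.__version__`), and it returns dtype names rather than dtype objects so that it runs without PyTorch installed.

```python
def choose_dtypes(torch_version: str) -> dict:
    """Pick dtype names for the prior and decoder from a PyTorch version string."""
    # Drop local build tags such as "+cu121", then keep "major.minor".
    release = torch_version.split("+")[0]
    major, minor = (int(part) for part in release.split(".")[:2])

    # The prior checkpoints do not support float16, so bfloat16 is always used.
    prior_dtype = "bfloat16"
    # The decoder needs PyTorch >= 2.2.0 for bfloat16; otherwise use float16.
    decoder_dtype = "bfloat16" if (major, minor) >= (2, 2) else "float16"
    return {"prior": prior_dtype, "decoder": decoder_dtype}


print(choose_dtypes("2.1.2+cu118"))
print(choose_dtypes("2.2.0"))
```

Each returned name can be turned into a real dtype with `getattr(torch, name)` before passing it as `torch_dtype`.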

Usage example

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16
)
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16
)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20,
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10,
).images[0]
decoder_output.save("cascade.png")

Using the Lite Versions of the Stage B and Stage C models

import torch
from diffusers import (
    StableCascadeDecoderPipeline,
    StableCascadePriorPipeline,
    StableCascadeUNet,
)

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade-prior", subfolder="prior_lite")
decoder_unet = StableCascadeUNet.from_pretrained("stabilityai/stable-cascade", subfolder="decoder_lite")

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", prior=prior_unet)
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", decoder=decoder_unet)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20,
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10,
).images[0]
decoder_output.save("cascade.png")

Loading original checkpoints with from_single_file

Loading checkpoints in the original format is supported via the from_single_file method of StableCascadeUNet.

import torch
from diffusers import (
    StableCascadeDecoderPipeline,
    StableCascadePriorPipeline,
    StableCascadeUNet,
)

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
negative_prompt = ""

prior_unet = StableCascadeUNet.from_single_file(
    "https://huggingface.co/stabilityai/stable-cascade/resolve/main/stage_c_bf16.safetensors",
    torch_dtype=torch.bfloat16,
)
decoder_unet = StableCascadeUNet.from_single_file(
    "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors",
    torch_dtype=torch.bfloat16,
)

prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", prior=prior_unet, torch_dtype=torch.bfloat16
)
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", decoder=decoder_unet, torch_dtype=torch.bfloat16
)

prior.enable_model_cpu_offload()
prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20,
)

decoder.enable_model_cpu_offload()
decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10,
).images[0]
decoder_output.save("cascade-single-file.png")
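The two checkpoint URLs in the example above use different path forms, `resolve/main` and `blob/main`. On the Hugging Face Hub, a `blob` URL is the file viewer page and a `resolve` URL serves the raw file. Whether a given loader accepts the `blob` form is not stated on this page, so the following sketch normalises a URL to the `resolve` form before use. The helper name `to_resolve_url` is hypothetical.

```python
def to_resolve_url(url: str) -> str:
    """Rewrite a Hub file-viewer URL (/blob/) to its raw-file form (/resolve/)."""
    # Only the first occurrence is replaced, so file names are left untouched.
    return url.replace("/blob/", "/resolve/", 1)


stage_b_url = to_resolve_url(
    "https://huggingface.co/stabilityai/stable-cascade/blob/main/stage_b_bf16.safetensors"
)
print(stage_b_url)
```

A URL that already uses `resolve` passes through unchanged.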

Uses

Direct Use

The model is intended for research purposes for now. Possible research areas and tasks include

Excluded uses are described below.

Out-of-Scope Use

The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. The model should not be used in any way that violates Stability AI’s Acceptable Use Policy.

Limitations and Bias

Limitations

StableCascadeCombinedPipeline

class diffusers.StableCascadeCombinedPipeline

(
    tokenizer: CLIPTokenizer
    text_encoder: CLIPTextModelWithProjection
    decoder: StableCascadeUNet
    scheduler: DDPMWuerstchenScheduler
    vqgan: PaellaVQModel
    prior_prior: StableCascadeUNet
    prior_text_encoder: CLIPTextModelWithProjection
    prior_tokenizer: CLIPTokenizer
    prior_scheduler: DDPMWuerstchenScheduler
    prior_feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] = None
    prior_image_encoder: typing.Optional[transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection] = None
)

Parameters

Combined Pipeline for text-to-image generation using Stable Cascade.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

__call__

(
    prompt: typing.Union[str, typing.List[str], NoneType] = None
    images: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] = None
    height: int = 512
    width: int = 512
    prior_num_inference_steps: int = 60
    prior_guidance_scale: float = 4.0
    num_inference_steps: int = 12
    decoder_guidance_scale: float = 0.0
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None
    prompt_embeds: typing.Optional[torch.Tensor] = None
    prompt_embeds_pooled: typing.Optional[torch.Tensor] = None
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None
    negative_prompt_embeds_pooled: typing.Optional[torch.Tensor] = None
    num_images_per_prompt: int = 1
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None
    latents: typing.Optional[torch.Tensor] = None
    output_type: typing.Optional[str] = 'pil'
    return_dict: bool = True
    prior_callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None
    prior_callback_on_step_end_tensor_inputs: typing.List[str] = ['latents']
    callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents']
)

Parameters

Function invoked when calling the pipeline for generation.

Examples:

import torch
from diffusers import StableCascadeCombinedPipeline

pipe = StableCascadeCombinedPipeline.from_pretrained(
    "stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
prompt = "an image of a shiba inu, donning a spacesuit and helmet"
images = pipe(prompt=prompt)
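The combined `__call__` signature carries two sets of sampling settings: `prior_num_inference_steps` and `prior_guidance_scale` for the prior stage, and `num_inference_steps` and `decoder_guidance_scale` for the decoder stage. The sketch below shows one plausible way such arguments could be routed to the two stages. It is not the library's implementation; `split_combined_kwargs` is a hypothetical helper, and only the parameter names and default values are taken from the signatures on this page.

```python
def split_combined_kwargs(prior_num_inference_steps: int = 60,
                          prior_guidance_scale: float = 4.0,
                          num_inference_steps: int = 12,
                          decoder_guidance_scale: float = 0.0) -> tuple:
    """Split combined-pipeline settings into per-stage keyword arguments."""
    # Settings prefixed with "prior_" map onto the prior pipeline's own names.
    prior_kwargs = {
        "num_inference_steps": prior_num_inference_steps,
        "guidance_scale": prior_guidance_scale,
    }
    # The unprefixed step count and the decoder guidance scale go to the decoder.
    decoder_kwargs = {
        "num_inference_steps": num_inference_steps,
        "guidance_scale": decoder_guidance_scale,
    }
    return prior_kwargs, decoder_kwargs


prior_kwargs, decoder_kwargs = split_combined_kwargs(prior_num_inference_steps=20)
print(prior_kwargs)
print(decoder_kwargs)
```

Note that the combined pipeline's defaults (60 prior steps, 12 decoder steps) differ from the standalone pipelines' defaults (20 and 10), so passing the values explicitly avoids surprises when switching between the two styles.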

enable_model_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = 'cuda' )

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forward method is called, and the model remains on the GPU until the next model runs. Memory savings are lower than with enable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.

enable_sequential_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = 'cuda' )

Offloads all models (unet, text_encoder, vae, and safety checker state dicts) to CPU using 🤗 Accelerate, significantly reducing memory usage. Models are moved to a torch.device('meta') and loaded on a GPU only when their specific submodule’s forward method is called. Offloading happens on a submodule basis. Memory savings are higher than using enable_model_cpu_offload, but performance is lower.

StableCascadePriorPipeline

class diffusers.StableCascadePriorPipeline

(
    tokenizer: CLIPTokenizer
    text_encoder: CLIPTextModelWithProjection
    prior: StableCascadeUNet
    scheduler: DDPMWuerstchenScheduler
    resolution_multiple: float = 42.67
    feature_extractor: typing.Optional[transformers.models.clip.image_processing_clip.CLIPImageProcessor] = None
    image_encoder: typing.Optional[transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection] = None
)

Parameters

Pipeline for generating image prior for Stable Cascade.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

__call__

(
    prompt: typing.Union[str, typing.List[str], NoneType] = None
    images: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] = None
    height: int = 1024
    width: int = 1024
    num_inference_steps: int = 20
    timesteps: typing.List[float] = None
    guidance_scale: float = 4.0
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None
    prompt_embeds: typing.Optional[torch.Tensor] = None
    prompt_embeds_pooled: typing.Optional[torch.Tensor] = None
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None
    negative_prompt_embeds_pooled: typing.Optional[torch.Tensor] = None
    image_embeds: typing.Optional[torch.Tensor] = None
    num_images_per_prompt: typing.Optional[int] = 1
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None
    latents: typing.Optional[torch.Tensor] = None
    output_type: typing.Optional[str] = 'pt'
    return_dict: bool = True
    callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents']
)

Parameters

Function invoked when calling the pipeline for generation.

Examples:

import torch
from diffusers import StableCascadePriorPipeline

prior_pipe = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
prior_output = prior_pipe(prompt)

StableCascadePriorPipelineOutput

class diffusers.pipelines.stable_cascade.pipeline_stable_cascade_prior.StableCascadePriorPipelineOutput

(
    image_embeddings: typing.Union[torch.Tensor, numpy.ndarray]
    prompt_embeds: typing.Union[torch.Tensor, numpy.ndarray]
    prompt_embeds_pooled: typing.Union[torch.Tensor, numpy.ndarray]
    negative_prompt_embeds: typing.Union[torch.Tensor, numpy.ndarray]
    negative_prompt_embeds_pooled: typing.Union[torch.Tensor, numpy.ndarray]
)

Parameters

Output class for StableCascadePriorPipeline.

StableCascadeDecoderPipeline

class diffusers.StableCascadeDecoderPipeline

(
    decoder: StableCascadeUNet
    tokenizer: CLIPTokenizer
    text_encoder: CLIPTextModelWithProjection
    scheduler: DDPMWuerstchenScheduler
    vqgan: PaellaVQModel
    latent_dim_scale: float = 10.67
)

Parameters

Pipeline for generating images from the Stable Cascade model.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

__call__

(
    image_embeddings: typing.Union[torch.Tensor, typing.List[torch.Tensor]]
    prompt: typing.Union[str, typing.List[str]] = None
    num_inference_steps: int = 10
    guidance_scale: float = 0.0
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None
    prompt_embeds: typing.Optional[torch.Tensor] = None
    prompt_embeds_pooled: typing.Optional[torch.Tensor] = None
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None
    negative_prompt_embeds_pooled: typing.Optional[torch.Tensor] = None
    num_images_per_prompt: int = 1
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None
    latents: typing.Optional[torch.Tensor] = None
    output_type: typing.Optional[str] = 'pil'
    return_dict: bool = True
    callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents']
)

Parameters

Function invoked when calling the pipeline for generation.

Examples:

import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prior_pipe = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
).to("cuda")
gen_pipe = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.float16
).to("cuda")

prompt = "an image of a shiba inu, donning a spacesuit and helmet"
prior_output = prior_pipe(prompt)
# The prior runs in bfloat16 and the decoder in float16, so cast the embeddings
# before handing them over (as in the usage example at the top of this page).
images = gen_pipe(prior_output.image_embeddings.to(torch.float16), prompt=prompt)
