GitHub - modelscope/DiffSynth-Studio: Enjoy the magic of Diffusion models!

Introduction

DiffSynth-Studio Documentation: Chinese version | English version

Welcome to the magical world of Diffusion models! DiffSynth-Studio is an open-source Diffusion model engine developed and maintained by the ModelScope Community. We hope to foster technological innovation through framework construction, aggregate the power of the open-source community, and explore the boundaries of generative model technology!

DiffSynth currently includes two open-source projects:

DiffSynth-Studio and DiffSynth-Engine are the core engines of the ModelScope AIGC zone. Welcome to experience our carefully crafted productized features:

We believe that a well-developed open-source framework lowers the threshold for technical exploration. We have built many interesting techniques on this codebase. If you have ideas of your own, DiffSynth-Studio can help you realize them quickly. For this reason, we have prepared detailed documentation for developers. We hope these documents help developers understand the principles of Diffusion models, and we look forward to expanding the boundaries of technology together with you.

Update History

DiffSynth-Studio has undergone major version updates, and some old features are no longer maintained. If you need to use old features, please switch to the last historical version before the major version update.

Currently, this project has few developers, and most of the work is handled by Artiprocher and mi804. As a result, new features will progress relatively slowly, and our capacity to respond to and resolve issues is limited. We apologize for this and ask for developers' understanding.

Installation

Install from source (recommended):

git clone https://github.com/modelscope/DiffSynth-Studio.git  
cd DiffSynth-Studio
pip install -e .

For more installation methods and instructions for non-NVIDIA GPUs, please refer to the Installation Guide.

Basic Framework

DiffSynth-Studio redesigns the inference and training pipelines for mainstream Diffusion models (including FLUX, Wan, etc.), enabling efficient memory management and flexible model training.

Environment Variable Configuration

Before running model inference or training, you can configure settings such as the model download source via environment variables.

By default, this project downloads models from ModelScope. Users outside China can download models from the ModelScope international site by setting the following environment variable:

import os
os.environ["MODELSCOPE_DOMAIN"] = "www.modelscope.ai"

To download models from other sources, please modify the environment variable DIFFSYNTH_DOWNLOAD_SOURCE.
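As a minimal sketch of the configuration step described above, the snippet below sets the download domain only when it has not already been exported in the shell, then reports the two variables this README names. The helper `download_settings` is hypothetical and not part of DiffSynth-Studio; no value is assumed for `DIFFSYNTH_DOWNLOAD_SOURCE`, since its accepted values are not listed here.

```python
import os

# Values already exported in the shell take precedence; setdefault only fills gaps.
os.environ.setdefault("MODELSCOPE_DOMAIN", "www.modelscope.ai")


def download_settings(env=None):
    """Return the download-related variables named in this README (hypothetical helper)."""
    env = os.environ if env is None else env
    names = ("MODELSCOPE_DOMAIN", "DIFFSYNTH_DOWNLOAD_SOURCE")
    # Unset variables are reported as None rather than raising KeyError.
    return {name: env.get(name) for name in names}


print(download_settings())
```

Run this before any inference or training code so the settings are in place when a download is triggered.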

Image Synthesis

Z-Image: /docs/en/Model_Details/Z-Image.md

Quick Start

Running the following code will quickly load the Tongyi-MAI/Z-Image-Turbo model for inference. FP8 quantization significantly degrades image quality, so we do not recommend enabling any quantization for the Z-Image Turbo model. CPU offloading is recommended, and the model can run with as little as 8 GB of GPU memory.

from diffsynth.pipelines.z_image import ZImagePipeline, ModelConfig
import torch

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = ZImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="transformer/*.safetensors", **vram_config),
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="tokenizer/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."
image = pipe(prompt=prompt, seed=42, rand_device="cuda")
image.save("image.jpg")
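The `vram_limit` expression used in the quick-start code is plain arithmetic: `torch.cuda.mem_get_info` returns a `(free, total)` pair in bytes, index `[1]` selects the total, dividing by `1024 ** 3` converts it to GiB, and the subtraction leaves some headroom. The sketch below reproduces that calculation without PyTorch; `vram_limit_gib` is a hypothetical helper name, not part of DiffSynth-Studio.

```python
GIB = 1024 ** 3  # bytes per GiB


def vram_limit_gib(total_bytes, reserve_gib=0.5):
    """Total device memory in GiB minus a fixed reserve (hypothetical helper)."""
    return total_bytes / GIB - reserve_gib


# A card reporting exactly 24 GiB of total memory:
print(vram_limit_gib(24 * GIB))  # 23.5
```

The FLUX.1 example further down reserves 1 GiB instead of 0.5, which corresponds to `reserve_gib=1` here.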

Examples

Example code for Z-Image is available at: /examples/z_image/

| Model ID | Inference | Low VRAM Inference | Full Training | Validation After Full Training | LoRA Training | Validation After LoRA Training |
|---|---|---|---|---|---|---|
| Tongyi-MAI/Z-Image | code | code | code | code | code | code |
| DiffSynth-Studio/Z-Image-i2L | code | code | - | - | - | - |
| Tongyi-MAI/Z-Image-Turbo | code | code | code | code | code | code |
| PAI/Z-Image-Turbo-Fun-Controlnet-Union-2.1 | code | code | code | code | code | code |
| PAI/Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps | code | code | code | code | code | code |
| PAI/Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps | code | code | code | code | code | code |
| DiffSynth-Studio/ZImage-i2L-v2 | code | code | code | code | - | - |

Stable Diffusion: /docs/en/Model_Details/Stable-Diffusion.md

Quick Start

Running the following code will quickly load the AI-ModelScope/stable-diffusion-v1-5 model for inference. VRAM management is enabled, and the framework automatically controls parameter loading based on available VRAM. The model can run with as little as 2 GB of VRAM.

import torch
from diffsynth.core import ModelConfig
from diffsynth.pipelines.stable_diffusion import StableDiffusionPipeline

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": torch.float32,
    "offload_device": "cpu",
    "onload_dtype": torch.float32,
    "onload_device": "cpu",
    "preparing_dtype": torch.float32,
    "preparing_device": "cuda",
    "computation_dtype": torch.float32,
    "computation_device": "cuda",
}
pipe = StableDiffusionPipeline.from_pretrained(
    torch_dtype=torch.float32,
    model_configs=[
        ModelConfig(model_id="AI-ModelScope/stable-diffusion-v1-5", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="AI-ModelScope/stable-diffusion-v1-5", origin_file_pattern="unet/diffusion_pytorch_model.safetensors", **vram_config),
        ModelConfig(model_id="AI-ModelScope/stable-diffusion-v1-5", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="AI-ModelScope/stable-diffusion-v1-5", origin_file_pattern="tokenizer/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars, high quality, detailed",
    negative_prompt="blurry, low quality, deformed",
    cfg_scale=7.5,
    height=512,
    width=512,
    seed=42,
    rand_device="cuda",
    num_inference_steps=50,
)
image.save("image.jpg")

Examples

Example code for Stable Diffusion is available at: /examples/stable_diffusion/

| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|---|---|---|---|---|---|---|
| AI-ModelScope/stable-diffusion-v1-5 | code | code | code | code | code | code |

Stable Diffusion XL: /docs/en/Model_Details/Stable-Diffusion-XL.md

Quick Start

Running the following code will quickly load the stabilityai/stable-diffusion-xl-base-1.0 model for inference. VRAM management is enabled, and the framework automatically controls parameter loading based on available VRAM. The model can run with as little as 6 GB of VRAM.

import torch
from diffsynth.core import ModelConfig
from diffsynth.pipelines.stable_diffusion_xl import StableDiffusionXLPipeline

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": torch.float32,
    "offload_device": "cpu",
    "onload_dtype": torch.float32,
    "onload_device": "cpu",
    "preparing_dtype": torch.float32,
    "preparing_device": "cuda",
    "computation_dtype": torch.float32,
    "computation_device": "cuda",
}
pipe = StableDiffusionXLPipeline.from_pretrained(
    torch_dtype=torch.float32,
    model_configs=[
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="text_encoder_2/model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="unet/diffusion_pytorch_model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="tokenizer/"),
    tokenizer_2_config=ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="tokenizer_2/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    negative_prompt="",
    cfg_scale=5.0,
    height=1024,
    width=1024,
    seed=42,
    num_inference_steps=50,
)
image.save("image.jpg")

Examples

Example code for Stable Diffusion XL is available at: /examples/stable_diffusion_xl/

| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|---|---|---|---|---|---|---|
| stabilityai/stable-diffusion-xl-base-1.0 | code | code | code | code | code | code |

FLUX.2: /docs/en/Model_Details/FLUX2.md

Quick Start

Running the following code will quickly load the black-forest-labs/FLUX.2-dev model for inference. VRAM management is enabled, and the framework automatically loads model parameters based on available GPU memory. The model can run with as little as 10 GB of VRAM.

from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
import torch

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = Flux2ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="transformer/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "High resolution. A dreamy underwater portrait of a serene young woman in a flowing blue dress. Her hair floats softly around her face, strands delicately suspended in the water. Clear, shimmering light filters through, casting gentle highlights, while tiny bubbles rise around her. Her expression is calm, her features finely detailed—creating a tranquil, ethereal scene."
image = pipe(prompt, seed=42, rand_device="cuda", num_inference_steps=50)
image.save("image.jpg")
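Every quick-start example in this README builds the same eight-key `vram_config` dictionary: a dtype and a device for each of four stages (offload, onload, preparing, computation). The sketch below shows one way to assemble it from four pairs. `make_vram_config` is a hypothetical helper, not part of DiffSynth-Studio, and plain strings stand in for torch dtypes so the snippet runs without PyTorch.

```python
STAGES = ("offload", "onload", "preparing", "computation")


def make_vram_config(offload, onload, preparing, computation):
    """Build the eight-key dict from four (dtype, device) pairs (hypothetical helper)."""
    config = {}
    for stage, (dtype, device) in zip(STAGES, (offload, onload, preparing, computation)):
        config[f"{stage}_dtype"] = dtype
        config[f"{stage}_device"] = device
    return config


# Strings stand in for torch dtypes so this sketch runs without PyTorch;
# real code would pass torch.float8_e4m3fn and torch.bfloat16 instead.
config = make_vram_config(
    offload=("disk", "disk"),
    onload=("float8_e4m3fn", "cpu"),
    preparing=("float8_e4m3fn", "cuda"),
    computation=("bfloat16", "cuda"),
)
print(config)
```

The resulting dictionary can be unpacked into `ModelConfig` with `**config`, in the same way the examples unpack `vram_config`.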

Examples

Example code for FLUX.2 is available at: /examples/flux2/

| Model ID | Inference | Low-VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|---|---|---|---|---|---|---|
| black-forest-labs/FLUX.2-dev | code | code | - | - | code | code |
| black-forest-labs/FLUX.2-klein-4B | code | code | code | code | code | code |
| black-forest-labs/FLUX.2-klein-9B | code | code | code | code | code | code |
| black-forest-labs/FLUX.2-klein-base-4B | code | code | code | code | code | code |
| black-forest-labs/FLUX.2-klein-base-9B | code | code | code | code | code | code |
| DiffSynth-Studio/Template-KleinBase4B-Aesthetic | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-Brightness | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-Age | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-ControlNet | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-Edit | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-Inpaint | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-PandaMeme | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-Sharpness | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-SoftRGB | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-Upscaler | code | code | code | code | - | - |
| DiffSynth-Studio/Template-KleinBase4B-ContentRef | code | code | code | code | - | - |
| DiffSynth-Studio/KleinBase4B-i2L-v2 | code | code | code | code | - | - |

Anima: /docs/en/Model_Details/Anima.md

Quick Start

Run the following code to quickly load the circlestone-labs/Anima model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 8GB VRAM.

from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": "disk",
    "onload_device": "disk",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = AnimaImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors", **vram_config),
        ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors", **vram_config),
        ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
    tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
image = pipe(prompt, seed=0, num_inference_steps=50)
image.save("image.jpg")

Examples

Example code for Anima is located at: /examples/anima/

| Model ID | Inference | Low VRAM Inference | Full Training | Validation after Full Training | LoRA Training | Validation after LoRA Training |
|---|---|---|---|---|---|---|
| circlestone-labs/Anima | code | code | code | code | code | code |

Qwen-Image: /docs/en/Model_Details/Qwen-Image.md

Quick Start

Running the following code will quickly load the Qwen/Qwen-Image model for inference. VRAM management is enabled, and the framework automatically adjusts model parameter loading based on available GPU memory. The model can run with as little as 8 GB of VRAM.

from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
# Prompt in English: "Exquisite portrait, underwater girl, flowing blue dress, gently floating hair,
# clear light and shadow, surrounded by bubbles, serene face, fine details, dreamy and beautiful."
prompt = "精致肖像,水下少女,蓝裙飘逸,发丝轻扬,光影透澈,气泡环绕,面容恬静,细节精致,梦幻唯美。"
image = pipe(prompt, seed=0, num_inference_steps=40)
image.save("image.jpg")

Model Lineage

graph LR;
    Qwen/Qwen-Image-->Qwen/Qwen-Image-Edit;
    Qwen/Qwen-Image-Edit-->Qwen/Qwen-Image-Edit-2509;
    Qwen/Qwen-Image-->EliGen-Series;
    EliGen-Series-->DiffSynth-Studio/Qwen-Image-EliGen;
    DiffSynth-Studio/Qwen-Image-EliGen-->DiffSynth-Studio/Qwen-Image-EliGen-V2;
    EliGen-Series-->DiffSynth-Studio/Qwen-Image-EliGen-Poster;
    Qwen/Qwen-Image-->Distill-Series;
    Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-Full;
    Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-LoRA;
    Qwen/Qwen-Image-->ControlNet-Series;
    ControlNet-Series-->Blockwise-ControlNet-Series;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint;
    ControlNet-Series-->DiffSynth-Studio/Qwen-Image-In-Context-Control-Union;
    Qwen/Qwen-Image-->DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix;
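The lineage graph is a list of parent-to-child edges, so it can be read as an adjacency list. The sketch below parses a few of the edges shown above and lists everything derived from a given model. Both function names are illustrative, and the parser assumes only the simple `A-->B;` statements that appear in this graph, not Mermaid syntax in general.

```python
from collections import defaultdict

# A subset of the edges from the Qwen-Image lineage graph.
MERMAID = """graph LR;
Qwen/Qwen-Image-->Qwen/Qwen-Image-Edit;
Qwen/Qwen-Image-Edit-->Qwen/Qwen-Image-Edit-2509;
Qwen/Qwen-Image-->Distill-Series;
Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-Full;
Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-LoRA;
"""


def parse_edges(mermaid):
    """Map each parent node to the list of its direct children."""
    children = defaultdict(list)
    for statement in mermaid.split(";"):
        statement = statement.strip()
        if "-->" not in statement:
            continue  # skips the "graph LR" header and empty fragments
        parent, child = statement.split("-->", 1)
        children[parent.strip()].append(child.strip())
    return children


def descendants(children, root):
    """Collect every node reachable from root, excluding root itself."""
    seen, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.append(child)
                stack.append(child)
    return seen


children = parse_edges(MERMAID)
print(descendants(children, "Qwen/Qwen-Image-Edit"))
```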

Examples

Example code for Qwen-Image is available at: /examples/qwen_image/

| Model ID | Inference | Low-VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|---|---|---|---|---|---|---|
| Qwen/Qwen-Image | code | code | code | code | code | code |
| Qwen/Qwen-Image-2512 | code | code | code | code | code | code |
| Qwen/Qwen-Image-Edit | code | code | code | code | code | code |
| Qwen/Qwen-Image-Edit-2509 | code | code | code | code | code | code |
| Qwen/Qwen-Image-Edit-2511 | code | code | code | code | code | code |
| FireRedTeam/FireRed-Image-Edit-1.0 | code | code | code | code | code | code |
| FireRedTeam/FireRed-Image-Edit-1.1 | code | code | code | code | code | code |
| lightx2v/Qwen-Image-Edit-2511-Lightning | code | code | - | - | - | - |
| Qwen/Qwen-Image-Layered | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Layered-Control | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Layered-Control-V2 | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen-V2 | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-EliGen-Poster | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Distill-Full | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Distill-LoRA | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint | code | code | code | code | code | code |
| DiffSynth-Studio/Qwen-Image-In-Context-Control-Union | code | code | - | - | code | code |
| DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix | code | code | - | - | - | - |
| DiffSynth-Studio/Qwen-Image-i2L | code | code | - | - | - | - |

FLUX.1: /docs/en/Model_Details/FLUX.md

Quick Start

Running the following code will quickly load the black-forest-labs/FLUX.1-dev model for inference. VRAM management is enabled, and the framework automatically adjusts model parameter loading based on available GPU memory. The model can run with as little as 8 GB of VRAM.

import torch
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": torch.float8_e4m3fn,
    "offload_device": "cpu",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
    ],
    # Total GPU memory in GiB, minus 1 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 1,
)
prompt = "CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
image = pipe(prompt=prompt, seed=0)
image.save("image.jpg")

Model Lineage

graph LR;
    FLUX.1-Series-->black-forest-labs/FLUX.1-dev;
    FLUX.1-Series-->black-forest-labs/FLUX.1-Krea-dev;
    FLUX.1-Series-->black-forest-labs/FLUX.1-Kontext-dev;
    black-forest-labs/FLUX.1-dev-->FLUX.1-dev-ControlNet-Series;
    FLUX.1-dev-ControlNet-Series-->alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta;
    FLUX.1-dev-ControlNet-Series-->InstantX/FLUX.1-dev-Controlnet-Union-alpha;
    FLUX.1-dev-ControlNet-Series-->jasperai/Flux.1-dev-Controlnet-Upscaler;
    black-forest-labs/FLUX.1-dev-->InstantX/FLUX.1-dev-IP-Adapter;
    black-forest-labs/FLUX.1-dev-->ByteDance/InfiniteYou;
    black-forest-labs/FLUX.1-dev-->DiffSynth-Studio/Eligen;
    black-forest-labs/FLUX.1-dev-->DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev;
    black-forest-labs/FLUX.1-dev-->DiffSynth-Studio/LoRAFusion-preview-FLUX.1-dev;
    black-forest-labs/FLUX.1-dev-->ostris/Flex.2-preview;
    black-forest-labs/FLUX.1-dev-->stepfun-ai/Step1X-Edit;
    Qwen/Qwen2.5-VL-7B-Instruct-->stepfun-ai/Step1X-Edit;
    black-forest-labs/FLUX.1-dev-->DiffSynth-Studio/Nexus-GenV2;
    Qwen/Qwen2.5-VL-7B-Instruct-->DiffSynth-Studio/Nexus-GenV2;

Examples

Example code for FLUX.1 is available at: /examples/flux/

| Model ID | Extra Args | Inference | Low-VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|---|---|---|---|---|---|---|---|
| black-forest-labs/FLUX.1-dev | | code | code | code | code | code | code |
| black-forest-labs/FLUX.1-Krea-dev | | code | code | code | code | code | code |
| black-forest-labs/FLUX.1-Kontext-dev | kontext_images | code | code | code | code | code | code |
| alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta | controlnet_inputs | code | code | code | code | code | code |
| InstantX/FLUX.1-dev-Controlnet-Union-alpha | controlnet_inputs | code | code | code | code | code | code |
| jasperai/Flux.1-dev-Controlnet-Upscaler | controlnet_inputs | code | code | code | code | code | code |
| InstantX/FLUX.1-dev-IP-Adapter | ipadapter_images, ipadapter_scale | code | code | code | code | code | code |
| ByteDance/InfiniteYou | infinityou_id_image, infinityou_guidance, controlnet_inputs | code | code | code | code | code | code |
| DiffSynth-Studio/Eligen | eligen_entity_prompts, eligen_entity_masks, eligen_enable_on_negative, eligen_enable_inpaint | code | code | - | - | code | code |
| DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev | lora_encoder_inputs, lora_encoder_scale | code | code | code | code | - | - |
| DiffSynth-Studio/LoRAFusion-preview-FLUX.1-dev | | code | - | - | - | - | - |
| stepfun-ai/Step1X-Edit | step1x_reference_image | code | code | code | code | code | code |
| ostris/Flex.2-preview | flex_inpaint_image, flex_inpaint_mask, flex_control_image, flex_control_strength, flex_control_stop | code | code | code | code | code | code |
| DiffSynth-Studio/Nexus-GenV2 | nexus_gen_reference_image | code | code | code | code | code | code |

ERNIE-Image: /docs/en/Model_Details/ERNIE-Image.md

Quick Start

Running the following code will quickly load the PaddlePaddle/ERNIE-Image model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 3GB VRAM.

from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
import torch

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = ErnieImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device='cuda',
    model_configs=[
        ModelConfig(model_id="PaddlePaddle/ERNIE-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="PaddlePaddle/ERNIE-Image", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="PaddlePaddle/ERNIE-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="PaddlePaddle/ERNIE-Image", origin_file_pattern="tokenizer/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

image = pipe(
    # Prompt in English: "A black-and-white Chinese rural dog"
    prompt="一只黑白相间的中华田园犬",
    negative_prompt="",
    height=1024,
    width=1024,
    seed=42,
    num_inference_steps=50,
    cfg_scale=4.0,
)
image.save("output.jpg")

Examples

Example code for ERNIE-Image is available at: /examples/ernie_image/

| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|---|---|---|---|---|---|---|
| PaddlePaddle/ERNIE-Image | code | code | code | code | code | code |
| PaddlePaddle/ERNIE-Image-Turbo | code | code | | | | |

JoyAI-Image: /docs/en/Model_Details/JoyAI-Image.md

Quick Start

Running the following code will quickly load the jd-opensource/JoyAI-Image-Edit model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 4GB VRAM.

from diffsynth.pipelines.joyai_image import JoyAIImagePipeline, ModelConfig
import torch
from PIL import Image
from modelscope import dataset_snapshot_download

# Download dataset
dataset_snapshot_download(
    dataset_id="DiffSynth-Studio/diffsynth_example_dataset",
    local_dir="data/diffsynth_example_dataset",
    allow_file_pattern="joyai_image/JoyAI-Image-Edit/*"
)

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

pipe = JoyAIImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="transformer/transformer.pth", **vram_config),
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="JoyAI-Image-Und/model*.safetensors", **vram_config),
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="vae/Wan2.1_VAE.pth", **vram_config),
    ],
    processor_config=ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="JoyAI-Image-Und/"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Use first sample from dataset
dataset_base_path = "data/diffsynth_example_dataset/joyai_image/JoyAI-Image-Edit"
# Prompt in English: "Change the dress to pink"
prompt = "将裙子改为粉色"
edit_image = Image.open(f"{dataset_base_path}/edit/image1.jpg").convert("RGB")

output = pipe(
    prompt=prompt,
    edit_image=edit_image,
    height=1024,
    width=1024,
    seed=0,
    num_inference_steps=30,
    cfg_scale=5.0,
)

output.save("output_joyai_edit_low_vram.png")

Examples

Example code for JoyAI-Image is available at: /examples/joyai_image/

| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|---|---|---|---|---|---|---|
| jd-opensource/JoyAI-Image-Edit | code | code | code | code | code | code |

HiDream-O1-Image: /docs/en/Model_Details/HiDream-O1-Image.md

Quick Start

Running the following code will quickly load the HiDream-ai/HiDream-O1-Image model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 3GB VRAM.

from diffsynth.pipelines.hidream_o1_image import HiDreamO1ImagePipeline
from diffsynth.core.loader.config import ModelConfig
import torch

# VRAM management settings: dtype and device for each stage of parameter handling.
vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

pipe = HiDreamO1ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="model-*.safetensors", **vram_config),
    ],
    processor_config=ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="./"),
    # Total GPU memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
image = pipe(
    prompt="medium shot, eye-level, front view. A woman is seated in an ornate bedroom, illuminated by candlelight, with a calm and composed expression. The subject is a young woman with fair skin, light brown hair styled in an updo with loose tendrils framing her face, and blue eyes. She wears a cream-colored satin robe with delicate floral embroidery and lace trim along the neckline. Her ears are adorned with pearl drop earrings. She is seated on a bed with a dark, intricately carved wooden headboard. To her left, a wooden nightstand holds three lit white candles and a candelabra with multiple lit candles in the background. The bed is covered with patterned pillows and a dark, textured blanket. The walls are paneled with dark wood and feature a large, ornate tapestry with muted earth tones. The lighting creates soft highlights on her face and robe, with warm shadows cast across the room.",
    negative_prompt=" ",
    cfg_scale=4.0,
    height=2048,
    width=2048,
    seed=42,
    num_inference_steps=50,
)
image.save("image.jpg")

Examples

Example code for HiDream-O1-Image is available at: /examples/hidream_o1_image/

| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
|---|---|---|---|---|---|---|
| HiDream-ai/HiDream-O1-Image | code | code | code | code | code | code |
| HiDream-ai/HiDream-O1-Image-Dev | code | code | code | code | code | code |
| DiffSynth-Studio/HidreamO1-i2L-v2 | code | code | code | code | - | - |

Ideogram 4: /docs/en/Model_Details/Ideogram-4.md

Quick Start

Running the following code will quickly load the ideogram-ai/ideogram-4-fp8 model and perform inference. The model can run with a minimum of 24GB VRAM.

from diffsynth.pipelines.ideogram4 import Ideogram4Pipeline from diffsynth.core import ModelConfig import torch

pipe = Ideogram4Pipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="transformer/diffusion_pytorch_model.safetensors"),
        # unconditional_transformer is optional. You can delete this line to reduce VRAM required.
        ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="unconditional_transformer/diffusion_pytorch_model.safetensors"),
        ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="text_encoder/model.safetensors"),
        ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="tokenizer/"),
)
prompt = r"""
{
    "high_level_description": "A medium-shot photograph of Formula 1 driver Max Verstappen wearing his Red Bull Racing racing suit and cap, smiling as he holds his racing helmet and talks to a man in a white shirt and black vest at a race track.",
    "style_description": {
        "aesthetics": "saturated primary colors, rule of thirds, joyful and triumphant",
        "lighting": "overcast daylight, diffused, soft subtle shadows",
        "photo": "shallow depth of field, sharp focus, eye-level, telephoto",
        "medium": "photograph"
    },
    "compositional_deconstruction": {
        "background": "The background is an out-of-focus racing paddock or track environment. Several blurred figures are visible, including one in an orange shirt. A purple and white structure with a red 'F1' logo stands on the left. The scene is outdoors with daylight, though the sky is not visible.",
        "elements": [
            {"type": "obj", "bbox": [55, 642, 1000, 937], "desc": "An older man standing in profile, facing left toward Max Verstappen. He has grey hair and fair skin. He is wearing a white long-sleeved button-down shirt with a navy blue quilted vest over it. He has a slight smile."},
            {"type": "obj", "bbox": [34, 137, 1000, 617], "desc": "Max Verstappen, a fair-skinned male Formula 1 driver, positioned in the center. He is facing forward with a joyful expression and a slight smile. He wears a navy blue Red Bull Racing team uniform with numerous sponsor logos and a matching baseball cap with the number '1'. He is holding a white and red racing helmet in his hands. He has a silver watch on his left wrist."},
            {"type": "obj", "bbox": [422, 212, 792, 452], "desc": "Max Verstappen's racing helmet, held in front of his chest. It features a white, red, and yellow design with the Red Bull logo and the 'Player 0.0' branding. The visor is clear and open."},
            {"type": "text", "bbox": [657, 0, 755, 142], "text": "F1", "desc": "Large, stylized red logo on a black and purple background in the lower left."},
            {"type": "text", "bbox": [768, 0, 818, 147], "text": "Formula 1\nWorld Championship™", "desc": "Small white sans-serif text below the F1 logo on the left side."},
            {"type": "text", "bbox": [78, 447, 117, 510], "text": "ORACLE\nRed Bull\nRacing", "desc": "Very small white and orange logo on the front of the navy blue cap."},
            {"type": "text", "bbox": [78, 417, 120, 440], "text": "1", "desc": "Bold red numeral '1' on the front left side of the navy blue cap."},
            {"type": "text", "bbox": [332, 442, 363, 483], "text": "Red Bull", "desc": "Small yellow and red text logo on the collar of the uniform."},
            {"type": "text", "bbox": [373, 490, 423, 532], "text": "RAUCH", "desc": "Small yellow and blue logo on the right chest of the uniform."},
            {"type": "text", "bbox": [422, 473, 500, 532], "text": "BYBIT\nHONDA", "desc": "Medium-sized white sans-serif text on the right chest of the uniform."},
            {"type": "text", "bbox": [410, 203, 442, 257], "text": "RAUCH", "desc": "Small yellow logo on the left upper arm of the uniform."},
            {"type": "text", "bbox": [530, 448, 627, 510], "text": "Red Bull", "desc": "Medium red text logo on the right side of the torso, part of the Red Bull graphic."},
            {"type": "text", "bbox": [680, 417, 768, 523], "text": "Red Bull", "desc": "Large red text logo across the lower torso of the uniform."},
            {"type": "text", "bbox": [797, 475, 815, 518], "text": "MAX", "desc": "Small white text next to a Dutch flag on the belt area of the uniform."},
            {"type": "text", "bbox": [558, 317, 715, 355], "text": "Player 0.0", "desc": "Black sans-serif text on a white band on the racing helmet."},
            {"type": "text", "bbox": [560, 800, 582, 835], "text": "IA.COM", "desc": "Small blue sans-serif text on the right sleeve of the white shirt."},
            {"type": "text", "bbox": [968, 8, 997, 332], "text": "© Anadolu Agency via Getty Images", "desc": "Small white watermark text in the bottom left corner."}
        ]
    }
}
"""
image = pipe(prompt=prompt, height=1024, width=1024, num_inference_steps=48, cfg_scale=7.0, seed=42)
image.save("image_ideogram-4-fp8.jpg")
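The prompt in the quick start is a JSON document with three top-level keys: `high_level_description`, `style_description`, and `compositional_deconstruction` (a `background` string plus a list of `elements`). Writing such a document by hand is error-prone, so the following is a minimal sketch of building it programmatically. The helpers `make_element` and `build_prompt` are hypothetical and not part of DiffSynth-Studio. The 0 to 1000 coordinate range is an assumption read off the example values, and the `[y_min, x_min, y_max, x_max]` ordering is only an inference from the element descriptions, so the sketch validates the range but not the order.

```python
import json


def make_element(kind, bbox, desc, text=None):
    """Build one entry of the "elements" list (hypothetical helper).

    Assumption: bbox holds four integers on a 0..1000 grid, as in the
    quick-start prompt. The axis order is not checked here.
    """
    if kind not in ("obj", "text"):
        raise ValueError(f"unknown element type: {kind!r}")
    if len(bbox) != 4 or not all(0 <= v <= 1000 for v in bbox):
        raise ValueError(f"bbox must be four values in 0..1000, got {bbox!r}")
    if kind == "text" and text is None:
        raise ValueError("text elements need the literal text to render")
    element = {"type": kind, "bbox": list(bbox)}
    if text is not None:
        element["text"] = text
    element["desc"] = desc
    return element


def build_prompt(high_level, style, background, elements):
    """Serialize the prompt using the key layout of the quick-start example."""
    document = {
        "high_level_description": high_level,
        "style_description": style,
        "compositional_deconstruction": {
            "background": background,
            "elements": list(elements),
        },
    }
    # ensure_ascii=False keeps characters such as "™" readable in the prompt.
    return json.dumps(document, ensure_ascii=False, indent=2)


prompt = build_prompt(
    high_level="A product photo of a red mug on a wooden table.",
    style={"lighting": "soft window light", "medium": "photograph"},
    background="A blurred kitchen counter.",
    elements=[
        make_element("obj", [300, 350, 800, 650], "A glossy red ceramic mug."),
        make_element("text", [500, 420, 600, 580], "White label on the mug.", text="COFFEE"),
    ],
)
```

The resulting string can be passed as `prompt=` in the same way as the handwritten raw string in the quick start.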

Examples

Example code for Ideogram 4 is available at: /examples/ideogram4/

| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
| --- | --- | --- | --- | --- | --- | --- |
| ideogram-ai/ideogram-4-fp8 | code | - | - | - | - | - |
| DiffSynth-Studio/ideogram-4-bf16-repackage | code | code | code | - | code | code |

Video Synthesis


LTX-2: /docs/en/Model_Details/LTX-2.md

Quick Start

Run the following code to load the Lightricks/LTX-2 model and run inference. VRAM management is enabled, so the framework adjusts how model parameters are loaded based on available GPU memory. The model can run with as little as 8 GB of VRAM.

import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2

vram_config = {
    "offload_dtype": torch.float8_e5m2,
    "offload_device": "cpu",
    "onload_dtype": torch.float8_e5m2,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e5m2,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
"""
Official model repo: https://www.modelscope.cn/models/Lightricks/LTX-2
Repackaged model repo: https://www.modelscope.cn/models/DiffSynth-Studio/LTX-2-Repackage
For the LTX-2 base models, both the official checkpoint
(model config ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-dev.safetensors"))
and the repackaged checkpoints
(model config ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="*.safetensors"))
are supported.
We have repackaged the official checkpoints in the DiffSynth-Studio/LTX-2-Repackage repo to support
loading submodules separately and to avoid redundant memory usage when only part of the model is needed.
"""

# Use the repackaged model config from "DiffSynth-Studio/LTX-2-Repackage" to avoid redundant model loading.
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
    stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Use the following model config instead if you want to initialize the model from the official checkpoints in "Lightricks/LTX-2".
# pipe = LTX2AudioVideoPipeline.from_pretrained(
#     torch_dtype=torch.bfloat16,
#     device="cuda",
#     model_configs=[
#         ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
#         ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-dev.safetensors", **vram_config),
#         ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
#     ],
#     tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
#     stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
#     vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
# )

# The inner double quotes are escaped so the string literal is valid Python.
prompt = "A girl is very happy, she is speaking: \"I enjoy working with Diffsynth-Studio, it's a perfect framework.\""
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    seed=43,
    height=height,
    width=width,
    num_frames=num_frames,
    tiled=True,
    use_two_stage_pipeline=True,
)
write_video_audio_ltx2(
    video=video,
    audio=audio,
    output_path='ltx2_twostage.mp4',
    fps=24,
    audio_sample_rate=24000,
)
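Two numbers in the quick start come from simple arithmetic that is easy to get wrong when adapting the script. `torch.cuda.mem_get_info` returns a `(free, total)` pair in bytes, so the `vram_limit` expression takes the total, converts it to GiB, and keeps 0.5 GiB in reserve. Likewise, 121 frames written at 24 fps give a clip of roughly five seconds. The sketch below reproduces both calculations in plain Python so it runs without a GPU. The function names are illustrative, not part of DiffSynth-Studio, and the clamp at zero is an addition of this sketch.

```python
GIB = 1024 ** 3  # bytes per GiB, matching the (1024 ** 3) divisor in the quick start


def vram_limit_gib(total_bytes, reserve_gib=0.5):
    """Turn a total-memory reading in bytes into a GiB budget minus a reserve.

    Mirrors `torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5`.
    Clamping at 0.0 is an assumption added here, not framework behavior.
    """
    if total_bytes < 0 or reserve_gib < 0:
        raise ValueError("total_bytes and reserve_gib must be non-negative")
    return max(total_bytes / GIB - reserve_gib, 0.0)


def clip_seconds(num_frames, fps):
    """Duration of a clip with num_frames frames played back at fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


budget = vram_limit_gib(8 * GIB)      # an 8 GiB card leaves a 7.5 GiB budget
duration = clip_seconds(121, 24)      # about 5.04 seconds
print(f"vram_limit={budget} GiB, clip length={duration:.2f} s")
```

On a machine with a GPU, the first argument would be `torch.cuda.mem_get_info("cuda")[1]`, exactly as in the quick start.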

Examples

Example code for LTX-2 is available at: /examples/ltx2/

| Model ID | Extra Args | Inference | Low-VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| jd-opensource/JoyAI-Echo | | code | code | code | code | code | code |
| Lightricks/LTX-2.3: OneStagePipeline-I2AV | input_images | code | code | code | code | code | code |
| Lightricks/LTX-2.3: TwoStagePipeline-I2AV | input_images | code | code | - | - | - | - |
| Lightricks/LTX-2.3: DistilledPipeline-I2AV | input_images | code | code | - | - | - | - |
| Lightricks/LTX-2.3: OneStagePipeline-T2AV | | code | code | code | code | code | code |
| Lightricks/LTX-2.3: TwoStagePipeline-T2AV | | code | code | - | - | - | - |
| Lightricks/LTX-2.3: DistilledPipeline-T2AV | | code | code | - | - | - | - |
| Lightricks/LTX-2.3: A2V | retake_audio, audio_sample_rate, retake_audio_regions | code | code | - | - | - | - |
| Lightricks/LTX-2.3: Retake | retake_video, retake_video_regions, retake_audio, audio_sample_rate, retake_audio_regions | code | code | - | - | - | - |
| Lightricks/LTX-2.3-22b-IC-LoRA-Union-Control | in_context_videos, in_context_downsample_factor | code | code | - | - | code | code |
| Lightricks/LTX-2.3-22b-IC-LoRA-Motion-Track-Control | in_context_videos, in_context_downsample_factor | code | code | - | - | code | code |
| Lightricks/LTX-2: OneStagePipeline-T2AV | | code | code | code | code | code | code |
| Lightricks/LTX-2-19b-IC-LoRA-Union-Control | in_context_videos, in_context_downsample_factor | code | code | - | - | code | code |
| Lightricks/LTX-2-19b-IC-LoRA-Detailer | in_context_videos, in_context_downsample_factor | code | code | - | - | code | code |
| Lightricks/LTX-2: TwoStagePipeline-T2AV | | code | code | - | - | - | - |
| Lightricks/LTX-2: DistilledPipeline-T2AV | | code | code | - | - | - | - |
| Lightricks/LTX-2: OneStagePipeline-I2AV | input_images | code | code | - | - | - | - |
| Lightricks/LTX-2: TwoStagePipeline-I2AV | input_images | code | code | - | - | - | - |
| Lightricks/LTX-2: DistilledPipeline-I2AV | input_images | code | code | - | - | - | - |
| Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-In | | code | code | - | - | - | - |
| Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out | | code | code | - | - | - | - |
| Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Left | | code | code | - | - | - | - |
| Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Right | | code | code | - | - | - | - |
| Lightricks/LTX-2-19b-LoRA-Camera-Control-Jib-Up | | code | code | - | - | - | - |
| Lightricks/LTX-2-19b-LoRA-Camera-Control-Jib-Down | | code | code | - | - | - | - |
| Lightricks/LTX-2-19b-LoRA-Camera-Control-Static | | code | code | - | - | - | - |

Wan: /docs/en/Model_Details/Wan.md

Quick Start

Running the following code will quickly load the Wan-AI/Wan2.1-T2V-1.3B model for inference. VRAM management is enabled, and the framework automatically adjusts model parameter loading based on available GPU memory. The model can run with as little as 8 GB of VRAM.

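import torch
from diffsynth.utils.data import save_video, VideoData
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth", **vram_config),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 2,
)

video = pipe(
    # Prompt (Chinese): a documentary-photography-style shot of a lively puppy running quickly across
    # lush green grass. The puppy has brownish-yellow fur, upright ears, and a focused, joyful expression.
    # Sunlight makes its fur look especially soft and shiny. The background is an open meadow dotted with
    # wildflowers, with blue sky and a few white clouds in the distance. Strong perspective captures the
    # motion of the running puppy and the vitality of the grass. Medium shot, side-on tracking view.
    prompt="纪实摄影风格画面,一只活泼的小狗在绿茵茵的草地上迅速奔跑。小狗毛色棕黄,两只耳朵立起,神情专注而欢快。阳光洒在它身上,使得毛发看上去格外柔软而闪亮。背景是一片开阔的草地,偶尔点缀着几朵野花,远处隐约可见蓝天和几片白云。透视感鲜明,捕捉小狗奔跑时的动感和四周草地的生机。中景侧面移动视角。",
    # Negative prompt (Chinese): garish colors, overexposed, static, blurry details, subtitles, style,
    # artwork, painting, picture, still, overall gray, worst quality, low quality, JPEG compression
    # artifacts, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn faces, deformed,
    # disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs,
    # many people in the background, walking backwards.
    negative_prompt="色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
    seed=0,
    tiled=True,
)
save_video(video, "video.mp4", fps=15, quality=5)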
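Every `vram_config` in this README has the same shape: a dtype and a device for each of four stages named `offload`, `onload`, `preparing`, and `computation`. A typo in one of the eight keys is easy to miss, so here is a minimal sketch of a builder that generates the keys from the stage names. `make_vram_config` is a hypothetical helper, not part of DiffSynth-Studio, and the string dtypes below are placeholders so that the sketch runs without PyTorch. In a real script you would pass values such as `torch.bfloat16`.

```python
STAGES = ("offload", "onload", "preparing", "computation")


def make_vram_config(**stages):
    """Build a vram_config dict from one (dtype, device) pair per stage.

    Hypothetical helper: it only checks that exactly the four stage names
    used in this README are given, then expands them to the eight keys.
    """
    missing = [name for name in STAGES if name not in stages]
    unknown = [name for name in stages if name not in STAGES]
    if missing or unknown:
        raise ValueError(f"missing stages: {missing}, unknown stages: {unknown}")
    config = {}
    for name in STAGES:
        dtype, device = stages[name]
        config[f"{name}_dtype"] = dtype
        config[f"{name}_device"] = device
    return config


# Same layout as the Wan quick start, with "bfloat16" standing in for torch.bfloat16.
vram_config = make_vram_config(
    offload=("disk", "disk"),
    onload=("bfloat16", "cpu"),
    preparing=("bfloat16", "cuda"),
    computation=("bfloat16", "cuda"),
)
print(vram_config)
```

The returned dictionary can be unpacked into `ModelConfig(..., **vram_config)` exactly like the handwritten one.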

Model Lineage

graph LR;
    Wan-Series-->Wan2.1-Series;
    Wan-Series-->Wan2.2-Series;
    Wan2.1-Series-->Wan-AI/Wan2.1-T2V-1.3B;
    Wan2.1-Series-->Wan-AI/Wan2.1-T2V-14B;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.1-I2V-14B-480P;
    Wan-AI/Wan2.1-I2V-14B-480P-->Wan-AI/Wan2.1-I2V-14B-720P;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.1-FLF2V-14B-720P;
    Wan-AI/Wan2.1-T2V-1.3B-->iic/VACE-Wan2.1-1.3B-Preview;
    iic/VACE-Wan2.1-1.3B-Preview-->Wan-AI/Wan2.1-VACE-1.3B;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.1-VACE-14B;
    Wan-AI/Wan2.1-T2V-1.3B-->Wan2.1-Fun-1.3B-Series;
    Wan2.1-Fun-1.3B-Series-->PAI/Wan2.1-Fun-1.3B-InP;
    Wan2.1-Fun-1.3B-Series-->PAI/Wan2.1-Fun-1.3B-Control;
    Wan-AI/Wan2.1-T2V-14B-->Wan2.1-Fun-14B-Series;
    Wan2.1-Fun-14B-Series-->PAI/Wan2.1-Fun-14B-InP;
    Wan2.1-Fun-14B-Series-->PAI/Wan2.1-Fun-14B-Control;
    Wan-AI/Wan2.1-T2V-1.3B-->Wan2.1-Fun-V1.1-1.3B-Series;
    Wan2.1-Fun-V1.1-1.3B-Series-->PAI/Wan2.1-Fun-V1.1-1.3B-Control;
    Wan2.1-Fun-V1.1-1.3B-Series-->PAI/Wan2.1-Fun-V1.1-1.3B-InP;
    Wan2.1-Fun-V1.1-1.3B-Series-->PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera;
    Wan-AI/Wan2.1-T2V-14B-->Wan2.1-Fun-V1.1-14B-Series;
    Wan2.1-Fun-V1.1-14B-Series-->PAI/Wan2.1-Fun-V1.1-14B-Control;
    Wan2.1-Fun-V1.1-14B-Series-->PAI/Wan2.1-Fun-V1.1-14B-InP;
    Wan2.1-Fun-V1.1-14B-Series-->PAI/Wan2.1-Fun-V1.1-14B-Control-Camera;
    Wan-AI/Wan2.1-T2V-1.3B-->DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1;
    Wan-AI/Wan2.1-T2V-14B-->krea/krea-realtime-video;
    Wan-AI/Wan2.1-T2V-14B-->meituan-longcat/LongCat-Video;
    Wan-AI/Wan2.1-I2V-14B-720P-->ByteDance/Video-As-Prompt-Wan2.1-14B;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.2-Animate-14B;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.2-S2V-14B;
    Wan2.2-Series-->Wan-AI/Wan2.2-T2V-A14B;
    Wan2.2-Series-->Wan-AI/Wan2.2-I2V-A14B;
    Wan2.2-Series-->Wan-AI/Wan2.2-TI2V-5B;
    Wan-AI/Wan2.2-T2V-A14B-->Wan2.2-Fun-Series;
    Wan2.2-Fun-Series-->PAI/Wan2.2-VACE-Fun-A14B;
    Wan2.2-Fun-Series-->PAI/Wan2.2-Fun-A14B-InP;
    Wan2.2-Fun-Series-->PAI/Wan2.2-Fun-A14B-Control;
    Wan2.2-Fun-Series-->PAI/Wan2.2-Fun-A14B-Control-Camera;
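The lineage graph is a list of `parent-->child` edges in Mermaid syntax, which makes it straightforward to query in code, for example to find which base model a fine-tuned variant descends from. The following is a minimal sketch that parses such edges and walks from a model back to the root. The function names are illustrative, and the `GRAPH` string is only a short excerpt of the edges listed above.

```python
from collections import defaultdict

# Excerpt of the lineage graph above, kept short for illustration.
GRAPH = (
    "graph LR; Wan-Series-->Wan2.1-Series; "
    "Wan2.1-Series-->Wan-AI/Wan2.1-T2V-14B; "
    "Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.1-I2V-14B-480P; "
    "Wan-AI/Wan2.1-I2V-14B-480P-->Wan-AI/Wan2.1-I2V-14B-720P;"
)


def parse_edges(mermaid):
    """Map each child node to the list of its parents."""
    parents = defaultdict(list)
    for statement in mermaid.split(";"):
        statement = statement.strip()
        if "-->" not in statement:
            continue  # skips the "graph LR" header and empty fragments
        parent, child = (part.strip() for part in statement.split("-->", 1))
        parents[child].append(parent)
    return dict(parents)


def lineage(model, parents):
    """Follow the first parent of each node until a root is reached."""
    chain = [model]
    seen = {model}
    while chain[-1] in parents:
        parent = parents[chain[-1]][0]
        if parent in seen:  # guard against accidental cycles
            break
        chain.append(parent)
        seen.add(parent)
    return chain


parents = parse_edges(GRAPH)
print(" <- ".join(lineage("Wan-AI/Wan2.1-I2V-14B-720P", parents)))
```

Splitting on the first `-->` only is deliberate: node names such as `Wan-Series` contain single hyphens, which must not be treated as separators.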


Examples

Example code for Wan is available at: /examples/wanvideo/

| Model ID | Extra Inputs | Inference | Low VRAM Inference | Full Training | Validation After Full Training | LoRA Training | Validation After LoRA Training |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wan-AI/Wan2.1-T2V-1.3B | | code | code | code | code | code | code |
| Wan-AI/Wan2.1-T2V-14B | | code | code | code | code | code | code |
| Wan-AI/Wan2.1-I2V-14B-480P | input_image | code | code | code | code | code | code |
| Wan-AI/Wan2.1-I2V-14B-720P | input_image | code | code | code | code | code | code |
| Wan-AI/Wan2.1-FLF2V-14B-720P | input_image, end_image | code | code | code | code | code | code |
| iic/VACE-Wan2.1-1.3B-Preview | vace_control_video, vace_reference_image | code | code | code | code | code | code |
| Wan-AI/Wan2.1-VACE-1.3B | vace_control_video, vace_reference_image | code | code | code | code | code | code |
| Wan-AI/Wan2.1-VACE-14B | vace_control_video, vace_reference_image | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-1.3B-InP | input_image, end_image | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-1.3B-Control | control_video | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-14B-InP | input_image, end_image | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-14B-Control | control_video | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-Control | control_video, reference_image | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-Control | control_video, reference_image | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-InP | input_image, end_image | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-InP | input_image, end_image | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera | control_camera_video, input_image | code | code | code | code | code | code |
| PAI/Wan2.1-Fun-V1.1-14B-Control-Camera | control_camera_video, input_image | code | code | code | code | code | code |
| DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1 | motion_bucket_id | code | code | code | code | code | code |
| krea/krea-realtime-video | | code | code | code | code | code | code |
| meituan-longcat/LongCat-Video | longcat_video | code | code | code | code | code | code |
| ByteDance/Video-As-Prompt-Wan2.1-14B | vap_video, vap_prompt | code | code | code | code | code | code |
| Wan-AI/Wan2.2-T2V-A14B | | code | code | code | code | code | code |
| Wan-AI/Wan2.2-I2V-A14B | input_image | code | code | code | code | code | code |
| Wan-AI/Wan2.2-TI2V-5B | input_image | code | code | code | code | code | code |
| Wan-AI/Wan2.2-Animate-14B | input_image, animate_pose_video, animate_face_video, animate_inpaint_video, animate_mask_video | code | code | code | code | code | code |
| Wan-AI/Wan2.2-S2V-14B | input_image, input_audio, audio_sample_rate, s2v_pose_video | code | code | code | code | code | code |
| PAI/Wan2.2-VACE-Fun-A14B | vace_control_video, vace_reference_image | code | code | code | code | code | code |
| PAI/Wan2.2-Fun-A14B-InP | input_image, end_image | code | code | code | code | code | code |
| PAI/Wan2.2-Fun-A14B-Control | control_video, reference_image | code | code | code | code | code | code |
| PAI/Wan2.2-Fun-A14B-Control-Camera | control_camera_video, input_image | code | code | code | code | code | code |
| openmoss/MOVA-360p | input_image | code | code | code | code | code | code |
| openmoss/MOVA-720p | input_image | code | code | code | code | code | code |
| Wan-AI/Wan2.2-Dancer-14B (global model) | wantodance_music_path, wantodance_reference_image, wantodance_fps, wantodance_keyframes, wantodance_keyframes_mask | code | code | code | code | code | code |
| Wan-AI/Wan2.2-Dancer-14B (local model) | wantodance_music_path, wantodance_reference_image, wantodance_fps, wantodance_keyframes, wantodance_keyframes_mask | code | code | code | code | code | code |

Audio Synthesis

ACE-Step: /docs/en/Model_Details/ACE-Step.md

Quick Start

Run the following code to load the ACE-Step/Ace-Step1.5 model and run inference. VRAM management is enabled, so the framework controls how model parameters are loaded based on available VRAM. The model can run with as little as 3 GB of VRAM.

from diffsynth.pipelines.ace_step import AceStepPipeline, ModelConfig
from diffsynth.utils.data.audio import save_audio
import torch

vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

pipe = AceStepPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="ACE-Step/Ace-Step1.5", origin_file_pattern="acestep-v15-turbo/model.safetensors", **vram_config),
        ModelConfig(model_id="ACE-Step/Ace-Step1.5", origin_file_pattern="Qwen3-Embedding-0.6B/model.safetensors", **vram_config),
        ModelConfig(model_id="ACE-Step/Ace-Step1.5", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    text_tokenizer_config=ModelConfig(model_id="ACE-Step/Ace-Step1.5", origin_file_pattern="Qwen3-Embedding-0.6B/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

prompt = "An explosive, high-energy pop-rock track with a strong anime theme song feel. The song kicks off with a catchy, synthesized brass fanfare over a driving rock beat with punchy drums and a solid bassline. A powerful, clear male vocal enters with a theatrical and energetic delivery, soaring through the verses and hitting powerful high notes in the chorus. The arrangement is dense and dynamic, featuring rhythmic electric guitar chords, brief instrumental breaks with synth flourishes, and a consistent, danceable groove throughout. The overall mood is triumphant, adventurous, and exhilarating."
# The lyrics are in Chinese to match vocal_language="zh".
lyrics = '[Intro - Synth Brass Fanfare]\n\n[Verse 1]\n黑夜里的风吹过耳畔\n甜蜜时光转瞬即万\n脚步飘摇在星光上\n心追节奏心跳狂乱\n耳边传来电吉他呼唤\n手指轻触碰点流点燃\n梦在云端任它蔓延\n疯狂跳跃自由无间\n\n[Chorus]\n心电感应在震动间\n拥抱未来勇敢冒险\n那旋律在心中无限\n世界变得如此耀眼\n\n[Instrumental Break - Synth Brass Melody]\n\n[Verse 2]\n鼓点撞击黑夜的底端\n跳动节拍连接你我俩\n在这里让灵魂发光\n燃尽所有不留遗憾\n\n[Instrumental Break - Synth Brass Melody]\n\n[Bridge]\n光影交错彼此的视线\n霓虹之下夜空的蔚蓝\n月光洒下温热心田\n追逐梦想它不会遥远\n\n[Chorus]\n心电感应在震动间\n拥抱未来勇敢冒险\n那旋律在心中无限\n世界变得如此耀眼\n\n[Outro - Instrumental with Synth Brass Melody]\n[Song ends abruptly]'
audio = pipe(
    prompt=prompt,
    lyrics=lyrics,
    duration=160,
    bpm=100,
    keyscale="B minor",
    timesignature="4",
    vocal_language="zh",
    seed=42,
)

save_audio(audio, pipe.vae.sampling_rate, "acestep-v15-turbo.wav")
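The `lyrics` string in the quick start follows a simple layout: each section opens with a bracketed tag such as `[Verse 1]`, lyric lines follow one per line, and sections are separated by a blank line. Assembling that string from structured data keeps the `\n` escapes out of sight. The sketch below assumes only this layout as seen in the example. `build_lyrics` is a hypothetical helper, not part of DiffSynth-Studio, and which tag names the model understands is not something this sketch knows.

```python
def build_lyrics(sections):
    """Join (tag, lines) pairs into the bracket-tagged layout used above.

    Each section becomes "[tag]" followed by its lines, and sections are
    separated by one blank line. Tags are passed without brackets.
    """
    blocks = []
    for tag, lines in sections:
        if not tag or "[" in tag or "]" in tag:
            raise ValueError(f"tag must be non-empty and unbracketed: {tag!r}")
        blocks.append("\n".join([f"[{tag}]", *lines]))
    return "\n\n".join(blocks)


lyrics = build_lyrics([
    ("Intro - Synth Brass Fanfare", []),
    ("Verse 1", ["First line of the verse", "Second line of the verse"]),
    ("Chorus", ["A line of the chorus"]),
])
print(lyrics)
```

A trailing marker that shares a section, such as the `[Song ends abruptly]` line at the end of the quick-start lyrics, can be passed as an ordinary line of the final section.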

Examples

Example code for ACE-Step is available at: /examples/ace_step/

| Model ID | Inference | Low VRAM Inference | Full Training | Full Training Validation | LoRA Training | LoRA Training Validation |
| --- | --- | --- | --- | --- | --- | --- |
| ACE-Step/Ace-Step1.5 | code | code | code | code | code | code |
| ACE-Step/acestep-v15-turbo-shift1 | code | code | code | code | code | code |
| ACE-Step/acestep-v15-turbo-shift3 | code | code | code | code | code | code |
| ACE-Step/acestep-v15-turbo-continuous | code | code | code | code | code | code |
| ACE-Step/acestep-v15-base | code | code | code | code | code | code |
| ACE-Step/acestep-v15-base: CoverTask | code | code | | | | |
| ACE-Step/acestep-v15-base: RepaintTask | code | code | | | | |
| ACE-Step/acestep-v15-sft | code | code | code | code | code | code |
| ACE-Step/acestep-v15-xl-base | code | code | code | code | code | code |
| ACE-Step/acestep-v15-xl-sft | code | code | code | code | code | code |
| ACE-Step/acestep-v15-xl-turbo | code | code | code | code | code | code |
| DiffSynth-Studio/acestep15xlsft-lora-music | code | code | code | code | - | - |

Image Quality Metrics Models

/docs/en/Model_Details/Image-Quality-Metrics.md

Quick Start

Run the following code to quickly load PickScore and evaluate an image against a text prompt. The default model will be downloaded from ModelScope to ./models.

from diffsynth.metrics import PickScoreMetric, ModelConfig
from modelscope import dataset_snapshot_download
from PIL import Image

# Download the example images used below.
dataset_snapshot_download(
    "DiffSynth-Studio/diffsynth_example_dataset",
    allow_file_pattern="flux/FLUX.1-dev/*",
    local_dir="./data/diffsynth_example_dataset",
)
image = Image.open("data/diffsynth_example_dataset/flux/FLUX.1-dev/1.jpg").convert("RGB")
prompt = "a dog"
metric = PickScoreMetric.from_pretrained(
    model_config=ModelConfig(model_id="DiffSynth-Studio/ImageMetrics", origin_file_pattern="PickScore/model.safetensors"),
    device="cuda",
)
score = metric.compute(prompt, image)[0]
print(f"PickScore score: {score:.3f}")
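A common use of a prompt-conditioned metric is picking the best of several candidate images for one prompt. The following is a minimal sketch of that ranking step. It takes the scorer as a plain callable so it runs without any model, `rank_candidates` is a hypothetical helper rather than a DiffSynth-Studio API, and it assumes that a higher score means a better match, which should be checked for each metric.

```python
def rank_candidates(prompt, candidates, score_fn):
    """Return (index, score) pairs sorted from best to worst.

    score_fn(prompt, candidate) must return a number. Ties keep the
    original candidate order. Assumes higher scores are better.
    """
    scored = [(float(score_fn(prompt, item)), index) for index, item in enumerate(candidates)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(index, score) for score, index in scored]


# Stand-in scorer for illustration only: longer "images" score higher.
def fake_score(prompt, candidate):
    return len(candidate)


ranking = rank_candidates("a dog", ["a", "abc", "ab"], fake_score)
print(ranking)
```

With the quick-start objects, the scorer could be wrapped as `lambda p, img: metric.compute(p, img)[0]`, which reuses the call shown above.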

Example Code

Example code for image quality metrics models can be found at: /examples/image_quality_metric/

| Metric | GitHub Repository | Example Code |
| --- | --- | --- |
| PickScore | GitHub | code |
| ImageReward | GitHub | code |
| HPSv2 | GitHub | code |
| HPSv3 | GitHub | code |
| CLIP Score | GitHub | code |
| Aesthetic | GitHub | code |
| FID | GitHub | code |

Innovative Achievements

DiffSynth-Studio is not just an engineered model framework, but also an incubator for innovative achievements.

Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation

[Image comparison: FLUX.1-dev, FLUX.1-dev + SES, Qwen-Image, Qwen-Image + SES]

VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

[Image panels: Example 1, Example 2, Query, Output]

AttriCtrl: Attribute Intensity Control for Image Generation Models

[Image panels: brightness scale = 0.1, 0.3, 0.5, 0.7, 0.9]

AutoLoRA: Automated LoRA Retrieval and Fusion

[Image grid: rows LoRA 1 to LoRA 4, columns LoRA 1 to LoRA 4]

Nexus-Gen: Unified Architecture for Image Understanding, Generation, and Editing

ArtAug: Aesthetic Enhancement for Image Generation Models

[Image comparison: FLUX.1-dev, FLUX.1-dev + ArtAug LoRA]

EliGen: Precise Image Partition Control

[Image panels: Entity Control Region, Generated Image]

ExVideo: Extended Training for Video Generation Models

Contact Us

Discord: https://discord.gg/Mm9suEeUDc