GitHub - modelscope/DiffSynth-Studio: Enjoy the magic of Diffusion models!

Introduction

DiffSynth-Studio Documentation: Chinese version, English version

Welcome to the magical world of Diffusion models! DiffSynth-Studio is an open-source Diffusion model engine developed and maintained by the ModelScope Community. We hope to foster technological innovation through framework construction, aggregate the power of the open-source community, and explore the boundaries of generative model technology!

DiffSynth currently includes two open-source projects:

DiffSynth-Studio: Focused on aggressive technical exploration, targeting academia, and providing cutting-edge model capability support.
DiffSynth-Engine: Focused on stable model deployment, targeting industry, and providing higher computational performance and more stable features.

DiffSynth-Studio and DiffSynth-Engine are the core engines of the ModelScope AIGC zone. You are welcome to try the product features we have built on them:

ModelScope AIGC Zone (for Chinese users): https://modelscope.cn/aigc/home
ModelScope Civision (for global users): https://modelscope.ai/civision/home

We believe that a well-developed open-source framework lowers the barrier to technical exploration. We have built many interesting techniques on this codebase. Perhaps you also have wild ideas of your own; with DiffSynth-Studio, you can realize them quickly. For this reason, we have prepared detailed documentation for developers. We hope these documents help developers understand the principles of Diffusion models, and we look forward to expanding the boundaries of the technology together with you.

Update History

DiffSynth-Studio has undergone major version updates, and some old features are no longer maintained. If you need to use old features, please switch to the last historical version before the major version update.

This project currently has a small development team, with most of the work handled by Artiprocher and mi804. New features will therefore arrive relatively slowly, and our capacity to respond to and resolve issues is limited. We apologize for this and ask for your understanding.

June 16, 2026: We have added a new Template model for ACE-Step: vocals2music. For more details, please refer to the documentation and example code.
June 15, 2026 We have open-sourced Image-to-LoRA V2, compressing the hours-long training process for image style LoRAs into a single model inference step, thereby exploring a new paradigm for LoRA model training. The technical report has been released. This release includes three models:
- DiffSynth-Studio/ZImage-i2L-v2: Adapted for the Z-Image model
- DiffSynth-Studio/KleinBase4B-i2L-v2: Adapted for the FLUX.2-klein-base-4B model
- DiffSynth-Studio/HidreamO1-i2L-v2: Adapted for the Hidream-O1-Image model
June 5, 2026 Ideogram 4 open-sourced. Support includes text-to-image inference. For details, please refer to the documentation and example code.
May 21, 2026: Added support for image quality metrics models, including FID, CLIP, Aesthetic, PickScore, ImageReward, HPSv2, and HPSv3. For details, refer to the documentation and example code.
May 18, 2026 Added CPU Offload Training support. By moving model weights layer-by-layer between CPU and GPU, it significantly reduces GPU VRAM usage during training, enabling LoRA training of large models even on consumer-grade GPUs, compatible with all models. Simply add --enable_model_cpu_offload to your training command to enable (currently supports single-GPU training only). For details, see the documentation.
May 14, 2026 HiDream-O1-Image open-sourced, welcome a new member to the image model family! Support includes text-to-image generation, image editing, low VRAM inference, and training capabilities. For details, please refer to the documentation and example code.
April 28, 2026 🔥 We are excited to announce the release of Diffusion Templates, a plugin framework designed for Diffusion models that significantly lowers the barrier to training controllable generative models. Let's explore this cutting-edge technology together!
- Open-source code: DiffSynth-Studio
- Technical report: arXiv
- Project homepage: GitHub
- Documentation: English Version | Chinese Version
- Online demo: ModelScope
- Model collections: ModelScope | ModelScope International | HuggingFace
- Datasets: ModelScope | ModelScope International | HuggingFace
April 27, 2026 We support ACE-Step-1.5! Support includes text-to-music generation, low VRAM inference, and LoRA training capabilities. For details, please refer to the documentation and example code.
April 27, 2026: We have reinstated support for the Stable Diffusion v1.5 and SDXL models, providing academic research support exclusively for these two model types.
April 14, 2026 JoyAI-Image open-sourced, welcome a new member to the image editing model family! Support includes instruction-guided image editing, low VRAM inference, and training capabilities. For details, please refer to the documentation and example code.
March 19, 2026: Added support for openmoss/MOVA-720p and openmoss/MOVA-360p models, including training and inference capabilities. Documentation and example code are now available.
March 12, 2026: We have added support for the LTX-2.3 audio-video generation model. Supported features include text-to-audio/video, image-to-audio/video, IC-LoRA control, audio-to-video, and audio-video inpainting, with complete inference and training functionality. For details, please refer to the documentation and code.
March 3, 2026: We released the DiffSynth-Studio/Qwen-Image-Layered-Control-V2 model, which is an updated version of Qwen-Image-Layered-Control. In addition to the originally supported text-guided functionality, it adds brush-controlled layer separation capabilities.
March 2, 2026 Added support for Anima. For details, please refer to the documentation. This is an interesting anime-style image generation model. We look forward to its future updates.
February 26, 2026 Added full and LoRA training support for the LTX-2 audio-video generation model. See the documentation for details.
February 10, 2026 Added inference support for the LTX-2 audio-video generation model. See the documentation for details. Support for model training will be implemented in the future.
February 2, 2026 The first document of the Research Tutorial series is now available, guiding you through training a small 0.1B text-to-image model from scratch. For details, see the documentation and model. We hope DiffSynth-Studio can evolve into a more powerful training framework for Diffusion models.
January 27, 2026: Z-Image is released, and our Z-Image-i2L model is released concurrently. You can use it in ModelScope Studios. For details, see the documentation.
January 19, 2026: Added support for FLUX.2-klein-4B and FLUX.2-klein-9B models, including training and inference capabilities. Documentation and example code are now available.
January 12, 2026: We trained and open-sourced a text-guided image layer separation model (Model Link). Given an input image and a textual description, the model isolates the image layer corresponding to the described content. For more details, please refer to our blog post (Chinese version, English version).
December 24, 2025: Based on Qwen-Image-Edit-2511, we trained an In-Context Editing LoRA model (Model Link). This model takes three images as input (Image A, Image B, and Image C), and automatically analyzes the transformation from Image A to Image B, then applies the same transformation to Image C to generate Image D. For more details, please refer to our blog post (Chinese version, English version).
December 9, 2025 We release a wild model based on DiffSynth-Studio 2.0: Qwen-Image-i2L (Image-to-LoRA). This model takes an image as input and outputs a LoRA. Although this version still has significant room for improvement in terms of generalization, detail preservation, and other aspects, we are open-sourcing these models to inspire more innovative research. For more details, please refer to our blog.
December 4, 2025 DiffSynth-Studio 2.0 released! Many new features are now available:
- Documentation online: Our documentation is still continuously being optimized and updated
- VRAM Management module upgraded, supporting layer-level disk offload, releasing both memory and VRAM simultaneously
- New model support
  * Z-Image Turbo: Model, Documentation, Code
  * FLUX.2-dev: Model, Documentation, Code
- Training framework upgrade
  * Split Training: Supports automatically splitting the training process into two stages: data processing and training (even for training ControlNet or any other model). Computations that do not require gradient backpropagation, such as text encoding and VAE encoding, are performed during the data processing stage, while other computations are handled during the training stage. Faster speed, less VRAM requirement.
  * Differential LoRA Training: This is a training technique we used in ArtAug, now available for LoRA training of any model.
  * FP8 Training: FP8 can be applied to any non-training model during training, i.e., models with gradients turned off or gradients that only affect LoRA weights.
November 4, 2025 Supported the ByteDance/Video-As-Prompt-Wan2.1-14B model, which is trained based on Wan 2.1 and supports generating corresponding actions based on reference videos.
October 30, 2025 Supported the meituan-longcat/LongCat-Video model, which supports text-to-video, image-to-video, and video continuation. This model uses the Wan framework for inference and training in this project.
October 27, 2025 Supported the krea/krea-realtime-video model, adding another member to the Wan model ecosystem.
September 23, 2025 DiffSynth-Studio/Qwen-Image-EliGen-Poster released! This model was jointly developed and open-sourced by us and Taobao Experience Design Team. Built upon Qwen-Image, the model is specifically designed for e-commerce poster scenarios, supporting precise partition layout control. Please refer to our sample code.
September 9, 2025 Our training framework supports various training modes. Currently adapted for Qwen-Image, in addition to the standard SFT training mode, Direct Distill is now supported. Please refer to our sample code. This feature is experimental, and we will continue to improve it to support more comprehensive model training functions.
August 28, 2025 We support Wan2.2-S2V, an audio-driven cinematic video generation model. See ./examples/wanvideo/.
August 21, 2025 DiffSynth-Studio/Qwen-Image-EliGen-V2 released! Compared to the V1 version, the training dataset has been changed to Qwen-Image-Self-Generated-Dataset, so the generated images better conform to Qwen-Image's own image distribution and style. Please refer to our sample code.
August 21, 2025 We open-sourced the DiffSynth-Studio/Qwen-Image-In-Context-Control-Union structural control LoRA model, adopting the In Context technical route, supporting multiple categories of structural control conditions, including canny, depth, lineart, softedge, normal, and openpose. Please refer to our sample code.
August 20, 2025 We open-sourced the DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix model, improving the editing effect of Qwen-Image-Edit on low-resolution image inputs. Please refer to our sample code.
August 19, 2025 Qwen-Image-Edit open-sourced, welcome a new member to the image editing model family!
August 18, 2025 We trained and open-sourced the Qwen-Image inpainting ControlNet model DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint. The model structure adopts a lightweight design. Please refer to our sample code.
August 15, 2025 We open-sourced the Qwen-Image-Self-Generated-Dataset dataset. This is an image dataset generated using the Qwen-Image model, containing 160,000 1024 x 1024 images. It includes general, English text rendering, and Chinese text rendering subsets. We provide annotations for image descriptions, entities, and structural control images for each image. Developers can use this dataset to train Qwen-Image models' ControlNet and EliGen models. We aim to promote technological development through open-sourcing!
August 13, 2025 We trained and open-sourced the Qwen-Image ControlNet model DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth. The model structure adopts a lightweight design. Please refer to our sample code.
August 12, 2025 We trained and open-sourced the Qwen-Image ControlNet model DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny. The model structure adopts a lightweight design. Please refer to our sample code.
August 11, 2025 We open-sourced the distilled acceleration model DiffSynth-Studio/Qwen-Image-Distill-LoRA for Qwen-Image, following the same training process as DiffSynth-Studio/Qwen-Image-Distill-Full, but the model structure has been modified to LoRA, thus being better compatible with other open-source ecosystem models.
August 7, 2025 We open-sourced the entity control LoRA model DiffSynth-Studio/Qwen-Image-EliGen for Qwen-Image. Qwen-Image-EliGen can achieve entity-level controlled text-to-image generation. Technical details can be found in the paper. Training dataset: EliGenTrainSet.
August 5, 2025 We open-sourced the distilled acceleration model DiffSynth-Studio/Qwen-Image-Distill-Full for Qwen-Image, achieving approximately 5x acceleration.
August 4, 2025 Qwen-Image open-sourced, welcome a new member to the image generation model family!
August 1, 2025 FLUX.1-Krea-dev open-sourced, a text-to-image model focused on aesthetic photography. We provided comprehensive support in a timely manner, including low VRAM layer-by-layer offload, LoRA training, and full training. For more details, please refer to ./examples/flux/.
July 28, 2025 Wan 2.2 open-sourced. We provided comprehensive support in a timely manner, including low VRAM layer-by-layer offload, FP8 quantization, sequence parallelism, LoRA training, and full training. For more details, please refer to ./examples/wanvideo/.
July 11, 2025 We propose Nexus-Gen, a unified framework that combines the language reasoning capabilities of Large Language Models (LLMs) with the image generation capabilities of diffusion models. This framework supports seamless image understanding, generation, and editing tasks.
- Paper: Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
- GitHub Repository: https://github.com/modelscope/Nexus-Gen
- Model: ModelScope, HuggingFace
- Training Dataset: ModelScope Dataset
- Online Experience: ModelScope Nexus-Gen Studio
June 15, 2025 ModelScope's official evaluation framework EvalScope now supports text-to-image generation evaluation. Please refer to the best practices guide to try it out.
March 25, 2025 Our new open-source project DiffSynth-Engine is now open-sourced! Focused on stable model deployment, targeting industry, providing better engineering support, higher computational performance, and more stable features.
March 31, 2025 We support InfiniteYou, a face feature preservation method for FLUX. More details can be found in ./examples/InfiniteYou/.
March 13, 2025 We support HunyuanVideo-I2V, the image-to-video generation version of Tencent's open-source HunyuanVideo. More details can be found in ./examples/HunyuanVideo/.
February 25, 2025 We support Wan-Video, a series of state-of-the-art video synthesis models open-sourced by Alibaba. See ./examples/wanvideo/.
February 17, 2025 We support StepVideo! Advanced video synthesis model! See ./examples/stepvideo.
December 31, 2024 We propose EliGen, a new framework for entity-level controlled text-to-image generation, supplemented with an inpainting fusion pipeline, extending its capabilities to image inpainting tasks. EliGen can seamlessly integrate existing community models such as IP-Adapter and In-Context LoRA, enhancing their versatility. For more details, see ./examples/EntityControl.
- Paper: EliGen: Entity-Level Controlled Image Generation with Regional Attention
- Model: ModelScope, HuggingFace
- Online Experience: ModelScope EliGen Studio
- Training Dataset: EliGen Train Set
December 19, 2024 We implemented advanced VRAM management for HunyuanVideo, enabling video generation with resolutions of 129x720x1280 on 24GB VRAM or 129x512x384 on just 6GB VRAM. More details can be found in ./examples/HunyuanVideo/.
December 18, 2024 We propose ArtAug, a method to improve text-to-image models through synthesis-understanding interaction. We trained an ArtAug enhancement module for FLUX.1-dev in LoRA format. This model incorporates the aesthetic understanding of Qwen2-VL-72B into FLUX.1-dev, thereby improving the quality of generated images.
- Paper: https://arxiv.org/abs/2412.12888
- Example: https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/ArtAug
- Model: ModelScope, HuggingFace
- Demo: ModelScope, HuggingFace (coming soon)
October 25, 2024 We provide extensive FLUX ControlNet support. This project supports many different ControlNet models and can be freely combined, even if their structures are different. Additionally, ControlNet models are compatible with high-resolution optimization and partition control technologies, enabling very powerful controllable image generation. See ./examples/ControlNet/.
October 8, 2024 We released extended LoRAs based on CogVideoX-5B and ExVideo. You can download this model from ModelScope or HuggingFace.
August 22, 2024 This project now supports CogVideoX-5B. See here. We provide several interesting features for this text-to-video model, including:
- Text-to-video
- Video editing
- Self super-resolution
- Video interpolation
August 22, 2024 We implemented an interesting brush feature that supports all text-to-image models. Now you can create stunning images with the assistance of AI using the brush!
- Use it in our WebUI.
August 21, 2024 DiffSynth-Studio now supports FLUX.
- Enable CFG and high-resolution inpainting to improve visual quality. See here
- LoRA, ControlNet, and other addon models will be released soon.
June 21, 2024 We propose ExVideo, a post-training fine-tuning technique aimed at enhancing the capabilities of video generation models. We extended Stable Video Diffusion to achieve long video generation of up to 128 frames.
- Project Page
- Source code has been released in this repository. See examples/ExVideo.
- Model has been released at HuggingFace and ModelScope.
- Technical report has been released at arXiv.
- You can try ExVideo in this demo!
June 13, 2024 DiffSynth Studio has migrated to ModelScope. The development team has also transitioned from "me" to "us". Of course, I will still participate in subsequent development and maintenance work.
January 29, 2024 We propose Diffutoon, an excellent cartoon coloring solution.
- Project Page
- Source code has been released in this project.
- Technical report (IJCAI 2024) has been released at arXiv.
December 8, 2023 We decided to initiate a new project aimed at unleashing the potential of diffusion models, especially in video synthesis. The development work of this project officially began.
November 15, 2023 We propose FastBlend, a powerful video deflickering algorithm.
- sd-webui extension has been released at GitHub.
- Demonstration videos have been showcased on Bilibili, including three tasks:
  * Video Deflickering
  * Video Interpolation
  * Image-Driven Video Rendering
- Technical report has been released at arXiv.
- Unofficial ComfyUI extensions developed by other users have been released at GitHub.
October 1, 2023 We released an early version of the project named FastSDXL. This was an initial attempt to build a diffusion engine.
- Source code has been released at GitHub.
- FastSDXL includes a trainable OLSS scheduler to improve efficiency.
  * The original repository of OLSS is located here.
  * Technical report (CIKM 2023) has been released at arXiv.
  * Demonstration video has been released at Bilibili.
  * Since OLSS requires additional training, we did not implement it in this project.
August 29, 2023 We propose DiffSynth, a video synthesis framework.
- Project Page.
- Source code has been released at EasyNLP.
- Technical report (ECML PKDD 2024) has been released at arXiv.

Installation

Install from source (recommended):

git clone https://github.com/modelscope/DiffSynth-Studio.git  
cd DiffSynth-Studio
pip install -e .

For more installation methods and instructions for non-NVIDIA GPUs, please refer to the Installation Guide.
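
After installing, a quick way to confirm that the package is visible to Python is to look it up without importing it. The sketch below is only a convenience check and is not part of DiffSynth-Studio; the module name `diffsynth` is taken from the import statements used later in this README, and the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# `diffsynth` is the package name used by the import statements in this README.
print("diffsynth installed:", is_installed("diffsynth"))
```

If this prints `False` after `pip install -e .`, the most common cause is that a different Python environment is active than the one used for installation.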

Basic Framework

DiffSynth-Studio redesigns the inference and training pipelines for mainstream Diffusion models (including FLUX, Wan, etc.), enabling efficient memory management and flexible model training.

Environment Variable Configuration

Before running model inference or training, you can configure settings such as the model download source via environment variables.

By default, this project downloads models from ModelScope. Users outside China can configure it to download models from the ModelScope international site as follows:

import os
os.environ["MODELSCOPE_DOMAIN"] = "www.modelscope.ai"

To download models from other sources, please modify the environment variable DIFFSYNTH_DOWNLOAD_SOURCE.
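
Because these settings are read from the environment, they most likely need to be in place before the first `diffsynth` import. The sketch below wraps that in a small helper. `MODELSCOPE_DOMAIN` and its value come from the text above; the helper name `configure_download` is hypothetical, and this README does not list the values accepted by `DIFFSYNTH_DOWNLOAD_SOURCE`, so look them up in the documentation before passing `source`.

```python
import os
from typing import Optional


def configure_download(domain: Optional[str] = None, source: Optional[str] = None) -> dict:
    """Set download-related environment variables and return the ones that were set."""
    applied = {}
    if domain is not None:
        # Documented above: selects the ModelScope site to download from.
        applied["MODELSCOPE_DOMAIN"] = domain
    if source is not None:
        # The variable name is from this README; its accepted values are not listed here.
        applied["DIFFSYNTH_DOWNLOAD_SOURCE"] = source
    os.environ.update(applied)
    return applied


# Call this before importing diffsynth so the settings are visible at import time.
settings = configure_download(domain="www.modelscope.ai")
print(settings)
```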

Image Synthesis

Z-Image: /docs/en/Model_Details/Z-Image.md

Quick Start

Running the following code will quickly load the Tongyi-MAI/Z-Image-Turbo model for inference. FP8 quantization significantly degrades image quality, so we do not recommend enabling any quantization for the Z-Image Turbo model. CPU offloading is recommended, and the model can run with as little as 8 GB of GPU memory.

from diffsynth.pipelines.z_image import ZImagePipeline, ModelConfig
import torch

# Weights rest on the CPU in bfloat16; they are prepared and computed on the GPU.
vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = ZImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="transformer/*.safetensors", **vram_config),
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
        ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Tongyi-MAI/Z-Image-Turbo", origin_file_pattern="tokenizer/"),
    # Total device memory in GiB, minus 0.5 GiB of headroom.
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Young Chinese woman in red Hanfu, intricate embroidery. Impeccable makeup, red floral forehead pattern. Elaborate high bun, golden phoenix headdress, red flowers, beads. Holds round folding fan with lady, trees, bird. Neon lightning-bolt lamp (⚡️), bright yellow glow, above extended left palm. Soft-lit outdoor night background, silhouetted tiered pagoda (西安大雁塔), blurred colorful distant lights."
image = pipe(prompt=prompt, seed=42, rand_device="cuda")
image.save("image.jpg")
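
The `vram_limit` expression in the code above converts a byte count into GiB and subtracts a small reserve. `torch.cuda.mem_get_info` returns a `(free, total)` pair in bytes, so index `[1]` is the total device memory. Below is a minimal sketch of the same arithmetic that runs without a GPU; the function name `vram_limit_gib` and the 24 GiB card are illustrative assumptions.

```python
BYTES_PER_GIB = 1024 ** 3


def vram_limit_gib(total_bytes: int, headroom_gib: float = 0.5) -> float:
    """Convert total device memory in bytes to GiB and leave some headroom."""
    return total_bytes / BYTES_PER_GIB - headroom_gib


# A hypothetical GPU with exactly 24 GiB of memory.
print(vram_limit_gib(24 * BYTES_PER_GIB))  # 23.5
```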

Examples

Example code for Z-Image is available at: /examples/z_image/

Model ID	Inference	Low VRAM Inference	Full Training	Validation After Full Training	LoRA Training	Validation After LoRA Training
Tongyi-MAI/Z-Image	code	code	code	code	code	code
DiffSynth-Studio/Z-Image-i2L	code	code	-	-	-	-
Tongyi-MAI/Z-Image-Turbo	code	code	code	code	code	code
PAI/Z-Image-Turbo-Fun-Controlnet-Union-2.1	code	code	code	code	code	code
PAI/Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps	code	code	code	code	code	code
PAI/Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps	code	code	code	code	code	code
DiffSynth-Studio/ZImage-i2L-v2	code	code	code	code	-	-

Stable Diffusion: /docs/en/Model_Details/Stable-Diffusion.md

Quick Start

Running the following code will quickly load the AI-ModelScope/stable-diffusion-v1-5 model for inference. VRAM management is enabled, and the framework automatically controls parameter loading based on available VRAM. The model requires a minimum of 2 GB of VRAM.

import torch
from diffsynth.core import ModelConfig
from diffsynth.pipelines.stable_diffusion import StableDiffusionPipeline

vram_config = {
    "offload_dtype": torch.float32,
    "offload_device": "cpu",
    "onload_dtype": torch.float32,
    "onload_device": "cpu",
    "preparing_dtype": torch.float32,
    "preparing_device": "cuda",
    "computation_dtype": torch.float32,
    "computation_device": "cuda",
}
pipe = StableDiffusionPipeline.from_pretrained(
    torch_dtype=torch.float32,
    model_configs=[
        ModelConfig(model_id="AI-ModelScope/stable-diffusion-v1-5", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="AI-ModelScope/stable-diffusion-v1-5", origin_file_pattern="unet/diffusion_pytorch_model.safetensors", **vram_config),
        ModelConfig(model_id="AI-ModelScope/stable-diffusion-v1-5", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="AI-ModelScope/stable-diffusion-v1-5", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars, high quality, detailed",
    negative_prompt="blurry, low quality, deformed",
    cfg_scale=7.5,
    height=512,
    width=512,
    seed=42,
    rand_device="cuda",
    num_inference_steps=50,
)
image.save("image.jpg")
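
The `vram_config` dictionaries in these examples all share the same eight keys: a dtype and a device for each of four stages (`offload`, `onload`, `preparing`, `computation`). A small helper can make that pattern explicit. This is a sketch, not a DiffSynth-Studio API: `make_vram_config` is a hypothetical name, and strings stand in for `torch` dtypes so the snippet runs without PyTorch.

```python
STAGES = ("offload", "onload", "preparing", "computation")


def make_vram_config(**stages):
    """Build a vram_config dict from stage=(dtype, device) pairs."""
    missing = [s for s in STAGES if s not in stages]
    if missing:
        raise ValueError(f"missing stages: {missing}")
    config = {}
    for stage in STAGES:
        dtype, device = stages[stage]
        config[f"{stage}_dtype"] = dtype
        config[f"{stage}_device"] = device
    return config


# Mirrors the Stable Diffusion example: weights rest on the CPU and are computed on the GPU.
# In real code, pass torch.float32 instead of the placeholder string "float32".
config = make_vram_config(
    offload=("float32", "cpu"),
    onload=("float32", "cpu"),
    preparing=("float32", "cuda"),
    computation=("float32", "cuda"),
)
print(list(config))
```

The resulting dictionary can be unpacked into each `ModelConfig` with `**config`, exactly as the examples do with their hand-written dictionaries.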

Examples

Example code for Stable Diffusion is available at: /examples/stable_diffusion/

Model ID	Inference	Low VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
AI-ModelScope/stable-diffusion-v1-5	code	code	code	code	code	code

Stable Diffusion XL: /docs/en/Model_Details/Stable-Diffusion-XL.md

Quick Start

Running the following code will quickly load the stabilityai/stable-diffusion-xl-base-1.0 model for inference. VRAM management is enabled, and the framework automatically controls parameter loading based on available VRAM. The model requires a minimum of 6 GB of VRAM.

import torch
from diffsynth.core import ModelConfig
from diffsynth.pipelines.stable_diffusion_xl import StableDiffusionXLPipeline

vram_config = {
    "offload_dtype": torch.float32,
    "offload_device": "cpu",
    "onload_dtype": torch.float32,
    "onload_device": "cpu",
    "preparing_dtype": torch.float32,
    "preparing_device": "cuda",
    "computation_dtype": torch.float32,
    "computation_device": "cuda",
}
pipe = StableDiffusionXLPipeline.from_pretrained(
    torch_dtype=torch.float32,
    model_configs=[
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="text_encoder_2/model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="unet/diffusion_pytorch_model.safetensors", **vram_config),
        ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="tokenizer/"),
    tokenizer_2_config=ModelConfig(model_id="stabilityai/stable-diffusion-xl-base-1.0", origin_file_pattern="tokenizer_2/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

image = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    negative_prompt="",
    cfg_scale=5.0,
    height=1024,
    width=1024,
    seed=42,
    num_inference_steps=50,
)
image.save("image.jpg")

Examples

Example code for Stable Diffusion XL is available at: /examples/stable_diffusion_xl/

Model ID	Inference	Low VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
stabilityai/stable-diffusion-xl-base-1.0	code	code	code	code	code	code

FLUX.2: /docs/en/Model_Details/FLUX2.md

Quick Start

Running the following code will quickly load the black-forest-labs/FLUX.2-dev model for inference. VRAM management is enabled, and the framework automatically loads model parameters based on available GPU memory. The model can run with as little as 10 GB of VRAM.

from diffsynth.pipelines.flux2_image import Flux2ImagePipeline, ModelConfig
import torch

# Weights are offloaded to disk, held in FP8 while loaded, and computed in bfloat16.
vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = Flux2ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="text_encoder/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="transformer/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="black-forest-labs/FLUX.2-dev", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "High resolution. A dreamy underwater portrait of a serene young woman in a flowing blue dress. Her hair floats softly around her face, strands delicately suspended in the water. Clear, shimmering light filters through, casting gentle highlights, while tiny bubbles rise around her. Her expression is calm, her features finely detailed—creating a tranquil, ethereal scene."
image = pipe(prompt, seed=42, rand_device="cuda", num_inference_steps=50)
image.save("image.jpg")
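
According to the dictionary above, the `onload` and `preparing` stages hold weights in `torch.float8_e4m3fn`, while the `computation` stage uses `torch.bfloat16`. FP8 takes 1 byte per parameter and bfloat16 takes 2, so the stored copy is half the size of the compute copy. Here is a rough sketch of that arithmetic; the parameter count is a made-up round number, not the size of any model listed here.

```python
BYTES_PER_PARAM = {"float8_e4m3fn": 1, "bfloat16": 2, "float32": 4}


def weight_gib(num_params: int, dtype: str) -> float:
    """Approximate size of the raw weights in GiB, ignoring activations and overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


hypothetical_params = 8 * 1024 ** 3  # illustrative only
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gib(hypothetical_params, dtype):.1f} GiB")
```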

Examples

Example code for FLUX.2 is available at: /examples/flux2/

Model ID	Inference	Low-VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
black-forest-labs/FLUX.2-dev	code	code	-	-	code	code
black-forest-labs/FLUX.2-klein-4B	code	code	code	code	code	code
black-forest-labs/FLUX.2-klein-9B	code	code	code	code	code	code
black-forest-labs/FLUX.2-klein-base-4B	code	code	code	code	code	code
black-forest-labs/FLUX.2-klein-base-9B	code	code	code	code	code	code
DiffSynth-Studio/Template-KleinBase4B-Aesthetic	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-Brightness	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-Age	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-ControlNet	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-Edit	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-Inpaint	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-PandaMeme	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-Sharpness	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-SoftRGB	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-Upscaler	code	code	code	code	-	-
DiffSynth-Studio/Template-KleinBase4B-ContentRef	code	code	code	code	-	-
DiffSynth-Studio/KleinBase4B-i2L-v2	code	code	code	code	-	-

Anima: /docs/en/Model_Details/Anima.md

Quick Start

Run the following code to quickly load the circlestone-labs/Anima model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 8GB VRAM.

from diffsynth.pipelines.anima_image import AnimaImagePipeline, ModelConfig
import torch

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": "disk",
    "onload_device": "disk",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = AnimaImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/diffusion_models/anima-preview.safetensors", **vram_config),
        ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/text_encoders/qwen_3_06b_base.safetensors", **vram_config),
        ModelConfig(model_id="circlestone-labs/Anima", origin_file_pattern="split_files/vae/qwen_image_vae.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen3-0.6B", origin_file_pattern="./"),
    tokenizer_t5xxl_config=ModelConfig(model_id="stabilityai/stable-diffusion-3.5-large", origin_file_pattern="tokenizer_3/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
prompt = "Masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait."
negative_prompt = "worst quality, low quality, monochrome, zombie, interlocked fingers, Aissist, cleavage, nsfw,"
# Pass the negative prompt defined above; the original snippet defined it but never used it.
image = pipe(prompt, negative_prompt=negative_prompt, seed=0, num_inference_steps=50)
image.save("image.jpg")

Examples

Example code for Anima is located at: /examples/anima/

Model ID	Inference	Low VRAM Inference	Full Training	Validation after Full Training	LoRA Training	Validation after LoRA Training
circlestone-labs/Anima	code	code	code	code	code	code

Qwen-Image: /docs/en/Model_Details/Qwen-Image.md

Quick Start

Running the following code will quickly load the Qwen/Qwen-Image model for inference. VRAM management is enabled, and the framework automatically adjusts model parameter loading based on available GPU memory. The model can run with as little as 8 GB of VRAM.

from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
# Prompt in English: "Exquisite portrait of an underwater girl, flowing blue dress, gently drifting hair,
# clear light and shadow, surrounded by bubbles, serene face, fine details, dreamy and beautiful."
prompt = "精致肖像，水下少女，蓝裙飘逸，发丝轻扬，光影透澈，气泡环绕，面容恬静，细节精致，梦幻唯美。"
image = pipe(prompt, seed=0, num_inference_steps=40)
image.save("image.jpg")
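The `origin_file_pattern` values in these snippets look like glob patterns: a `*` stands in for the shard suffix of a multi-file checkpoint. The sketch below illustrates that reading with Python's standard `fnmatch` module. It assumes shell-style glob semantics, which this section does not confirm for DiffSynth-Studio, and the file listing and the `select_files` helper are hypothetical.

```python
from fnmatch import fnmatchcase


def select_files(repo_files, pattern):
    """Return the files whose repo-relative path matches a glob-style pattern."""
    return [path for path in repo_files if fnmatchcase(path, pattern)]


# Hypothetical file listing for illustration; not the actual repository contents.
repo_files = [
    "transformer/diffusion_pytorch_model-00001-of-00002.safetensors",
    "transformer/diffusion_pytorch_model-00002-of-00002.safetensors",
    "transformer/diffusion_pytorch_model.safetensors.index.json",
    "vae/diffusion_pytorch_model.safetensors",
    "tokenizer/vocab.json",
]

transformer_files = select_files(repo_files, "transformer/diffusion_pytorch_model*.safetensors")
vae_files = select_files(repo_files, "vae/diffusion_pytorch_model.safetensors")
print(transformer_files)  # both shards, but not the .index.json file
print(vae_files)          # the single VAE file
```

Under this reading, a pattern without a wildcard selects exactly one file, while a wildcard pattern collects every shard of a split checkpoint.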

Model Lineage

graph LR;
    Qwen/Qwen-Image-->Qwen/Qwen-Image-Edit;
    Qwen/Qwen-Image-Edit-->Qwen/Qwen-Image-Edit-2509;
    Qwen/Qwen-Image-->EliGen-Series;
    EliGen-Series-->DiffSynth-Studio/Qwen-Image-EliGen;
    DiffSynth-Studio/Qwen-Image-EliGen-->DiffSynth-Studio/Qwen-Image-EliGen-V2;
    EliGen-Series-->DiffSynth-Studio/Qwen-Image-EliGen-Poster;
    Qwen/Qwen-Image-->Distill-Series;
    Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-Full;
    Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-LoRA;
    Qwen/Qwen-Image-->ControlNet-Series;
    ControlNet-Series-->Blockwise-ControlNet-Series;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth;
    Blockwise-ControlNet-Series-->DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint;
    ControlNet-Series-->DiffSynth-Studio/Qwen-Image-In-Context-Control-Union;
    Qwen/Qwen-Image-->DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix;
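The lineage graph is written as `parent-->child` edges separated by semicolons, so it can be read programmatically. The following sketch parses a short excerpt of the graph into an adjacency map and lists everything derived from a given model. The helper names `parse_edges` and `descendants` are illustrative and not part of DiffSynth-Studio.

```python
from collections import defaultdict


def parse_edges(graph_text):
    """Parse 'A-->B;' statements into a parent -> [children] mapping."""
    edges = defaultdict(list)
    for statement in graph_text.split(";"):
        statement = statement.strip()
        if "-->" not in statement:
            continue  # skips the 'graph LR' header and empty fragments
        parent, child = (part.strip() for part in statement.split("-->", 1))
        edges[parent].append(child)
    return edges


def descendants(edges, root):
    """Collect every node reachable from root by following edges."""
    seen, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in edges.get(node, []):
            if child not in seen:
                seen.append(child)
                stack.append(child)
    return seen


# Excerpt of the lineage graph from this section.
graph_text = (
    "graph LR; Qwen/Qwen-Image-->Qwen/Qwen-Image-Edit; "
    "Qwen/Qwen-Image-Edit-->Qwen/Qwen-Image-Edit-2509; "
    "Qwen/Qwen-Image-->Distill-Series; "
    "Distill-Series-->DiffSynth-Studio/Qwen-Image-Distill-Full;"
)
edges = parse_edges(graph_text)
print(sorted(descendants(edges, "Qwen/Qwen-Image")))
```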

Examples

Example code for Qwen-Image is available at: /examples/qwen_image/

Model ID	Inference	Low-VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
Qwen/Qwen-Image	code	code	code	code	code	code
Qwen/Qwen-Image-2512	code	code	code	code	code	code
Qwen/Qwen-Image-Edit	code	code	code	code	code	code
Qwen/Qwen-Image-Edit-2509	code	code	code	code	code	code
Qwen/Qwen-Image-Edit-2511	code	code	code	code	code	code
FireRedTeam/FireRed-Image-Edit-1.0	code	code	code	code	code	code
FireRedTeam/FireRed-Image-Edit-1.1	code	code	code	code	code	code
lightx2v/Qwen-Image-Edit-2511-Lightning	code	code	-	-	-	-
Qwen/Qwen-Image-Layered	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-Layered-Control	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-Layered-Control-V2	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-EliGen	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-EliGen-V2	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-EliGen-Poster	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-Distill-Full	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-Distill-LoRA	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Canny	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Depth	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-Blockwise-ControlNet-Inpaint	code	code	code	code	code	code
DiffSynth-Studio/Qwen-Image-In-Context-Control-Union	code	code	-	-	code	code
DiffSynth-Studio/Qwen-Image-Edit-Lowres-Fix	code	code	-	-	-	-
DiffSynth-Studio/Qwen-Image-i2L	code	code	-	-	-	-

FLUX.1: /docs/en/Model_Details/FLUX.md

Quick Start

Running the following code will quickly load the black-forest-labs/FLUX.1-dev model for inference. VRAM management is enabled, and the framework automatically adjusts model parameter loading based on available GPU memory. The model can run with as little as 8 GB of VRAM.

import torch
from diffsynth.pipelines.flux_image import FluxImagePipeline, ModelConfig

vram_config = {
    "offload_dtype": torch.float8_e4m3fn,
    "offload_device": "cpu",
    "onload_dtype": torch.float8_e4m3fn,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e4m3fn,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = FluxImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="flux1-dev.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder/model.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="text_encoder_2/*.safetensors", **vram_config),
        ModelConfig(model_id="black-forest-labs/FLUX.1-dev", origin_file_pattern="ae.safetensors", **vram_config),
    ],
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 1,
)
prompt = "CG, masterpiece, best quality, solo, long hair, wavy hair, silver hair, blue eyes, blue dress, medium breasts, dress, underwater, air bubble, floating hair, refraction, portrait. The girl's flowing silver hair shimmers with every color of the rainbow and cascades down, merging with the floating flora around her."
image = pipe(prompt=prompt, seed=0)
image.save("image.jpg")
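The `vram_config` dictionaries in these Quick Start snippets all share the same eight keys: a dtype and a device for each of four stages named `offload`, `onload`, `preparing`, and `computation`. The sketch below builds such a dictionary from four `(dtype, device)` pairs, which makes it harder to forget a key. The helper `make_vram_config` is hypothetical, and plain strings stand in for `torch` dtypes so that the example runs without PyTorch; the exact meaning of each stage is defined by the framework and is not described in this section.

```python
STAGES = ("offload", "onload", "preparing", "computation")


def make_vram_config(offload, onload, preparing, computation):
    """Build the eight-key config from one (dtype, device) pair per stage."""
    pairs = dict(zip(STAGES, (offload, onload, preparing, computation)))
    config = {}
    for stage in STAGES:
        dtype, device = pairs[stage]
        config[f"{stage}_dtype"] = dtype
        config[f"{stage}_device"] = device
    return config


# Mirrors the FLUX.1 settings; strings replace torch.float8_e4m3fn / torch.bfloat16.
flux_like = make_vram_config(
    offload=("float8_e4m3fn", "cpu"),
    onload=("float8_e4m3fn", "cpu"),
    preparing=("float8_e4m3fn", "cuda"),
    computation=("bfloat16", "cuda"),
)
for key, value in flux_like.items():
    print(f"{key}: {value}")
```

With real `torch` dtypes in place of the strings, the resulting dictionary could be unpacked with `**` into each `ModelConfig`, as the snippets in this section do.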

Model Lineage

graph LR;
    FLUX.1-Series-->black-forest-labs/FLUX.1-dev;
    FLUX.1-Series-->black-forest-labs/FLUX.1-Krea-dev;
    FLUX.1-Series-->black-forest-labs/FLUX.1-Kontext-dev;
    black-forest-labs/FLUX.1-dev-->FLUX.1-dev-ControlNet-Series;
    FLUX.1-dev-ControlNet-Series-->alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta;
    FLUX.1-dev-ControlNet-Series-->InstantX/FLUX.1-dev-Controlnet-Union-alpha;
    FLUX.1-dev-ControlNet-Series-->jasperai/Flux.1-dev-Controlnet-Upscaler;
    black-forest-labs/FLUX.1-dev-->InstantX/FLUX.1-dev-IP-Adapter;
    black-forest-labs/FLUX.1-dev-->ByteDance/InfiniteYou;
    black-forest-labs/FLUX.1-dev-->DiffSynth-Studio/Eligen;
    black-forest-labs/FLUX.1-dev-->DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev;
    black-forest-labs/FLUX.1-dev-->DiffSynth-Studio/LoRAFusion-preview-FLUX.1-dev;
    black-forest-labs/FLUX.1-dev-->ostris/Flex.2-preview;
    black-forest-labs/FLUX.1-dev-->stepfun-ai/Step1X-Edit;
    Qwen/Qwen2.5-VL-7B-Instruct-->stepfun-ai/Step1X-Edit;
    black-forest-labs/FLUX.1-dev-->DiffSynth-Studio/Nexus-GenV2;
    Qwen/Qwen2.5-VL-7B-Instruct-->DiffSynth-Studio/Nexus-GenV2;

Examples

Example code for FLUX.1 is available at: /examples/flux/

Model ID	Extra Args	Inference	Low-VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
black-forest-labs/FLUX.1-dev	-	code	code	code	code	code	code
black-forest-labs/FLUX.1-Krea-dev	-	code	code	code	code	code	code
black-forest-labs/FLUX.1-Kontext-dev	kontext_images	code	code	code	code	code	code
alimama-creative/FLUX.1-dev-Controlnet-Inpainting-Beta	controlnet_inputs	code	code	code	code	code	code
InstantX/FLUX.1-dev-Controlnet-Union-alpha	controlnet_inputs	code	code	code	code	code	code
jasperai/Flux.1-dev-Controlnet-Upscaler	controlnet_inputs	code	code	code	code	code	code
InstantX/FLUX.1-dev-IP-Adapter	ipadapter_images, ipadapter_scale	code	code	code	code	code	code
ByteDance/InfiniteYou	infinityou_id_image, infinityou_guidance, controlnet_inputs	code	code	code	code	code	code
DiffSynth-Studio/Eligen	eligen_entity_prompts, eligen_entity_masks, eligen_enable_on_negative, eligen_enable_inpaint	code	code	-	-	code	code
DiffSynth-Studio/LoRA-Encoder-FLUX.1-Dev	lora_encoder_inputs, lora_encoder_scale	code	code	code	code	-	-
DiffSynth-Studio/LoRAFusion-preview-FLUX.1-dev	-	code	-	-	-	-	-
stepfun-ai/Step1X-Edit	step1x_reference_image	code	code	code	code	code	code
ostris/Flex.2-preview	flex_inpaint_image, flex_inpaint_mask, flex_control_image, flex_control_strength, flex_control_stop	code	code	code	code	code	code
DiffSynth-Studio/Nexus-GenV2	nexus_gen_reference_image	code	code	code	code	code	code

ERNIE-Image: /docs/en/Model_Details/ERNIE-Image.md

Quick Start

Running the following code will quickly load the PaddlePaddle/ERNIE-Image model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 3GB VRAM.

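from diffsynth.pipelines.ernie_image import ErnieImagePipeline, ModelConfig
import torch

# `pipe` is an ErnieImagePipeline created with ErnieImagePipeline.from_pretrained(...);
# its construction is not included in this snippet (see /examples/ernie_image/).
image = pipe(
    prompt="一只黑白相间的中华田园犬",  # "A black-and-white Chinese rural dog"
    negative_prompt="",
    height=1024,
    width=1024,
    seed=42,
    num_inference_steps=50,
    cfg_scale=4.0,
)
image.save("output.jpg")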

Examples

Example code for ERNIE-Image is available at: /examples/ernie_image/

Model ID	Inference	Low VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
PaddlePaddle/ERNIE-Image	code	code	code	code	code	code
PaddlePaddle/ERNIE-Image-Turbo	code	code	-	-	-	-

JoyAI-Image: /docs/en/Model_Details/JoyAI-Image.md

Quick Start

Running the following code will quickly load the jd-opensource/JoyAI-Image-Edit model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 4GB VRAM.

from diffsynth.pipelines.joyai_image import JoyAIImagePipeline, ModelConfig
import torch
from PIL import Image
from modelscope import dataset_snapshot_download

# Download dataset
dataset_snapshot_download(
    dataset_id="DiffSynth-Studio/diffsynth_example_dataset",
    local_dir="data/diffsynth_example_dataset",
    allow_file_pattern="joyai_image/JoyAI-Image-Edit/*"
)

# `vram_config` is a dict of offload/onload/preparing/computation dtype and device settings,
# in the same form as in the other Quick Start sections; its definition is not included in this snippet.
pipe = JoyAIImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="transformer/transformer.pth", **vram_config),
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="JoyAI-Image-Und/model.safetensors", **vram_config),
        ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="vae/Wan2.1_VAE.pth", **vram_config),
    ],
    processor_config=ModelConfig(model_id="jd-opensource/JoyAI-Image-Edit", origin_file_pattern="JoyAI-Image-Und/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Use first sample from dataset
dataset_base_path = "data/diffsynth_example_dataset/joyai_image/JoyAI-Image-Edit"
prompt = "将裙子改为粉色"  # "Change the dress to pink"
edit_image = Image.open(f"{dataset_base_path}/edit/image1.jpg").convert("RGB")

output = pipe(
    prompt=prompt,
    edit_image=edit_image,
    height=1024,
    width=1024,
    seed=0,
    num_inference_steps=30,
    cfg_scale=5.0,
)

output.save("output_joyai_edit_low_vram.png")

Examples

Example code for JoyAI-Image is available at: /examples/joyai_image/

Model ID	Inference	Low VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
jd-opensource/JoyAI-Image-Edit	code	code	code	code	code	code

HiDream-O1-Image: /docs/en/Model_Details/HiDream-O1-Image.md

Quick Start

Running the following code will quickly load the HiDream-ai/HiDream-O1-Image model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 3GB VRAM.

from diffsynth.pipelines.hidream_o1_image import HiDreamO1ImagePipeline
from diffsynth.core.loader.config import ModelConfig
import torch

# `vram_config` is a dict of offload/onload/preparing/computation dtype and device settings,
# in the same form as in the other Quick Start sections; its definition is not included in this snippet.
pipe = HiDreamO1ImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="model-*.safetensors", **vram_config),
    ],
    processor_config=ModelConfig(model_id="HiDream-ai/HiDream-O1-Image", origin_file_pattern="./"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)
image = pipe(
    prompt="medium shot, eye-level, front view. A woman is seated in an ornate bedroom, illuminated by candlelight, with a calm and composed expression. The subject is a young woman with fair skin, light brown hair styled in an updo with loose tendrils framing her face, and blue eyes. She wears a cream-colored satin robe with delicate floral embroidery and lace trim along the neckline. Her ears are adorned with pearl drop earrings. She is seated on a bed with a dark, intricately carved wooden headboard. To her left, a wooden nightstand holds three lit white candles and a candelabra with multiple lit candles in the background. The bed is covered with patterned pillows and a dark, textured blanket. The walls are paneled with dark wood and feature a large, ornate tapestry with muted earth tones. The lighting creates soft highlights on her face and robe, with warm shadows cast across the room.",
    negative_prompt=" ",
    cfg_scale=4.0,
    height=2048,
    width=2048,
    seed=42,
    num_inference_steps=50,
)
image.save("image.jpg")

Examples

Example code for HiDream-O1-Image is available at: /examples/hidream_o1_image/

Model ID	Inference	Low VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
HiDream-ai/HiDream-O1-Image	code	code	code	code	code	code
HiDream-ai/HiDream-O1-Image-Dev	code	code	code	code	code	code
DiffSynth-Studio/HidreamO1-i2L-v2	code	code	code	code	-	-

Ideogram 4: /docs/en/Model_Details/Ideogram-4.md

Quick Start

Running the following code will quickly load the ideogram-ai/ideogram-4-fp8 model and perform inference. The model can run with a minimum of 24GB VRAM.

from diffsynth.pipelines.ideogram4 import Ideogram4Pipeline
from diffsynth.core import ModelConfig
import torch

pipe = Ideogram4Pipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="transformer/diffusion_pytorch_model.safetensors"),
        # unconditional_transformer is optional. You can delete this line to reduce VRAM required.
        ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="unconditional_transformer/diffusion_pytorch_model.safetensors"),
        ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="text_encoder/model.safetensors"),
        ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="ideogram-ai/ideogram-4-fp8", origin_file_pattern="tokenizer/"),
)
prompt = r"""
{
  "high_level_description": "A medium-shot photograph of Formula 1 driver Max Verstappen wearing his Red Bull Racing racing suit and cap, smiling as he holds his racing helmet and talks to a man in a white shirt and black vest at a race track.",
  "style_description": {
    "aesthetics": "saturated primary colors, rule of thirds, joyful and triumphant",
    "lighting": "overcast daylight, diffused, soft subtle shadows",
    "photo": "shallow depth of field, sharp focus, eye-level, telephoto",
    "medium": "photograph"
  },
  "compositional_deconstruction": {
    "background": "The background is an out-of-focus racing paddock or track environment. Several blurred figures are visible, including one in an orange shirt. A purple and white structure with a red 'F1' logo stands on the left. The scene is outdoors with daylight, though the sky is not visible.",
    "elements": [
      {"type": "obj", "bbox": [55, 642, 1000, 937], "desc": "An older man standing in profile, facing left toward Max Verstappen. He has grey hair and fair skin. He is wearing a white long-sleeved button-down shirt with a navy blue quilted vest over it. He has a slight smile."},
      {"type": "obj", "bbox": [34, 137, 1000, 617], "desc": "Max Verstappen, a fair-skinned male Formula 1 driver, positioned in the center. He is facing forward with a joyful expression and a slight smile. He wears a navy blue Red Bull Racing team uniform with numerous sponsor logos and a matching baseball cap with the number '1'. He is holding a white and red racing helmet in his hands. He has a silver watch on his left wrist."},
      {"type": "obj", "bbox": [422, 212, 792, 452], "desc": "Max Verstappen's racing helmet, held in front of his chest. It features a white, red, and yellow design with the Red Bull logo and the 'Player 0.0' branding. The visor is clear and open."},
      {"type": "text", "bbox": [657, 0, 755, 142], "text": "F1", "desc": "Large, stylized red logo on a black and purple background in the lower left."},
      {"type": "text", "bbox": [768, 0, 818, 147], "text": "Formula 1\nWorld Championship™", "desc": "Small white sans-serif text below the F1 logo on the left side."},
      {"type": "text", "bbox": [78, 447, 117, 510], "text": "ORACLE\nRed Bull\nRacing", "desc": "Very small white and orange logo on the front of the navy blue cap."},
      {"type": "text", "bbox": [78, 417, 120, 440], "text": "1", "desc": "Bold red numeral '1' on the front left side of the navy blue cap."},
      {"type": "text", "bbox": [332, 442, 363, 483], "text": "Red Bull", "desc": "Small yellow and red text logo on the collar of the uniform."},
      {"type": "text", "bbox": [373, 490, 423, 532], "text": "RAUCH", "desc": "Small yellow and blue logo on the right chest of the uniform."},
      {"type": "text", "bbox": [422, 473, 500, 532], "text": "BYBIT\nHONDA", "desc": "Medium-sized white sans-serif text on the right chest of the uniform."},
      {"type": "text", "bbox": [410, 203, 442, 257], "text": "RAUCH", "desc": "Small yellow logo on the left upper arm of the uniform."},
      {"type": "text", "bbox": [530, 448, 627, 510], "text": "Red Bull", "desc": "Medium red text logo on the right side of the torso, part of the Red Bull graphic."},
      {"type": "text", "bbox": [680, 417, 768, 523], "text": "Red Bull", "desc": "Large red text logo across the lower torso of the uniform."},
      {"type": "text", "bbox": [797, 475, 815, 518], "text": "MAX", "desc": "Small white text next to a Dutch flag on the belt area of the uniform."},
      {"type": "text", "bbox": [558, 317, 715, 355], "text": "Player 0.0", "desc": "Black sans-serif text on a white band on the racing helmet."},
      {"type": "text", "bbox": [560, 800, 582, 835], "text": "IA.COM", "desc": "Small blue sans-serif text on the right sleeve of the white shirt."},
      {"type": "text", "bbox": [968, 8, 997, 332], "text": "© Anadolu Agency via Getty Images", "desc": "Small white watermark text in the bottom left corner."}
    ]
  }
}
"""
image = pipe(prompt=prompt, height=1024, width=1024, num_inference_steps=48, cfg_scale=7.0, seed=42)
image.save("image_ideogram-4-fp8.jpg")
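The Ideogram 4 prompt above is a JSON document with a high-level description, a style block, and a list of `obj` and `text` elements that each carry a `bbox`. Writing such a prompt by hand is error-prone, so the sketch below assembles one with the standard `json` module. The key names are copied from the example prompt; the helper functions, the sample content, and the bounding-box values are illustrative assumptions, and this section does not document the `bbox` coordinate convention.

```python
import json


def obj_element(bbox, desc):
    """An object region: four bbox numbers plus a description."""
    return {"type": "obj", "bbox": list(bbox), "desc": desc}


def text_element(bbox, text, desc):
    """A rendered-text region: the literal text plus a description of its look."""
    return {"type": "text", "bbox": list(bbox), "text": text, "desc": desc}


def build_prompt(high_level_description, style, background, elements):
    """Serialize the structured prompt using the key names from the example."""
    return json.dumps(
        {
            "high_level_description": high_level_description,
            "style_description": style,
            "compositional_deconstruction": {
                "background": background,
                "elements": elements,
            },
        },
        ensure_ascii=False,
        indent=2,
    )


# Illustrative content only; the bbox values are made up for this sketch.
prompt = build_prompt(
    high_level_description="A photograph of a cafe storefront with a sign.",
    style={"lighting": "overcast daylight", "medium": "photograph"},
    background="A quiet street with blurred pedestrians.",
    elements=[
        obj_element([100, 200, 900, 800], "A wooden cafe door with a glass panel."),
        text_element([50, 300, 150, 700], "OPEN", "White block letters on a red sign."),
    ],
)
print(prompt)
```

Building the prompt from Python objects guarantees valid JSON, including correct escaping of quotes and newlines inside `text` fields.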

Examples

Example code for Ideogram 4 is available at: /examples/ideogram4/

Model ID	Inference	Low VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
ideogram-ai/ideogram-4-fp8	code	-	-	-	-	-
DiffSynth-Studio/ideogram-4-bf16-repackage	code	code	code	-	code	code

Video Synthesis

LTX-2: /docs/en/Model_Details/LTX-2.md

Quick Start

Running the following code will quickly load the Lightricks/LTX-2 model for inference. VRAM management is enabled, and the framework automatically adjusts model parameter loading based on available GPU memory. The model can run with as little as 8GB of VRAM.

import torch
from diffsynth.pipelines.ltx2_audio_video import LTX2AudioVideoPipeline, ModelConfig
from diffsynth.utils.data.media_io_ltx2 import write_video_audio_ltx2

vram_config = {
    "offload_dtype": torch.float8_e5m2,
    "offload_device": "cpu",
    "onload_dtype": torch.float8_e5m2,
    "onload_device": "cpu",
    "preparing_dtype": torch.float8_e5m2,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
"""
Official model repo: https://www.modelscope.cn/models/Lightricks/LTX-2
Repackaged model repo: https://www.modelscope.cn/models/DiffSynth-Studio/LTX-2-Repackage
For base models of LTX-2, the official checkpoint (with model config ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-dev.safetensors"))
and the repackaged checkpoints (with model config ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="*.safetensors")) are both supported.
We have repackaged the official checkpoints in the DiffSynth-Studio/LTX-2-Repackage repo to support separate loading of different submodules
and to avoid redundant memory usage when users only want to use part of the model.
"""
# Use the repackaged model configs from "DiffSynth-Studio/LTX-2-Repackage" to avoid redundant model loading
pipe = LTX2AudioVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="transformer.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="text_encoder_post_modules.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vae_decoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="audio_vocoder.safetensors", **vram_config),
        ModelConfig(model_id="DiffSynth-Studio/LTX-2-Repackage", origin_file_pattern="video_vae_encoder.safetensors", **vram_config),
        ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
    stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Use the following model configs instead to initialize the model from the official checkpoints in "Lightricks/LTX-2"
# pipe = LTX2AudioVideoPipeline.from_pretrained(
#     torch_dtype=torch.bfloat16,
#     device="cuda",
#     model_configs=[
#         ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized", origin_file_pattern="model-*.safetensors", **vram_config),
#         ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-dev.safetensors", **vram_config),
#         ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-spatial-upscaler-x2-1.0.safetensors", **vram_config),
#     ],
#     tokenizer_config=ModelConfig(model_id="google/gemma-3-12b-it-qat-q4_0-unquantized"),
#     stage2_lora_config=ModelConfig(model_id="Lightricks/LTX-2", origin_file_pattern="ltx-2-19b-distilled-lora-384.safetensors"),
#     vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
# )

prompt = "A girl is very happy, she is speaking: \"I enjoy working with Diffsynth-Studio, it's a perfect framework.\""
negative_prompt = (
    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
)
height, width, num_frames = 512 * 2, 768 * 2, 121
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    seed=43,
    height=height,
    width=width,
    num_frames=num_frames,
    tiled=True,
    use_two_stage_pipeline=True,
)
write_video_audio_ltx2(
    video=video,
    audio=audio,
    output_path='ltx2_twostage.mp4',
    fps=24,
    audio_sample_rate=24000,
)
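The LTX-2 call requests 121 frames at 1024 x 1536 and writes them at 24 fps with a 24000 Hz audio track. The short sketch below works out what those numbers imply for clip length and the nominal audio sample count. The helper names are illustrative, and the sample count is plain arithmetic rather than a statement about what the pipeline actually returns.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip with num_frames frames at fps frames per second."""
    return num_frames / fps


def nominal_audio_samples(num_frames: int, fps: int, sample_rate: int) -> int:
    """Audio samples needed to cover the clip, using integer arithmetic."""
    return num_frames * sample_rate // fps


# Values taken from the Quick Start snippet.
height, width, num_frames = 512 * 2, 768 * 2, 121
fps, audio_sample_rate = 24, 24000

print(f"resolution: {width}x{height}")
print(f"duration: {clip_duration_seconds(num_frames, fps):.3f} s")
print(f"audio samples: {nominal_audio_samples(num_frames, fps, audio_sample_rate)}")
```

So the example produces a clip of roughly five seconds; changing `num_frames` or the `fps` passed to the writer scales the duration accordingly.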

Examples

Example code for LTX-2 is available at: /examples/ltx2/

Model ID	Extra Args	Inference	Low-VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
jd-opensource/JoyAI-Echo	-	code	code	code	code	code	code
Lightricks/LTX-2.3: OneStagePipeline-I2AV	input_images	code	code	code	code	code	code
Lightricks/LTX-2.3: TwoStagePipeline-I2AV	input_images	code	code	-	-	-	-
Lightricks/LTX-2.3: DistilledPipeline-I2AV	input_images	code	code	-	-	-	-
Lightricks/LTX-2.3: OneStagePipeline-T2AV	-	code	code	code	code	code	code
Lightricks/LTX-2.3: TwoStagePipeline-T2AV	-	code	code	-	-	-	-
Lightricks/LTX-2.3: DistilledPipeline-T2AV	-	code	code	-	-	-	-
Lightricks/LTX-2.3: A2V	retake_audio,audio_sample_rate,retake_audio_regions	code	code	-	-	-	-
Lightricks/LTX-2.3: Retake	retake_video,retake_video_regions,retake_audio,audio_sample_rate,retake_audio_regions	code	code	-	-	-	-
Lightricks/LTX-2.3-22b-IC-LoRA-Union-Control	in_context_videos,in_context_downsample_factor	code	code	-	-	code	code
Lightricks/LTX-2.3-22b-IC-LoRA-Motion-Track-Control	in_context_videos,in_context_downsample_factor	code	code	-	-	code	code
Lightricks/LTX-2: OneStagePipeline-T2AV	-	code	code	code	code	code	code
Lightricks/LTX-2-19b-IC-LoRA-Union-Control	in_context_videos,in_context_downsample_factor	code	code	-	-	code	code
Lightricks/LTX-2-19b-IC-LoRA-Detailer	in_context_videos,in_context_downsample_factor	code	code	-	-	code	code
Lightricks/LTX-2: TwoStagePipeline-T2AV	-	code	code	-	-	-	-
Lightricks/LTX-2: DistilledPipeline-T2AV	-	code	code	-	-	-	-
Lightricks/LTX-2: OneStagePipeline-I2AV	input_images	code	code	-	-	-	-
Lightricks/LTX-2: TwoStagePipeline-I2AV	input_images	code	code	-	-	-	-
Lightricks/LTX-2: DistilledPipeline-I2AV	input_images	code	code	-	-	-	-
Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-In	-	code	code	-	-	-	-
Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out	-	code	code	-	-	-	-
Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Left	-	code	code	-	-	-	-
Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Right	-	code	code	-	-	-	-
Lightricks/LTX-2-19b-LoRA-Camera-Control-Jib-Up	-	code	code	-	-	-	-
Lightricks/LTX-2-19b-LoRA-Camera-Control-Jib-Down	-	code	code	-	-	-	-
Lightricks/LTX-2-19b-LoRA-Camera-Control-Static	-	code	code	-	-	-	-

Wan: /docs/en/Model_Details/Wan.md

Quick Start

Running the following code will quickly load the Wan-AI/Wan2.1-T2V-1.3B model for inference. VRAM management is enabled, and the framework automatically adjusts model parameter loading based on available GPU memory. The model can run with as little as 8 GB of VRAM.

import torch
from diffsynth.utils.data import save_video, VideoData
from diffsynth.pipelines.wan_video import WanVideoPipeline, ModelConfig

vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}
pipe = WanVideoPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="models_t5_umt5-xxl-enc-bf16.pth", **vram_config),
        ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="Wan2.1_VAE.pth", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Wan-AI/Wan2.1-T2V-1.3B", origin_file_pattern="google/umt5-xxl/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 2,
)

# Prompt in English: documentary-photography style; a lively brownish-yellow puppy with pricked ears runs
# swiftly across a lush green meadow in sunlight, with wildflowers, blue sky and a few white clouds in the
# background; strong perspective, medium shot, side-on tracking view.
# The negative prompt lists artifacts to avoid (garish colors, overexposure, blurry details, subtitles,
# deformed hands and limbs, cluttered background, walking backwards, and so on).
video = pipe(
    prompt="纪实摄影风格画面，一只活泼的小狗在绿茵茵的草地上迅速奔跑。小狗毛色棕黄，两只耳朵立起，神情专注而欢快。阳光洒在它身上，使得毛发看上去格外柔软而闪亮。背景是一片开阔的草地，偶尔点缀着几朵野花，远处隐约可见蓝天和几片白云。透视感鲜明，捕捉小狗奔跑时的动感和四周草地的生机。中景侧面移动视角。",
    negative_prompt="色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走",
    seed=0,
    tiled=True,
)
save_video(video, "video.mp4", fps=15, quality=5)

Model Lineage

graph LR;
    Wan-Series-->Wan2.1-Series;
    Wan-Series-->Wan2.2-Series;
    Wan2.1-Series-->Wan-AI/Wan2.1-T2V-1.3B;
    Wan2.1-Series-->Wan-AI/Wan2.1-T2V-14B;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.1-I2V-14B-480P;
    Wan-AI/Wan2.1-I2V-14B-480P-->Wan-AI/Wan2.1-I2V-14B-720P;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.1-FLF2V-14B-720P;
    Wan-AI/Wan2.1-T2V-1.3B-->iic/VACE-Wan2.1-1.3B-Preview;
    iic/VACE-Wan2.1-1.3B-Preview-->Wan-AI/Wan2.1-VACE-1.3B;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.1-VACE-14B;
    Wan-AI/Wan2.1-T2V-1.3B-->Wan2.1-Fun-1.3B-Series;
    Wan2.1-Fun-1.3B-Series-->PAI/Wan2.1-Fun-1.3B-InP;
    Wan2.1-Fun-1.3B-Series-->PAI/Wan2.1-Fun-1.3B-Control;
    Wan-AI/Wan2.1-T2V-14B-->Wan2.1-Fun-14B-Series;
    Wan2.1-Fun-14B-Series-->PAI/Wan2.1-Fun-14B-InP;
    Wan2.1-Fun-14B-Series-->PAI/Wan2.1-Fun-14B-Control;
    Wan-AI/Wan2.1-T2V-1.3B-->Wan2.1-Fun-V1.1-1.3B-Series;
    Wan2.1-Fun-V1.1-1.3B-Series-->PAI/Wan2.1-Fun-V1.1-1.3B-Control;
    Wan2.1-Fun-V1.1-1.3B-Series-->PAI/Wan2.1-Fun-V1.1-1.3B-InP;
    Wan2.1-Fun-V1.1-1.3B-Series-->PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera;
    Wan-AI/Wan2.1-T2V-14B-->Wan2.1-Fun-V1.1-14B-Series;
    Wan2.1-Fun-V1.1-14B-Series-->PAI/Wan2.1-Fun-V1.1-14B-Control;
    Wan2.1-Fun-V1.1-14B-Series-->PAI/Wan2.1-Fun-V1.1-14B-InP;
    Wan2.1-Fun-V1.1-14B-Series-->PAI/Wan2.1-Fun-V1.1-14B-Control-Camera;
    Wan-AI/Wan2.1-T2V-1.3B-->DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1;
    Wan-AI/Wan2.1-T2V-14B-->krea/krea-realtime-video;
    Wan-AI/Wan2.1-T2V-14B-->meituan-longcat/LongCat-Video;
    Wan-AI/Wan2.1-I2V-14B-720P-->ByteDance/Video-As-Prompt-Wan2.1-14B;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.2-Animate-14B;
    Wan-AI/Wan2.1-T2V-14B-->Wan-AI/Wan2.2-S2V-14B;
    Wan2.2-Series-->Wan-AI/Wan2.2-T2V-A14B;
    Wan2.2-Series-->Wan-AI/Wan2.2-I2V-A14B;
    Wan2.2-Series-->Wan-AI/Wan2.2-TI2V-5B;
    Wan-AI/Wan2.2-T2V-A14B-->Wan2.2-Fun-Series;
    Wan2.2-Fun-Series-->PAI/Wan2.2-VACE-Fun-A14B;
    Wan2.2-Fun-Series-->PAI/Wan2.2-Fun-A14B-InP;
    Wan2.2-Fun-Series-->PAI/Wan2.2-Fun-A14B-Control;
    Wan2.2-Fun-Series-->PAI/Wan2.2-Fun-A14B-Control-Camera;

Examples

Example code for Wan is available at: /examples/wanvideo/

Model ID	Extra Inputs	Inference	Low VRAM Inference	Full Training	Validation After Full Training	LoRA Training	Validation After LoRA Training
Wan-AI/Wan2.1-T2V-1.3B	-	code	code	code	code	code	code
Wan-AI/Wan2.1-T2V-14B	-	code	code	code	code	code	code
Wan-AI/Wan2.1-I2V-14B-480P	input_image	code	code	code	code	code	code
Wan-AI/Wan2.1-I2V-14B-720P	input_image	code	code	code	code	code	code
Wan-AI/Wan2.1-FLF2V-14B-720P	input_image, end_image	code	code	code	code	code	code
iic/VACE-Wan2.1-1.3B-Preview	vace_control_video, vace_reference_image	code	code	code	code	code	code
Wan-AI/Wan2.1-VACE-1.3B	vace_control_video, vace_reference_image	code	code	code	code	code	code
Wan-AI/Wan2.1-VACE-14B	vace_control_video, vace_reference_image	code	code	code	code	code	code
PAI/Wan2.1-Fun-1.3B-InP	input_image, end_image	code	code	code	code	code	code
PAI/Wan2.1-Fun-1.3B-Control	control_video	code	code	code	code	code	code
PAI/Wan2.1-Fun-14B-InP	input_image, end_image	code	code	code	code	code	code
PAI/Wan2.1-Fun-14B-Control	control_video	code	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-1.3B-Control	control_video, reference_image	code	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-14B-Control	control_video, reference_image	code	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-1.3B-InP	input_image, end_image	code	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-14B-InP	input_image, end_image	code	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-1.3B-Control-Camera	control_camera_video, input_image	code	code	code	code	code	code
PAI/Wan2.1-Fun-V1.1-14B-Control-Camera	control_camera_video, input_image	code	code	code	code	code	code
DiffSynth-Studio/Wan2.1-1.3b-speedcontrol-v1	motion_bucket_id	code	code	code	code	code	code
krea/krea-realtime-video		code	code	code	code	code	code
meituan-longcat/LongCat-Video	longcat_video	code	code	code	code	code	code
ByteDance/Video-As-Prompt-Wan2.1-14B	vap_video, vap_prompt	code	code	code	code	code	code
Wan-AI/Wan2.2-T2V-A14B		code	code	code	code	code	code
Wan-AI/Wan2.2-I2V-A14B	input_image	code	code	code	code	code	code
Wan-AI/Wan2.2-TI2V-5B	input_image	code	code	code	code	code	code
Wan-AI/Wan2.2-Animate-14B	input_image, animate_pose_video, animate_face_video, animate_inpaint_video, animate_mask_video	code	code	code	code	code	code
Wan-AI/Wan2.2-S2V-14B	input_image, input_audio, audio_sample_rate, s2v_pose_video	code	code	code	code	code	code
PAI/Wan2.2-VACE-Fun-A14B	vace_control_video, vace_reference_image	code	code	code	code	code	code
PAI/Wan2.2-Fun-A14B-InP	input_image, end_image	code	code	code	code	code	code
PAI/Wan2.2-Fun-A14B-Control	control_video, reference_image	code	code	code	code	code	code
PAI/Wan2.2-Fun-A14B-Control-Camera	control_camera_video, input_image	code	code	code	code	code	code
openmoss/MOVA-360p	input_image	code	code	code	code	code	code
openmoss/MOVA-720p	input_image	code	code	code	code	code	code
Wan-AI/Wan2.2-Dancer-14B (global model)	wantodance_music_path, wantodance_reference_image, wantodance_fps, wantodance_keyframes, wantodance_keyframes_mask	code	code	code	code	code	code
Wan-AI/Wan2.2-Dancer-14B (local model)	wantodance_music_path, wantodance_reference_image, wantodance_fps, wantodance_keyframes, wantodance_keyframes_mask	code	code	code	code	code	code
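
The Extra Inputs column can be treated as a lookup from model ID to the additional arguments a model accepts beyond the text prompt. The following is a minimal sketch of such a lookup, not part of DiffSynth-Studio: the `EXTRA_INPUTS` dictionary and the `missing_extra_inputs` helper are hypothetical names, and only a few rows of the table are copied. The table does not state whether each extra input is mandatory, so the helper only reports which listed inputs have not been supplied.

```python
# Hypothetical lookup built from a few rows of the table above (not the full list).
EXTRA_INPUTS = {
    "Wan-AI/Wan2.1-T2V-1.3B": [],
    "Wan-AI/Wan2.1-I2V-14B-480P": ["input_image"],
    "Wan-AI/Wan2.1-FLF2V-14B-720P": ["input_image", "end_image"],
    "PAI/Wan2.1-Fun-V1.1-1.3B-Control": ["control_video", "reference_image"],
}


def missing_extra_inputs(model_id: str, provided: dict) -> list:
    """Return the extra inputs listed for model_id that are absent or None in provided."""
    expected = EXTRA_INPUTS[model_id]
    return [name for name in expected if provided.get(name) is None]


if __name__ == "__main__":
    # Placeholder object standing in for a loaded image.
    kwargs = {"input_image": object()}
    print(missing_extra_inputs("Wan-AI/Wan2.1-FLF2V-14B-720P", kwargs))
```

Running such a check before calling a pipeline makes it easier to spot an extra input that was forgotten when switching between models.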

Audio Synthesis

ACE-Step: /docs/en/Model_Details/ACE-Step.md

Quick Start

Running the following code will quickly load the ACE-Step/Ace-Step1.5 model and perform inference. VRAM management is enabled, and the framework will automatically control the loading of model parameters based on available VRAM. The model can run with a minimum of 3GB VRAM.

from diffsynth.pipelines.ace_step import AceStepPipeline, ModelConfig
from diffsynth.utils.data.audio import save_audio
import torch

# vram_config is used below but is not defined in this excerpt.
# ASSUMPTION: the keys and values here follow the VRAM management settings used
# elsewhere in this README; adjust them to your hardware.
vram_config = {
    "offload_dtype": torch.bfloat16,
    "offload_device": "cpu",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

# Load the DiT, text encoder, and VAE weights; vram_limit is the total VRAM in GiB minus a 0.5 GiB margin.
pipe = AceStepPipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="ACE-Step/Ace-Step1.5", origin_file_pattern="acestep-v15-turbo/model.safetensors", **vram_config),
        ModelConfig(model_id="ACE-Step/Ace-Step1.5", origin_file_pattern="Qwen3-Embedding-0.6B/model.safetensors", **vram_config),
        ModelConfig(model_id="ACE-Step/Ace-Step1.5", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    text_tokenizer_config=ModelConfig(model_id="ACE-Step/Ace-Step1.5", origin_file_pattern="Qwen3-Embedding-0.6B/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,
)

# Style description and lyrics (section tags in square brackets, Chinese vocals).
prompt = "An explosive, high-energy pop-rock track with a strong anime theme song feel. The song kicks off with a catchy, synthesized brass fanfare over a driving rock beat with punchy drums and a solid bassline. A powerful, clear male vocal enters with a theatrical and energetic delivery, soaring through the verses and hitting powerful high notes in the chorus. The arrangement is dense and dynamic, featuring rhythmic electric guitar chords, brief instrumental breaks with synth flourishes, and a consistent, danceable groove throughout. The overall mood is triumphant, adventurous, and exhilarating."
lyrics = '[Intro - Synth Brass Fanfare]\n\n[Verse 1]\n黑夜里的风吹过耳畔\n甜蜜时光转瞬即万\n脚步飘摇在星光上\n心追节奏心跳狂乱\n耳边传来电吉他呼唤\n手指轻触碰点流点燃\n梦在云端任它蔓延\n疯狂跳跃自由无间\n\n[Chorus]\n心电感应在震动间\n拥抱未来勇敢冒险\n那旋律在心中无限\n世界变得如此耀眼\n\n[Instrumental Break - Synth Brass Melody]\n\n[Verse 2]\n鼓点撞击黑夜的底端\n跳动节拍连接你我俩\n在这里让灵魂发光\n燃尽所有不留遗憾\n\n[Instrumental Break - Synth Brass Melody]\n\n[Bridge]\n光影交错彼此的视线\n霓虹之下夜空的蔚蓝\n月光洒下温热心田\n追逐梦想它不会遥远\n\n[Chorus]\n心电感应在震动间\n拥抱未来勇敢冒险\n那旋律在心中无限\n世界变得如此耀眼\n\n[Outro - Instrumental with Synth Brass Melody]\n[Song ends abruptly]'
audio = pipe(
    prompt=prompt,
    lyrics=lyrics,
    duration=160,
    bpm=100,
    keyscale="B minor",
    timesignature="4",
    vocal_language="zh",
    seed=42,
)

# Write the generated waveform to disk at the VAE's sampling rate.
save_audio(audio, pipe.vae.sampling_rate, "acestep-v15-turbo.wav")
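
The `vram_limit` expression in the quick start takes the total device memory in bytes (the second element returned by `torch.cuda.mem_get_info`), converts it to GiB, and subtracts a 0.5 GiB margin. The arithmetic can be sketched without a GPU as follows; the helper name `vram_limit_gib` and the 24 GiB figure are illustrative assumptions, not part of DiffSynth-Studio.

```python
def vram_limit_gib(total_bytes: int, reserve_gib: float = 0.5) -> float:
    """Convert total device memory in bytes to GiB and subtract a safety margin."""
    return total_bytes / (1024 ** 3) - reserve_gib


if __name__ == "__main__":
    # Hypothetical card with 24 GiB of memory. On a real machine the total would
    # come from torch.cuda.mem_get_info("cuda")[1], which is reported in bytes.
    total = 24 * 1024 ** 3
    print(vram_limit_gib(total))  # 23.5
```

Leaving a margin below the physical total gives other allocations on the device some room; a larger `reserve_gib` may suit a GPU that is shared with other processes.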

Examples

Example code for ACE-Step is available at: /examples/ace_step/

Model ID	Inference	Low VRAM Inference	Full Training	Full Training Validation	LoRA Training	LoRA Training Validation
ACE-Step/Ace-Step1.5	code	code	code	code	code	code
ACE-Step/acestep-v15-turbo-shift1	code	code	code	code	code	code
ACE-Step/acestep-v15-turbo-shift3	code	code	code	code	code	code
ACE-Step/acestep-v15-turbo-continuous	code	code	code	code	code	code
ACE-Step/acestep-v15-base	code	code	code	code	code	code
ACE-Step/acestep-v15-base: CoverTask	code	code	—	—	—	—
ACE-Step/acestep-v15-base: RepaintTask	code	code	—	—	—	—
ACE-Step/acestep-v15-sft	code	code	code	code	code	code
ACE-Step/acestep-v15-xl-base	code	code	code	code	code	code
ACE-Step/acestep-v15-xl-sft	code	code	code	code	code	code
ACE-Step/acestep-v15-xl-turbo	code	code	code	code	code	code
DiffSynth-Studio/acestep15xlsft-lora-music	code	code	code	code	—	—

Image Quality Metrics Models

/docs/en/Model_Details/Image-Quality-Metrics.md

Quick Start

Run the following code to quickly load PickScore and evaluate an image against a text prompt. The default model will be downloaded from ModelScope to ./models.

from diffsynth.metrics import PickScoreMetric, ModelConfig
from modelscope import dataset_snapshot_download
from PIL import Image

# Download the example images used for evaluation.
dataset_snapshot_download(
    "DiffSynth-Studio/diffsynth_example_dataset",
    allow_file_pattern="flux/FLUX.1-dev/*",
    local_dir="./data/diffsynth_example_dataset",
)
image = Image.open("data/diffsynth_example_dataset/flux/FLUX.1-dev/1.jpg").convert("RGB")
prompt = "a dog"

# Load PickScore and score the image against the prompt.
metric = PickScoreMetric.from_pretrained(
    model_config=ModelConfig(model_id="DiffSynth-Studio/ImageMetrics", origin_file_pattern="PickScore/model.safetensors"),
    device="cuda",
)
score = metric.compute(prompt, image)[0]
print(f"PickScore: {score:.3f}")
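
A common use of such a score is to pick the best of several candidate images for one prompt. The sketch below shows only the selection step and runs without a GPU; `best_by_score` is a hypothetical helper, and the scores in the demo are made-up stand-ins. It assumes the scoring function returns a float where higher is better; in practice `score_fn` could wrap a call like `metric.compute(prompt, image)[0]` from the quick start.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def best_by_score(candidates: Sequence[T], score_fn: Callable[[T], float]) -> Tuple[T, float]:
    """Return the highest-scoring candidate and its score (first one wins on ties)."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    scores = [score_fn(c) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


if __name__ == "__main__":
    # Stand-in scores keyed by file name; real scores would come from a metric model.
    fake_scores = {"1.jpg": 0.21, "2.jpg": 0.34, "3.jpg": 0.29}
    name, score = best_by_score(list(fake_scores), fake_scores.get)
    print(name, score)
```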

Example Code

Example code for image quality metrics models can be found at: /examples/image_quality_metric/

Metric	GitHub Repository	Example Code
PickScore	GitHub	code
ImageReward	GitHub	code
HPSv2	GitHub	code
HPSv3	GitHub	code
CLIP Score	GitHub	code
Aesthetic	GitHub	code
FID	GitHub	code

Innovative Achievements

DiffSynth-Studio is not just an engineered model framework, but also an incubator for innovative achievements.

Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation

FLUX.1-dev	FLUX.1-dev + SES	Qwen-Image	Qwen-Image + SES

VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

Paper: VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers
Sample Code: /examples/qwen_image/model_inference/Qwen-Image-Edit-2511-ICEdit.py
Model: ModelScope

Example 1	Example 2	Query	Output

AttriCtrl: Attribute Intensity Control for Image Generation Models

Paper: AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models
Sample Code: /examples/flux/model_inference/FLUX.1-dev-AttriCtrl.py
Model: ModelScope

brightness scale = 0.1	brightness scale = 0.3	brightness scale = 0.5	brightness scale = 0.7	brightness scale = 0.9

AutoLoRA: Automated LoRA Retrieval and Fusion

Paper: AutoLoRA: Automatic LoRA Retrieval and Fine-Grained Gated Fusion for Text-to-Image Generation
Sample Code: /examples/flux/model_inference/FLUX.1-dev-LoRA-Fusion.py
Model: ModelScope

LoRA 1	LoRA 2	LoRA 3	LoRA 4
LoRA 1
LoRA 2
LoRA 3
LoRA 4

Nexus-Gen: Unified Architecture for Image Understanding, Generation, and Editing

Detailed Page: https://github.com/modelscope/Nexus-Gen
Paper: Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space
Model: ModelScope, HuggingFace
Dataset: ModelScope Dataset
Online Experience: ModelScope Nexus-Gen Studio

ArtAug: Aesthetic Enhancement for Image Generation Models

Detailed Page: ./examples/ArtAug/
Paper: ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction
Model: ModelScope, HuggingFace
Online Experience: ModelScope AIGC Tab

FLUX.1-dev	FLUX.1-dev + ArtAug LoRA

EliGen: Precise Image Partition Control

Paper: EliGen: Entity-Level Controlled Image Generation with Regional Attention
Sample Code: /examples/flux/model_inference/FLUX.1-dev-EliGen.py
Model: ModelScope, HuggingFace
Online Experience: ModelScope EliGen Studio
Dataset: EliGen Train Set

Entity Control Region	Generated Image

ExVideo: Extended Training for Video Generation Models

Project Page: Project Page
Paper: ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning
Sample Code: Please refer to the older version
Model: ModelScope, HuggingFace

Diffutoon: High-Resolution Anime-Style Video Rendering

Project Page: Project Page
Paper: Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models
Sample Code: Please refer to the older version

DiffSynth: The Original Version of This Project

Project Page: Project Page
Paper: DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis
Sample Code: Please refer to the older version

Contact Us

Discord: https://discord.gg/Mm9suEeUDc