Z-Image

LoRA is supported.

Z-Image is a powerful and highly efficient image generation model with 6B parameters. Only one checkpoint is currently available; two more are planned for release:

| Model | Hugging Face |
|---|---|
| Z-Image-Turbo | https://huggingface.co/Tongyi-MAI/Z-Image-Turbo |

Z-Image-Turbo

Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (number of function evaluations). It offers sub-second inference latency on enterprise-grade H800 GPUs and fits comfortably within 16 GB of VRAM on consumer devices. It excels at photorealistic image generation, bilingual text rendering (English and Chinese), and robust instruction adherence.
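For a quick start, here is a minimal text-to-image sketch using the same settings as the reference example further down this page (9 steps, guidance disabled); the prompt and seed are illustrative:

```python
import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    "A photorealistic street portrait on a rainy evening, neon reflections",
    height=1024,
    width=1024,
    num_inference_steps=9,
    guidance_scale=0.0,  # the distilled Turbo model runs without classifier-free guidance
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("zimage_t2i.png")
```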

Image-to-image

Use ZImageImg2ImgPipeline to transform an existing image based on a text prompt.

```python
import torch
from diffusers import ZImageImg2ImgPipeline
from diffusers.utils import load_image

pipe = ZImageImg2ImgPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load the starting image and match the output resolution.
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = load_image(url).resize((1024, 1024))

prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors"
image = pipe(
    prompt,
    image=init_image,
    strength=0.6,  # how strongly the prompt overrides the input image
    num_inference_steps=9,
    guidance_scale=0.0,  # Turbo is distilled, so classifier-free guidance is off
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage_img2img.png")
```

ZImagePipeline

class diffusers.ZImagePipeline


```python
(
    scheduler: FlowMatchEulerDiscreteScheduler,
    vae: AutoencoderKL,
    text_encoder: PreTrainedModel,
    tokenizer: AutoTokenizer,
    transformer: ZImageTransformer2DModel,
)
```

The Z-Image pipeline for text-to-image generation.

__call__


```python
(
    prompt: Union[str, List[str]] = None,
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_inference_steps: int = 50,
    sigmas: Optional[List[float]] = None,
    guidance_scale: float = 5.0,
    cfg_normalization: bool = False,
    cfg_truncation: float = 1.0,
    negative_prompt: Union[str, List[str], None] = None,
    num_images_per_prompt: Optional[int] = 1,
    generator: Union[torch.Generator, List[torch.Generator], None] = None,
    latents: Optional[torch.FloatTensor] = None,
    prompt_embeds: Optional[List[torch.FloatTensor]] = None,
    negative_prompt_embeds: Optional[List[torch.FloatTensor]] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = True,
    joint_attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    max_sequence_length: int = 512,
) → ZImagePipelineOutput or tuple
```


Returns

`ZImagePipelineOutput` or `tuple`

`ZImagePipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

```python
import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Prompt (Chinese, kept as-is to exercise bilingual text rendering): "A creative
# poster for a project named '造相「Z-IMAGE-TURBO」'. The image cleverly visualizes
# the concept: a retro steam locomotive becomes a giant zipper pull, unzipping
# thick winter snow to reveal a vibrant spring."
prompt = "一幅为名为“造相「Z-IMAGE-TURBO」”的项目设计的创意海报。画面巧妙地将文字概念视觉化:一辆复古蒸汽小火车化身为巨大的拉链头,正拉开厚厚的冬日积雪,展露出一个生机盎然的春天。"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage.png")
```

ZImageImg2ImgPipeline

class diffusers.ZImageImg2ImgPipeline


```python
(
    scheduler: FlowMatchEulerDiscreteScheduler,
    vae: AutoencoderKL,
    text_encoder: PreTrainedModel,
    tokenizer: AutoTokenizer,
    transformer: ZImageTransformer2DModel,
)
```


The ZImage pipeline for image-to-image generation.

__call__


```python
(
    prompt: Union[str, List[str]] = None,
    image: Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, List[PIL.Image.Image], List[numpy.ndarray], List[torch.Tensor]] = None,
    strength: float = 0.6,
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_inference_steps: int = 50,
    sigmas: Optional[List[float]] = None,
    guidance_scale: float = 5.0,
    cfg_normalization: bool = False,
    cfg_truncation: float = 1.0,
    negative_prompt: Union[str, List[str], None] = None,
    num_images_per_prompt: Optional[int] = 1,
    generator: Union[torch.Generator, List[torch.Generator], None] = None,
    latents: Optional[torch.FloatTensor] = None,
    prompt_embeds: Optional[List[torch.FloatTensor]] = None,
    negative_prompt_embeds: Optional[List[torch.FloatTensor]] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = True,
    joint_attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    max_sequence_length: int = 512,
) → ZImagePipelineOutput or tuple
```


Returns

`ZImagePipelineOutput` or `tuple`

`ZImagePipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for image-to-image generation.
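A note on `strength` and `num_inference_steps`: in diffusers image-to-image pipelines generally, `strength` determines where along the noise schedule denoising starts, so only roughly `int(num_inference_steps * strength)` denoising steps actually run. A sketch of that convention (assumed here, not taken from the Z-Image implementation):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # The first (1 - strength) fraction of the schedule is skipped because
    # the input image is only partially noised before denoising begins.
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(9, 0.6))  # -> 5 denoising steps actually run
```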

Examples:

```python
import torch
from diffusers import ZImageImg2ImgPipeline
from diffusers.utils import load_image

pipe = ZImageImg2ImgPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = load_image(url).resize((1024, 1024))

prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors"
image = pipe(
    prompt,
    image=init_image,
    strength=0.6,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage_img2img.png")
```
