Z-Image

LoRA is supported.

Z-Image is a powerful and highly efficient image generation model with 6B parameters. Only one checkpoint is currently available; two more are planned for release:

| Model | Hugging Face |
|---|---|
| Z-Image-Turbo | https://huggingface.co/Tongyi-MAI/Z-Image-Turbo |

Z-Image-Turbo

Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (number of function evaluations). It offers sub-second inference latency on enterprise-grade H800 GPUs and fits comfortably within 16 GB of VRAM on consumer devices. It excels at photorealistic image generation, bilingual text rendering (English and Chinese), and robust instruction adherence.
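For a quick start, here is a minimal text-to-image sketch using the same settings as the reference example further down this page (9 steps, guidance disabled); the prompt and seed are illustrative:

```python
import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    "A photorealistic street portrait on a rainy evening, neon reflections",
    height=1024,
    width=1024,
    num_inference_steps=9,
    guidance_scale=0.0,  # the distilled Turbo model runs without classifier-free guidance
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("zimage_t2i.png")
```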

Image-to-image

Use ZImageImg2ImgPipeline to transform an existing image based on a text prompt.

```python
import torch
from diffusers import ZImageImg2ImgPipeline
from diffusers.utils import load_image

pipe = ZImageImg2ImgPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load the starting image and match the output resolution.
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = load_image(url).resize((1024, 1024))

prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors"
image = pipe(
    prompt,
    image=init_image,
    strength=0.6,  # how strongly the prompt overrides the input image
    num_inference_steps=9,
    guidance_scale=0.0,  # Turbo is distilled, so classifier-free guidance is off
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage_img2img.png")
```

ZImagePipeline

class diffusers.ZImagePipeline


```python
(
    scheduler: FlowMatchEulerDiscreteScheduler,
    vae: AutoencoderKL,
    text_encoder: PreTrainedModel,
    tokenizer: AutoTokenizer,
    transformer: ZImageTransformer2DModel,
)
```

The Z-Image pipeline for text-to-image generation.

__call__


```python
(
    prompt: Union[str, List[str]] = None,
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_inference_steps: int = 50,
    sigmas: Optional[List[float]] = None,
    guidance_scale: float = 5.0,
    cfg_normalization: bool = False,
    cfg_truncation: float = 1.0,
    negative_prompt: Union[str, List[str], None] = None,
    num_images_per_prompt: Optional[int] = 1,
    generator: Union[torch.Generator, List[torch.Generator], None] = None,
    latents: Optional[torch.FloatTensor] = None,
    prompt_embeds: Optional[List[torch.FloatTensor]] = None,
    negative_prompt_embeds: Optional[List[torch.FloatTensor]] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = True,
    joint_attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    max_sequence_length: int = 512,
) → ZImagePipelineOutput or tuple
```


Returns

`ZImagePipelineOutput` or `tuple`

`ZImagePipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

```python
import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Prompt (Chinese, kept as-is to exercise bilingual text rendering): "A creative
# poster for a project named '造相「Z-IMAGE-TURBO」'. The image cleverly visualizes
# the concept: a retro steam locomotive becomes a giant zipper pull, unzipping
# thick winter snow to reveal a vibrant spring."
prompt = "一幅为名为“造相「Z-IMAGE-TURBO」”的项目设计的创意海报。画面巧妙地将文字概念视觉化:一辆复古蒸汽小火车化身为巨大的拉链头,正拉开厚厚的冬日积雪,展露出一个生机盎然的春天。"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage.png")
```

ZImageImg2ImgPipeline

class diffusers.ZImageImg2ImgPipeline


```python
(
    scheduler: FlowMatchEulerDiscreteScheduler,
    vae: AutoencoderKL,
    text_encoder: PreTrainedModel,
    tokenizer: AutoTokenizer,
    transformer: ZImageTransformer2DModel,
)
```


The ZImage pipeline for image-to-image generation.

__call__


```python
(
    prompt: Union[str, List[str]] = None,
    image: Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, List[PIL.Image.Image], List[numpy.ndarray], List[torch.Tensor]] = None,
    strength: float = 0.6,
    height: Optional[int] = None,
    width: Optional[int] = None,
    num_inference_steps: int = 50,
    sigmas: Optional[List[float]] = None,
    guidance_scale: float = 5.0,
    cfg_normalization: bool = False,
    cfg_truncation: float = 1.0,
    negative_prompt: Union[str, List[str], None] = None,
    num_images_per_prompt: Optional[int] = 1,
    generator: Union[torch.Generator, List[torch.Generator], None] = None,
    latents: Optional[torch.FloatTensor] = None,
    prompt_embeds: Optional[List[torch.FloatTensor]] = None,
    negative_prompt_embeds: Optional[List[torch.FloatTensor]] = None,
    output_type: Optional[str] = "pil",
    return_dict: bool = True,
    joint_attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    max_sequence_length: int = 512,
) → ZImagePipelineOutput or tuple
```


Returns

`ZImagePipelineOutput` or `tuple`

`ZImagePipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for image-to-image generation.
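A note on `strength` and `num_inference_steps`: in diffusers image-to-image pipelines generally, `strength` determines where along the noise schedule denoising starts, so only roughly `int(num_inference_steps * strength)` denoising steps actually run. A sketch of that convention (assumed here, not taken from the Z-Image implementation):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # The first (1 - strength) fraction of the schedule is skipped because
    # the input image is only partially noised before denoising begins.
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(9, 0.6))  # -> 5 denoising steps actually run
```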

Examples:

```python
import torch
from diffusers import ZImageImg2ImgPipeline
from diffusers.utils import load_image

pipe = ZImageImg2ImgPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = load_image(url).resize((1024, 1024))

prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors"
image = pipe(
    prompt,
    image=init_image,
    strength=0.6,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage_img2img.png")
```
