unCLIP

Hierarchical Text-Conditional Image Generation with CLIP Latents is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in 🤗 Diffusers comes from kakaobrain’s karlo.

The abstract from the paper is:

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

You can find lucidrains’ DALL-E 2 recreation at lucidrains/DALLE2-pytorch.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
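As a minimal sketch of the component-reuse tip above (the model ids are assumptions based on the kakaobrain/karlo release mentioned earlier), the text encoder and tokenizer loaded by one pipeline can be passed to a second pipeline as keyword overrides to `from_pretrained` so they are only loaded once:

```python
from diffusers import UnCLIPPipeline, UnCLIPImageVariationPipeline

# Load the text-to-image pipeline first (model id is an assumption based on
# the kakaobrain/karlo release referenced above).
pipe_t2i = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")

# Reuse its text encoder and tokenizer in the image-variation pipeline so the
# shared components are not loaded into memory a second time.
pipe_var = UnCLIPImageVariationPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations",
    text_encoder=pipe_t2i.text_encoder,
    tokenizer=pipe_t2i.tokenizer,
)
```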

UnCLIPPipeline

class diffusers.UnCLIPPipeline


( prior: PriorTransformer decoder: UNet2DConditionModel text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer text_proj: UnCLIPTextProjModel super_res_first: UNet2DModel super_res_last: UNet2DModel prior_scheduler: UnCLIPScheduler decoder_scheduler: UnCLIPScheduler super_res_scheduler: UnCLIPScheduler )


Pipeline for text-to-image generation using unCLIP.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__


( prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: int = 1 prior_num_inference_steps: int = 25 decoder_num_inference_steps: int = 25 super_res_num_inference_steps: int = 7 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None prior_latents: typing.Optional[torch.Tensor] = None decoder_latents: typing.Optional[torch.Tensor] = None super_res_latents: typing.Optional[torch.Tensor] = None text_model_output: typing.Union[transformers.models.clip.modeling_clip.CLIPTextModelOutput, typing.Tuple, NoneType] = None text_attention_mask: typing.Optional[torch.Tensor] = None prior_guidance_scale: float = 4.0 decoder_guidance_scale: float = 8.0 output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) β†’ ImagePipelineOutput or tuple

Returns

If return_dict is True, ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

The call function to the pipeline for generation.
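A minimal text-to-image sketch with UnCLIPPipeline, assuming the kakaobrain/karlo-v1-alpha checkpoint and a CUDA device; the step counts and guidance scales shown are simply the defaults from the signature above:

```python
import torch
from diffusers import UnCLIPPipeline

# Model id is an assumption based on the kakaobrain/karlo release.
pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a high-resolution photograph of a big red frog on a green leaf"

# The prior, decoder, and super-resolution stages each take their own number
# of inference steps; the prior and decoder also take separate guidance scales.
image = pipe(
    prompt,
    prior_num_inference_steps=25,
    decoder_num_inference_steps=25,
    super_res_num_inference_steps=7,
    prior_guidance_scale=4.0,
    decoder_guidance_scale=8.0,
).images[0]

image.save("frog.png")
```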

UnCLIPImageVariationPipeline

class diffusers.UnCLIPImageVariationPipeline


( decoder: UNet2DConditionModel text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer text_proj: UnCLIPTextProjModel feature_extractor: CLIPImageProcessor image_encoder: CLIPVisionModelWithProjection super_res_first: UNet2DModel super_res_last: UNet2DModel decoder_scheduler: UnCLIPScheduler super_res_scheduler: UnCLIPScheduler )


Pipeline to generate image variations from an input image using UnCLIP.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__


( image: typing.Union[PIL.Image.Image, typing.List[PIL.Image.Image], torch.Tensor, NoneType] = None num_images_per_prompt: int = 1 decoder_num_inference_steps: int = 25 super_res_num_inference_steps: int = 7 generator: typing.Optional[torch._C.Generator] = None decoder_latents: typing.Optional[torch.Tensor] = None super_res_latents: typing.Optional[torch.Tensor] = None image_embeddings: typing.Optional[torch.Tensor] = None decoder_guidance_scale: float = 8.0 output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) β†’ ImagePipelineOutput or tuple

Returns

If return_dict is True, ImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.

The call function to the pipeline for generation.
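A short image-variation sketch, assuming the kakaobrain/karlo-v1-alpha-image-variations checkpoint, a CUDA device, and a placeholder input image path:

```python
import torch
from diffusers import UnCLIPImageVariationPipeline
from diffusers.utils import load_image

# Model id is an assumption based on the Karlo image-variations release.
pipe = UnCLIPImageVariationPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# load_image accepts a local path or URL; the path below is a placeholder.
init_image = load_image("path/to/input.png")

images = pipe(
    init_image,
    num_images_per_prompt=2,
    decoder_guidance_scale=8.0,
).images

for i, image in enumerate(images):
    image.save(f"variation_{i}.png")
```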

ImagePipelineOutput

class diffusers.ImagePipelineOutput


( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )


Output class for image pipelines.
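For illustration, a brief sketch of how the output is accessed with and without return_dict, reusing the text-to-image pipeline from the sketch above (the model id and prompt are assumptions):

```python
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha").to("cuda")

# Default (return_dict=True): an ImagePipelineOutput whose `images` attribute
# holds the generated PIL images.
output = pipe("a photo of a corgi wearing sunglasses")
image = output.images[0]

# With return_dict=False, a plain tuple is returned; its first element is the
# same list of images.
images = pipe("a photo of a corgi wearing sunglasses", return_dict=False)[0]
```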
