Diffusion Self-Guidance for Controllable Image Generation

UC Berkeley

Google Research

NeurIPS 2023

[Teaser figure. For prompts such as “a meatball and a donut falling from the clouds onto a neighborhood” and “a macaron and a croissant in the seine with the eiffel tower visible”, panels show the original sample alongside edits that move, resize, replace, or swap individual objects, restyle the image, copy the scene appearance, or copy the scene layout.]
TL;DR: Self-guidance is a method for controllable image generation that guides sampling using only the attention and activations of a pretrained diffusion model.

Without any extra models or training, you can move or resize objects, or even replace them with items from real images, without changing the rest of the scene. You can also borrow the appearance of another image or rearrange scenes into a desired layout.

Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling.

Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, composing objects from many images into one, and more. We also show that self-guidance can be used to edit real images.
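To make this concrete, here is a minimal sketch (in PyTorch, not the authors' released code) of how a self-guidance term could be folded into a guided sampling step: an energy computed from the model's internal cross-attention maps and activations is differentiated with respect to the noisy latent, and the resulting gradient nudges the noise prediction, just as in classifier guidance. The `model` interface, the `energy_fn` callback, and all weights here are illustrative assumptions.

```python
import torch

def self_guided_eps(model, x_t, t, prompt_emb, energy_fn,
                    cfg_scale=7.5, sg_weight=200.0, sigma_t=1.0):
    """One guided noise prediction: classifier-free guidance plus a
    self-guidance gradient computed from the model's own internals.

    energy_fn(attn, feats) -> scalar energy over cross-attention maps and
    activations (e.g. penalizing an object's centroid for being far from a
    target location). `model` is assumed to return (eps, attn, feats).
    """
    x_t = x_t.detach().requires_grad_(True)

    # Unconditional and conditional passes; the conditional pass also exposes
    # internal cross-attention maps and feature activations (assumed interface,
    # e.g. captured with forward hooks on the denoiser's attention layers).
    eps_uncond, _, _ = model(x_t, t, cond=None)
    eps_cond, attn, feats = model(x_t, t, cond=prompt_emb)

    # Standard classifier-free guidance on the noise prediction.
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    # Self-guidance: differentiate the property energy w.r.t. the noisy latent
    # and push the prediction in the direction that lowers the energy.
    g = energy_fn(attn, feats)
    grad = torch.autograd.grad(g, x_t)[0]
    return eps + sg_weight * sigma_t * grad
```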

Results

Move and resize objects

Using self-guidance to change only the properties of one object, we can move or resize that object without modifying the rest of the image. Pick a prompt and an edit and explore for yourself.

[Interactive demo. Prompts: “a _raccoon in a barrel_ going down a waterfall”, “distant shot of the tokyo tower with a _massive sun_ in the sky”, “a fluffy cat sitting on a museum bench looking at an _oil painting of cheese_”. Edits: move ↑, move ↓, move ←, move →, shrink, enlarge.]
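These spatial edits can be phrased as energies over simple per-token properties of the cross-attention maps. The sketch below, assuming an attention tensor of shape (H, W, K) over K prompt tokens, illustrates two such properties, a centroid and a soft size, and corresponding move and resize energies; the exact shapes and normalizations are assumptions, not the paper's implementation.

```python
import torch

def centroid(attn, k):
    """Attention-weighted mean image coordinate of token k, in [0, 1]^2 as (x, y)."""
    a = attn[..., k]
    a = a / (a.sum() + 1e-8)
    H, W = a.shape
    ys = torch.linspace(0.0, 1.0, H, device=a.device).view(H, 1)
    xs = torch.linspace(0.0, 1.0, W, device=a.device).view(1, W)
    return torch.stack([(a * xs).sum(), (a * ys).sum()])

def size(attn, k):
    """Rough fraction of the image covered by token k's (normalized) attention."""
    a = attn[..., k]
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)
    return a.mean()

def move_energy(attn, k, target_xy):
    """Pull token k's centroid toward a target location (move an object)."""
    return (centroid(attn, k) - target_xy).abs().sum()

def resize_energy(attn, k, scale, size_orig):
    """Pull token k's size toward `scale` times its original size (shrink/enlarge)."""
    return (size(attn, k) - scale * size_orig).abs()
```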

Appearance transfer from real images

By guiding the appearance of a generated object to match that of an object in a real image, we can create scenes depicting an object from real life, similarly to DreamBooth, but without any fine-tuning and using only one image.

[Interactive demo. Prompts: “a photo of a chow chow wearing a ... outfit” (edits: “purple wizard”, “chef”, “superman”) and “a DSLR photo of a teapot...” (edits: “floating in milk”, “pouring tea”, “floating in the sea”). Each result is shown alongside the real reference image and a DreamBooth baseline.]
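One way to express an object's appearance, roughly in the spirit described above, is as a spatial average of feature maps masked by the object's attention; matching this vector to one extracted from a real photo pulls the generated object's look toward the reference. The sketch below assumes feature maps of shape (C, H, W) and the attention convention from the earlier sketch; the soft-mask threshold is an illustrative choice.

```python
import torch

def appearance(feats, attn, k, thresh=0.5):
    """Appearance of token k: spatial mean of features under a soft attention mask.

    feats: (C, H, W) activations; attn: (H, W, K) cross-attention maps.
    """
    a = attn[..., k]
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)
    mask = torch.sigmoid(10.0 * (a - thresh))                # soft, differentiable mask
    weighted = feats * mask.unsqueeze(0)                     # keep only object pixels
    return weighted.sum(dim=(1, 2)) / (mask.sum() + 1e-8)    # (C,) appearance vector

def appearance_transfer_energy(feats, attn, k, target_app):
    """Match token k's appearance vector to one extracted from a real image."""
    return (appearance(feats, attn, k) - target_app).abs().sum()
```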

Real image editing

Our method also enables the spatial manipulation of objects in real images.

[Interactive demo. Prompts: “an _eclair_ and a shot of espresso” (edits: shrink width, reconstruct, move, enlarge, restyle) and “a hot dog, fries, and a _soda_ on a solid background” (edits: make narrow and tall, restyle, shrink width, reconstruct, swap soda and fries).]
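A plausible way to set this up, reusing the hypothetical property helpers sketched above, is to record every object's properties from the real image and then guide all of them back to their recorded values except the one being edited. The weights and the reference layout below are assumptions for illustration.

```python
def real_edit_energy(attn, feats, ref_props, edit_token, target_xy,
                     w_keep=1.0, w_edit=3.0):
    """Edit one object in a real image while reconstructing everything else.

    ref_props: {token_idx: (centroid, size, appearance)} recorded by running the
    model on (an inversion of) the real image. Only `edit_token` is moved; every
    other property is guided back to its recorded value.
    """
    g = 0.0
    for k, (c_ref, s_ref, app_ref) in ref_props.items():
        if k == edit_token:
            # Edited object: relocate it, but keep its size and look unchanged.
            g = g + w_edit * (centroid(attn, k) - target_xy).abs().sum()
        else:
            # Untouched objects: reconstruct their original positions.
            g = g + w_keep * (centroid(attn, k) - c_ref).abs().sum()
        g = g + w_keep * (size(attn, k) - s_ref).abs()
        g = g + w_keep * (appearance(feats, attn, k) - app_ref).abs().sum()
    return g
```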

Sample new appearances

By guiding object shapes toward reconstruction of an image's layout, we can sample new appearances for a given scene. We compare to ControlNet v1.1-Depth and Prompt-to-Prompt. Switch between the different styles below.

[Interactive demo. Prompts: “a bear wearing a suit eating his birthday cake out of the fridge in a dark kitchen” and “a parrot riding a horse down a city street”. Options: appearance 1, appearance 2, appearance 3, controlnet, prompt-to-prompt.]
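A layout-reconstruction energy of this kind can be as simple as an elementwise comparison between the current cross-attention maps and ones recorded from the original image, leaving appearance free to vary with the random seed. The L1 distance below is an illustrative choice, not necessarily the paper's.

```python
def layout_energy(attn, ref_attn, token_indices):
    """Match each token's attention map to a recorded reference layout."""
    g = 0.0
    for k in token_indices:
        g = g + (attn[..., k] - ref_attn[..., k]).abs().mean()
    return g
```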

Mix-and-match

By guiding samples to take object shapes from one image and appearance from another, we can rearrange images into layouts from other scenes. We can also sample new layouts of a scene by only guiding appearance. Find your favorite combination below.

[Interactive demo. Prompt: “a suitcase, a bowling ball, and a phone washed up on a beach after a shipwreck”. Layout options: #1, #2, #3, #4, random #1, random #2. Panels: Appearance, Layout, Combined.]
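Combining the two signals amounts to summing a layout term computed against one reference image with appearance terms computed against another, as in this sketch built on the hypothetical helpers above; the relative weights are assumptions.

```python
def mix_and_match_energy(attn, feats, layout_ref_attn, appearance_refs,
                         token_indices, w_layout=1.0, w_app=1.0):
    """Layout from one reference image, per-object appearance from another."""
    g = w_layout * layout_energy(attn, layout_ref_attn, token_indices)
    for k in token_indices:
        g = g + w_app * (appearance(feats, attn, k) - appearance_refs[k]).abs().sum()
    return g
```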

Compositional generation

A new scene can be created by collaging individual objects from different images (the first three columns in the figure below). Alternatively, when objects cannot be combined at their original locations because the source images' layouts are incompatible (as in the last example, marked *), we can borrow only their appearance and specify the layout with a new image to produce a composition (the last two columns).

[Figure. Three examples: “a picnic blanket, a fruit tree, and a car by the lake”; “a top-down photo of a tea kettle, a bowl of fruit, and a cup of matcha”; and “a dog wearing a knit sweater and a baseball cap drinking a cocktail”. For each, the first three columns show the objects taken from separate images (e.g. Take blanket, Take tree, Take car), followed by the collaged Result, a + Target layout reference, and the Final result. In the last example (Result marked *), the objects' original positions are incompatible, so only their appearances are borrowed.]
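One way to write this down, again reusing the hypothetical helpers above, is a sum of per-object appearance terms (each against its own source image), optional per-object shape terms, and an optional global layout term from a separate reference image; all weights and the reference format are illustrative.

```python
def composition_energy(attn, feats, per_object_refs, layout_ref_attn=None,
                       w_app=1.0, w_shape=1.0, w_layout=1.0):
    """Compose objects from several source images into one scene.

    per_object_refs: {token_idx: (ref_attn_map_or_None, ref_appearance)}, each
    recorded from that object's own source image. If an object's original
    position clashes with the others, pass None for its attention map and let
    `layout_ref_attn` (from a separate image) dictate the overall arrangement.
    """
    g = 0.0
    for k, (ref_map, ref_app) in per_object_refs.items():
        g = g + w_app * (appearance(feats, attn, k) - ref_app).abs().sum()
        if ref_map is not None:   # keep this object's shape from its source image
            g = g + w_shape * (attn[..., k] - ref_map).abs().mean()
    if layout_ref_attn is not None:   # borrow the overall layout from another image
        g = g + w_layout * (attn - layout_ref_attn).abs().mean()
    return g
```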

Manipulating non-objects

The properties of any word in the input prompt can be manipulated, not only nouns. Here, we show examples of relocating adjectives and verbs. The last example shows a case in which additional self-guidance can correct improper attribute binding.

[Figure. Move “laughing” right: “a cat and a monkey laughing on a road” (Original vs. Modified). Change where “messy” applies: “a messy room” (at (0.3, 0.6) vs. at (0.8, 0.8)). Move “red” to the jacket and “yellow” to the shoes: “green hat, blue book, yellow shoes, red jacket” (Original vs. Fixed).]

Limitations

Setting high guidance weights for appearance terms tends to introduce unwanted leakage of object position. Similarly, while heavily guiding the shape of one word matches that object’s layout as expected, high guidance on all token shapes leaks appearance information. Finally, in some cases, objects are entangled in attention space, making it difficult to control them independently.

[Figure. Appearance features leak layout: “a squirrel trying to catch a lime mid-air” (unguided vs. lime guided). Multi-token layout leaks appearance: “a picture of a cake” (real image vs. layout guided). Interacting objects are entangled: “a potato sitting on a couch with a bowl of popcorn watching football” (original vs. move potato).]

Citation

@inproceedings{epstein2023selfguidance,
  title={Diffusion Self-Guidance for Controllable Image Generation},
  author={Epstein, Dave and Jabri, Allan and Poole, Ben and Efros, Alexei A. and Holynski, Aleksander},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}

Acknowledgements

We thank Oliver Wang, Jason Baldridge, Lucy Chai, and Minyoung Huh for their helpful comments. Dave is supported by the PD Soros Fellowship. Dave and Allan conducted part of this research at Google, with additional funding provided by DARPA MCS and ONR MURI.