Multi-concept Customization of Text-to-Image Diffusion
Custom Diffusion
While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together?
We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models. We find that optimizing only a small number of parameters in the text-to-image conditioning mechanism is sufficient to represent new concepts while enabling fast tuning. Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple new concepts in novel, unseen settings.
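As a minimal sketch of what "optimizing only a small number of parameters in the conditioning mechanism" can look like in practice, the snippet below freezes a Stable Diffusion UNet except for its cross-attention key/value projections. It assumes the Hugging Face diffusers `UNet2DConditionModel`, where cross-attention blocks are named `attn2` with `to_k`/`to_v` linear layers; the model ID and learning rate are illustrative assumptions, not the released training code.

```python
# Sketch: train only the cross-attention key/value projections of a
# text-to-image UNet (the layers that map text embeddings into image features).
# Layer names ("attn2", "to_k", "to_v") follow the diffusers naming convention.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"  # illustrative base model
)

trainable = []
for name, param in unet.named_parameters():
    if "attn2.to_k" in name or "attn2.to_v" in name:
        param.requires_grad = True
        trainable.append(param)
    else:
        param.requires_grad = False

print(f"Training {sum(p.numel() for p in trainable) / 1e6:.1f}M "
      f"of {sum(p.numel() for p in unet.parameters()) / 1e6:.1f}M parameters")

optimizer = torch.optim.AdamW(trainable, lr=1e-5)
# ...the usual diffusion training loop (noise-prediction loss on the few
# user-provided images, plus any regularization images) would go here.
```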
Our method is fast (~6 minutes on 2 A100 GPUs) and has low storage requirements (75 MB) for each additional concept model apart from the pretrained model. This can be further compressed to 5–15 MB by saving only a low-rank approximation of the weight updates.
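The sketch below shows one way such a low-rank compression of a weight update can be done with a truncated SVD; the rank, tensor shapes, and helper names are illustrative assumptions rather than the released implementation.

```python
# Sketch: compress a fine-tuned weight update delta_W = W_new - W_old by
# keeping only its top-r singular components, then rebuild it at load time.
import torch

def compress_update(w_new: torch.Tensor, w_old: torch.Tensor, rank: int = 16):
    """Return low-rank factors (A, B) such that A @ B approximates w_new - w_old."""
    delta = (w_new - w_old).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out_dim, rank), singular values folded in
    b = vh[:rank, :]             # shape (rank, in_dim)
    return a, b

def decompress_update(w_old: torch.Tensor, a: torch.Tensor, b: torch.Tensor):
    """Rebuild the fine-tuned weight from the stored low-rank factors."""
    return w_old + a @ b

# Example with a cross-attention-projection-sized matrix (hypothetical shapes):
w_old = torch.randn(320, 768)
w_new = w_old + 0.01 * torch.randn(320, 8) @ torch.randn(8, 768)  # low-rank change
a, b = compress_update(w_new, w_old, rank=16)
print("stored values:", a.numel() + b.numel(), "vs full:", w_new.numel())
```

Storing only `(A, B)` per updated layer instead of the full matrices is what brings the per-concept checkpoint down by roughly an order of magnitude.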
Single-Concept Results
We show results of our fine-tuning method on various categories of new/personalized concepts, including scenes, styles, pets, personal toys, and objects. For more generations and comparisons with concurrent methods, please refer to our Gallery page.