Guidance on Training Stable Diffusion Models for Image Generation with Multiple Object Categories
What worked for me was doing a single training run with three concepts, one for each thing I wanted to merge. For instance, I wanted to make photos of myself (Chad) and my fiancée, Courtney, in different famous places, and since the famous places are already pretty much trained into the base model, I just needed to put our likenesses into the new model I was training. But when I trained a concept for her and a concept for me in the same run, it wouldn't combine us properly: it would produce a horrific blend of the two of us, or her with my facial hair, or me with her body… shudder, no thank you!
My workaround, which turned out to work almost every time with little to no unexpected or unwanted results, while still maintaining the model's ability to change our appearances (e.g., both with silly mustaches, both wearing formal wear) and still letting me prompt for just one, or both, of us, was to use a concept list with three concepts (see the sketch after the list):
“photo of Chad” for the instance and “photo of a man” for the class,
“photo of Courtney” for the instance and “photo of a lady” for the class,
and “photo of Chad and Courtney” for the instance and “photo of a couple” for the class.
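To make that concrete, here's a minimal sketch of how I'd express that concept list in code. The field names follow the convention used by the popular multi-concept DreamBooth Colabs (e.g., the ShivamShrirao fork of the diffusers training script); the directory paths are hypothetical, so point them at wherever your photos actually live.

```python
import json

# Hypothetical directory layout; adjust the paths to your own folders.
concepts_list = [
    {
        "instance_prompt": "photo of Chad",
        "class_prompt": "photo of a man",
        "instance_data_dir": "/content/data/chad",            # 20 instance photos
        "class_data_dir": "/content/data/man",                # 2,000 class images
    },
    {
        "instance_prompt": "photo of Courtney",
        "class_prompt": "photo of a lady",
        "instance_data_dir": "/content/data/courtney",
        "class_data_dir": "/content/data/lady",
    },
    {
        "instance_prompt": "photo of Chad and Courtney",
        "class_prompt": "photo of a couple",
        "instance_data_dir": "/content/data/chad_and_courtney",
        "class_data_dir": "/content/data/couple",
    },
]

# Multi-concept DreamBooth trainers typically read this file via a
# concepts-list argument instead of single instance/class prompt flags.
with open("concepts_list.json", "w") as f:
    json.dump(concepts_list, f, indent=4)
```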
I used 20 photos for each instance and 2,000 class images per concept (so 6,000 total). I trained that at a learning rate of 1e-6, constant schedule, 0 warm-up steps, and let it train one-to-one with the class images, meaning I did 6,000 steps. I saved a copy of the diffusers model every 500 steps, because I wanted to see where the best spot was. Turns out that 6,000, on the nose, was the best.
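Here's a sketch of what that launch looks like, assuming the same multi-concept fork of the diffusers DreamBooth script as above. The flag names are that fork's (the mainline diffusers script uses --checkpointing_steps rather than --save_interval and has no --concepts_list flag), and the model path and output directory are placeholders.

```python
import subprocess

# A minimal sketch of the training launch with the settings described above.
subprocess.run([
    "accelerate", "launch", "train_dreambooth.py",
    "--pretrained_model_name_or_path", "path/or/hub-id/of/diffusers-base-model",  # placeholder
    "--output_dir", "/content/output",
    "--concepts_list", "concepts_list.json",
    "--with_prior_preservation",
    "--prior_loss_weight", "1.0",
    "--resolution", "512",
    "--train_batch_size", "1",
    "--learning_rate", "1e-6",      # constant learning rate
    "--lr_scheduler", "constant",
    "--lr_warmup_steps", "0",
    "--num_class_images", "2000",   # per concept, so 6,000 total
    "--max_train_steps", "6000",    # one step per class image
    "--save_interval", "500",       # snapshot the diffusers model every 500 steps
], check=True)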
I've done one of these on the Automatic1111 webui DreamBooth extension, as well as on Google Colab, and both came out very similar. For the base model, I usually use any photorealistic model I can find on Hugging Face that's already laid out as a diffusers model, as the Google Colab requires, because I'm too lazy to convert my favorite safetensors checkpoints to diffusers and then upload my own model to reference. The Automatic1111 webui makes that part easier, but if you don't appropriately label your 1.5, 2.0, 2.1, and XL models, and whether they're fp16, bf16, or full/float32 models, it quickly gets confusing as to why your training keeps failing: the tensor shapes are different and won't jibe.
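For what it's worth, the safetensors-to-diffusers conversion is less painful than it sounds. A minimal sketch, assuming a recent diffusers release that supports from_single_file (for SDXL checkpoints you'd use StableDiffusionXLPipeline instead), with a hypothetical filename:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a single-file SD 1.x/2.x checkpoint (.safetensors or .ckpt).
pipe = StableDiffusionPipeline.from_single_file(
    "my_favorite_model.safetensors",   # hypothetical filename
    torch_dtype=torch.float16,         # match the checkpoint's precision
)

# Writes the diffusers folder layout (unet/, vae/, text_encoder/, ...)
# that the Colab trainer expects as its base model.
pipe.save_pretrained("my_favorite_model_diffusers")
```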
I don't know what objects you're training, or whether they're people, but that's how I solved my problem. When I prompt with the above-mentioned model, I can use "photo of chad" or "photo of courtney" or "photo of chad standing next to a ___ with courtney standing next to a ____", and it works remarkably well every single time.
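Prompting the finished model is just standard diffusers inference. A minimal sketch, assuming the trainer wrote its step-6000 snapshot to a subfolder of the output directory (the path is hypothetical, and the landmark is just filling in one of the blanks above):

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path to the 6,000-step snapshot from the training run above.
pipe = StableDiffusionPipeline.from_pretrained(
    "/content/output/6000", torch_dtype=torch.float16
).to("cuda")

# Any of the three instance prompts works, alone or combined.
image = pipe(
    "photo of chad standing next to the Eiffel Tower with courtney",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("chad_and_courtney.png")
```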
Hope this helps.