Hyper-Realistic Human Generation with Latent Structural Diffusion

Zero-Shot Evaluation on MS-COCO 2014 Validation Human. We compare our model with recent SOTA general T2I models (Stable Diffusion v1.5, v2.0, v2.1; SDXL; DeepFloyd-IF) and controllable methods (ControlNet; T2I-Adapter; HumanSD). Because SDXL produces artistic-style images at 512x512 and IF only creates fixed-size images, for these two methods we first generate 1024x1024 results and then resize them to 512x512. We bold the best and underline the second-best results for clarity. Our improvements over the second-best method are shown in red.

Evaluation Curves on MS-COCO 2014 Validation Human Subset. We show FID-CLIP (left) and FID_CLIP-CLIP (right) curves with the CFG scale ranging from 4.0 to 20.0 for all methods.

User Preference Comparisons. We report the ratio of users who prefer our model to each baseline.