SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Jierun Chen1,2,* Dongting Hu1,3,* Xijie Huang1,2,* Huseyin Coskun1 Arpit Sahni1 Aarush Gupta1 Anujraaj Goyal1 Dishani Lahiri1 Rajesh Singh1 Yerlan Idelbayev1 Junli Cao1 Yanyu Li1 Kwang-Ting Cheng2 Mingming Gong3,4 S.-H. Gary Chan2 Sergey Tulyakov1 Anil Kag1,† Yanwu Xu1,† Jian Ren1,†
1Snap Inc. 2HKUST 3The University of Melbourne 4MBZUAI
*Equal contribution †Equal advising
SnapGen is the first image generation model (379M parameters) that can synthesize high-resolution (1024x1024) images on mobile devices in 1.4s, achieving 0.66 on the GenEval metric.
Abstract
Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024² px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256² px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model with merely 379M parameters surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7× smaller than SDXL, 14× smaller than IF-XL).
Efficient Architecture
We conduct an in-depth examination of network architectures, including the denoising UNet and the Autoencoder (AE), to obtain an optimal trade-off between latency and performance. Unlike prior works that optimize and compress pre-trained diffusion models, we directly focus on macro- and micro-level design choices to arrive at a novel architecture that greatly reduces model size and computational complexity while preserving high-quality generation.
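To make this concrete, the sketch below illustrates one common mobile-oriented design pattern: depthwise-separable convolutional residual blocks, with self-attention enabled only at low spatial resolutions where its cost is affordable. The module names, layer sizes, and the specific combination of operations here are illustrative assumptions for exposition, not the actual SnapGen architecture described in the paper.

```python
# Hypothetical sketch of an efficiency-oriented denoising stage.
# NOT the exact SnapGen architecture; layer choices are illustrative only.
import torch
import torch.nn as nn


class SeparableResBlock(nn.Module):
    """Residual block built from a depthwise + pointwise convolution pair."""

    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.GroupNorm(8, channels)
        self.act = nn.SiLU()

    def forward(self, x):
        h = self.act(self.norm(self.depthwise(x)))
        return x + self.pointwise(h)


class EfficientStage(nn.Module):
    """One UNet stage: cheap conv blocks, with self-attention used only
    when the feature map is small enough to keep attention cost low."""

    def __init__(self, channels: int, use_attention: bool):
        super().__init__()
        self.block = SeparableResBlock(channels)
        self.attn = (nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
                     if use_attention else None)

    def forward(self, x):
        x = self.block(x)
        if self.attn is not None:
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
            attn_out, _ = self.attn(tokens, tokens, tokens)
            x = x + attn_out.transpose(1, 2).reshape(b, c, h, w)
        return x


if __name__ == "__main__":
    # High-resolution stage: convs only; low-resolution stage: convs + attention.
    hi_res = EfficientStage(channels=64, use_attention=False)
    lo_res = EfficientStage(channels=64, use_attention=True)
    x = torch.randn(1, 64, 64, 64)
    print(lo_res(hi_res(x)).shape)   # torch.Size([1, 64, 64, 64])
```

The design choice illustrated here is the usual latency trade-off on mobile hardware: convolutions scale well with resolution, while attention is reserved for coarse feature maps.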
Efficient Training
We introduce several improvements to train a compact T2I model from scratch. We propose multi-level knowledge distillation with timestep-aware scaling to combine multiple training objectives. We then perform step distillation by combining adversarial training with knowledge distillation from a few-step teacher model.
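As a rough illustration of how such objectives can be combined, the sketch below implements a multi-level distillation loss with a simple timestep-dependent weight. The scaling function, the feature levels being matched, and the tensor shapes are assumptions made for exposition; they do not reproduce the exact training objectives of SnapGen, and the adversarial step-distillation term is omitted here.

```python
# Hypothetical sketch of multi-level knowledge distillation with
# timestep-aware scaling. The weighting function and feature levels are
# illustrative assumptions, not the exact SnapGen training objectives.
import torch
import torch.nn.functional as F


def timestep_scale(t: torch.Tensor, t_max: int = 1000) -> torch.Tensor:
    """Illustrative scaling: weight distillation terms more at noisier timesteps."""
    return (t.float() / t_max).clamp(min=0.1)


def multi_level_kd_loss(student_out, teacher_out,
                        student_feats, teacher_feats, t):
    """Combine output-level and feature-level distillation, scaled per timestep.

    student_feats / teacher_feats: lists of intermediate feature maps, one per
    level, assumed here to be already projected to matching shapes.
    """
    scale = timestep_scale(t).view(-1, 1, 1, 1)

    # Output-level distillation: match the teacher's prediction.
    loss = F.mse_loss(student_out, teacher_out)

    # Feature-level distillation across several intermediate levels.
    for s_f, t_f in zip(student_feats, teacher_feats):
        loss = loss + (scale * (s_f - t_f).pow(2)).mean()
    return loss


if __name__ == "__main__":
    b = 2
    t = torch.randint(0, 1000, (b,))
    student_out = torch.randn(b, 4, 32, 32)
    teacher_out = torch.randn(b, 4, 32, 32)
    student_feats = [torch.randn(b, 64, 16, 16), torch.randn(b, 128, 8, 8)]
    teacher_feats = [f + 0.01 * torch.randn_like(f) for f in student_feats]
    print(multi_level_kd_loss(student_out, teacher_out,
                              student_feats, teacher_feats, t).item())
```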
Mobile Demo on iPhone 16 Pro-Max
Quantitative Comparison
Human evaluation vs. SDXL, SD3-Medium and SD3.5-Large:
Comparison with existing T2I models across various benchmarks:
Qualitative Results
Few-Step Visualization:
More Visualization Comparison:
BibTeX
```bibtex
@article{hu2024snapgen,
title = {SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training},
author = {Dongting Hu and Jierun Chen and Xijie Huang and Huseyin Coskun and Arpit Sahni and Aarush Gupta and Anujraaj Goyal and Dishani Lahiri and Rajesh Singh and Yerlan Idelbayev and Junli Cao and Yanyu Li and Kwang-Ting Cheng and S.-H. Chan and Mingming Gong and Sergey Tulyakov and Anil Kag and Yanwu Xu and Jian Ren},
journal = {arXiv:2412.09619 [cs.CV]},
year = {2024}
}
```