Hierarchical Patch Diffusion Models for High-Resolution Video Generation


Abstract

Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs), a diffusion paradigm which models the distribution of patches rather than whole inputs, keeping only a small fraction of the original pixels. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion: an architectural technique that propagates context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101, surpassing recent methods by more than 100%. Then, we show that it can be rapidly fine-tuned from a base low-resolution generator for high-resolution text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture trained on such high resolutions entirely end-to-end.
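To make the patch-based idea concrete, the following is a minimal, hypothetical NumPy sketch of extracting a patch pyramid: level 0 covers the whole input at low resolution, and each finer level keeps the same patch size while doubling the effective resolution, so only a small fraction of pixels is ever processed. All names and the exact sampling scheme here are illustrative assumptions, not the paper's implementation (in particular, the real model nests patch locations across levels, while this sketch draws them independently).

```python
import numpy as np

def make_patch_pyramid(img, levels=3, patch=8, rng=None):
    """Hypothetical sketch: a hierarchy of same-sized patches.

    Level 0 sees the whole image downsampled to patch x patch;
    each finer level covers a 2x smaller window at 2x higher
    effective resolution, keeping the patch size fixed.
    """
    rng = rng or np.random.default_rng(0)
    H, W = img.shape[:2]
    pyramid = []
    for lvl in range(levels):
        # Spatial window covered by this level: whole image at level 0,
        # shrinking by a factor of 2 per level (assumed scale factor).
        win_h, win_w = H // (2 ** lvl), W // (2 ** lvl)
        y = int(rng.integers(0, H - win_h + 1))
        x = int(rng.integers(0, W - win_w + 1))
        crop = img[y:y + win_h, x:x + win_w]
        # Naive strided subsampling down to a fixed patch size
        # (a stand-in for proper anti-aliased resizing).
        sy, sx = max(win_h // patch, 1), max(win_w // patch, 1)
        pyramid.append(crop[::sy, ::sx][:patch, :patch])
    return pyramid

img = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
pyr = make_patch_pyramid(img)
print([p.shape for p in pyr])  # every level has the same patch size
```

Note how the per-level cost is constant: every level is an 8x8 patch regardless of the full input resolution, which is what makes training efficient.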

Existing video diffusion paradigms

Comparing existing diffusion paradigms: Latent Diffusion Model (LDM) (upper left), Cascaded Diffusion Model (CDM) (bottom left), and Patch Diffusion Model (this work) during training (upper right) and inference (bottom right). In our work, we develop hierarchical patch diffusion, which never operates on full-resolution inputs; instead, it optimizes the lower stages of the hierarchy to produce spatially aligned context information for the later pyramid levels, enforcing global consistency between patches.
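The "spatially aligned context" idea can be sketched as follows: the coarse level's feature map is upsampled, and the region that spatially overlaps a fine-level patch is fused into that patch's features. This is a hypothetical simplification (nearest-neighbour upsampling, additive fusion, a fixed scale factor of 2); the paper's deep context fusion operates on intermediate network activations rather than raw arrays.

```python
import numpy as np

def fuse_context(coarse_feat, fine_feat, y0, x0, scale=2):
    """Hypothetical context-fusion step (assumed scale factor 2).

    Upsamples the coarse feature map and adds the region spatially
    aligned with the fine patch (whose top-left corner sits at
    (y0, x0) in upsampled coordinates) to the fine patch's features.
    """
    # Nearest-neighbour upsampling of the coarse features.
    up = coarse_feat.repeat(scale, axis=0).repeat(scale, axis=1)
    h, w = fine_feat.shape
    aligned = up[y0:y0 + h, x0:x0 + w]  # crop at the fine patch's location
    return fine_feat + aligned          # simple additive fusion (assumption)

coarse = np.ones((8, 8))
fine = np.zeros((8, 8))
fused = fuse_context(coarse, fine, 4, 4)
print(fused.shape)  # (8, 8)
```

Because the fused context is cropped at the fine patch's own location, neighbouring patches receive overlapping views of the same coarse signal, which is what ties them together globally.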


Architecture overview

Architecture overview of HPDM for a 3-level pyramid. The model is trained to denoise all the patches jointly. During training, we use only a single patch from each pyramid level and restrict information propagation to a coarse-to-fine direction. This allows one to synthesize the whole image (or video) at a given resolution patch-by-patch using tiled inference.
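The tiled-inference loop described above can be sketched in a few lines: the full-resolution output is assembled tile by tile, with each tile produced by a per-patch denoiser conditioned on the spatially aligned coarse context. The denoiser below is a trivial stand-in (it just averages the context), and all names are illustrative assumptions; the point is only the coarse-to-fine, patch-by-patch control flow.

```python
import numpy as np

def tiled_generate(coarse, patch=8, denoise=None):
    """Hypothetical tiled inference over a coarse context map.

    Walks the output in non-overlapping patch-sized tiles; each tile
    is generated from noise, conditioned on the coarse context at the
    same spatial location.
    """
    if denoise is None:
        # Stand-in denoiser: fills the tile with the context's mean value.
        denoise = lambda noise, ctx: np.full_like(noise, ctx.mean())
    H, W = coarse.shape
    out = np.empty((H, W))
    rng = np.random.default_rng(0)
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            noise = rng.standard_normal((patch, patch))
            ctx = coarse[y:y + patch, x:x + patch]  # spatially aligned context
            out[y:y + patch, x:x + patch] = denoise(noise, ctx)
    return out

frame = tiled_generate(np.zeros((32, 32)))
print(frame.shape)  # (32, 32)
```

Since every tile is the same fixed size, peak memory stays constant no matter how large the target resolution is; only the number of loop iterations grows.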


Quantitative results

Method          FVD↓     Inception Score↑
MoCoGAN-HD      700      33.95
TATS            635      57.63
VIDM            294.7    -
PVDM            343.6    74.4
Make-A-Video    81.25    82.55
HPDM-S          344.5    73.73
HPDM-M          143.1    84.29
HPDM-L          66.32    87.68

Note: please use the latest version of Chrome/Chromium or Safari to watch the videos (alternatively, you can download a video and watch it offline). Some of the videos may display incorrectly in other web browsers (e.g., Firefox).


Video generation results on UCF101

HPDM 64x256x256 (ours; random samples)

PVDM 128x256x256 (provided samples)

PVDM 16x256x256 (provided samples)

DIGAN 128x128x128 (provided samples)

StyleGAN-V 128x256x256 (provided samples)


Text-to-video generation results (our prompts)

HPDM-T2V (ours) --- "A robot planting a tree."

HPDM-T2V (ours) --- "A high-definition video of a pack of wolves hunting in a snowy forest, natural behavior, dynamic angles."

HPDM-T2V (ours) --- "A detailed animation of an ancient Egyptian city, with the Nile river and pyramids, 4K, historically accurate."

HPDM-T2V (ours) --- "A 4K time-lapse of a blooming rose, showing each stage of the flower opening."


Text-to-video generation results (comparison)

A confused grizzly bear in calculus class.

Humans building a highway on mars, highly detailed.

Sailboat sailing on a sunny day in a mountain lake, highly detailed.

A panda bear driving a car.

A panda eating bamboo on a rock.

A shark swimming in clear Caribbean ocean.

A stunning aerial drone footage time lapse of El Capitan in Yosemite National Park at sunset.

A teddy bear skating in Times Square.

A cute rabbit is eating grass, wildlife photography, photograph, high quality, wildlife, f 1.8, soft focus, 8k, award - winning photograph.

A very happy fuzzy panda dressed as a chef eating pizza in the New York street food truck.

The supernova explosion of a white dwarf in the universe, photo realistic, 8k, cinematic lighting, hd, atmospheric, hyperdetailed, photography, glow effect.

An epic tornado attacking above a glowing city at night, the tornado is made of smoke, highly detailed.

Glass sphere filled with swirling multicolored liquid, cinematic lighting.

A high quality 3D render of hyperrealist, super strong, multicolor stripped, and fluffy bear with wings, highly detailed, sharp focus.
