4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion
Chaoyang Wang1,*, Peiye Zhuang1,*, Tuan Duc Ngo1,2,*, Willi Menapace1, Aliaksandr Siarohin1, Michael Vasilkovsky1, Ivan Skorokhodov1, Sergey Tulyakov1, Peter Wonka1,3, Hsin-Ying Lee1,†
1Snap Inc., 2UMass Amherst, 3KAUST
Author Contributions
Chaoyang: Architecture design, 4D video model implementation & training, video upsampler training, data processing, evaluation, paper writing, demo preparation.
Peiye: Technical discussion, base video model training, data processing, evaluation.
Tuan: 4D reconstruction, evaluation, demo preparation.
Willi & Aliaksandr: Technical discussion, base video model implementation.
Michael: Data preparation.
Ivan & Sergey: Technical discussion.
Peter: Technical advisory, paper writing.
Hsin-Ying: Technical advisory, paper writing, demo preparation.
*main contributor, †project lead
4Real-Video is a diffusion model that generates 4D video -- a grid of video frames with both time and viewpoint axes.
Each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint.
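To make the grid layout concrete, here is a minimal tensor sketch (PyTorch assumed; the shapes and variable names are illustrative, not taken from the paper):

```python
import torch

T, V = 8, 8            # timesteps (rows) and viewpoints (columns)
C, H, W = 3, 256, 256  # a hypothetical frame resolution

# grid[t, v] is the frame at timestep t seen from viewpoint v.
grid = torch.randn(T, V, C, H, W)

freeze_time_video = grid[0]    # row 0: one timestep, all viewpoints
fixed_view_video = grid[:, 0]  # column 0: one viewpoint, all timesteps
```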
Animate Real-World Scenes in 4D
Given a casually captured real-world scene, 4Real-Video can transform it into 4D animations driven by text prompts.
(Click on the images below to select input scenes and prompts.)
"a yellow toy building brick turn into a cat."
"move the truck to the right."
"a cat pillow dancing."
"a ghostly toy dancing."
"a ghostly toy dancing."
Animate 3D Assets
4Real-Video can also animate 3D assets seamlessly across multiple views.
(Click on the images below to select input 3D assets.)
Generate 4D Videos from Text
Finally, 4Real-Video can create 4D videos directly from text input.
"A panda eating ice-cream."
"A bulldog wearing a black pirate hat eating candy."
Interactive 4D Viewer
Click on the thumbnails below to explore reconstructed 4D scenes, represented as deformable 3D Gaussian Splats, directly in your browser, powered by Brush.
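For intuition, a deformable 3D Gaussian Splat scene can be thought of as static per-Gaussian parameters plus a time-conditioned deformation field. The sketch below (PyTorch; the names, parameterization, and small MLP are assumptions for illustration, not the reconstruction code used here) deforms only the Gaussian means over time.

```python
import torch
import torch.nn as nn

class DeformableGaussians(nn.Module):
    """Canonical 3D Gaussians plus a time-conditioned deformation MLP."""

    def __init__(self, num_gaussians: int):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_gaussians, 3))      # xyz
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.rotations = nn.Parameter(torch.randn(num_gaussians, 4))  # quaternion
        self.opacities = nn.Parameter(torch.zeros(num_gaussians, 1))
        self.colors = nn.Parameter(torch.rand(num_gaussians, 3))
        # Small MLP mapping (canonical position, time) -> positional offset.
        self.deform = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))

    def means_at(self, t: float) -> torch.Tensor:
        # Offset the canonical means to their positions at timestep t.
        time = torch.full((self.means.shape[0], 1), t)
        return self.means + self.deform(torch.cat([self.means, time], dim=-1))
```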
4Real-Video Overview: Left: we initialize the grid of frames with a (generated or real) fixed-viewpoint video in the first column and a freeze-time video in the first row. Middle: our architecture consists of two parallel token streams. The top part processes \(\mathbf{x}_l^\text{v}\) with viewpoint updates and the bottom part processes \(\mathbf{x}_l^\text{t}\) with temporal updates. A synchronization layer then computes the new tokens \(\mathbf{x}_{l+1}^\text{v}\) and \(\mathbf{x}_{l+1}^\text{t}\) for the next layer of the diffusion transformer. Right: we propose two implementations of the synchronization layer: hard and soft synchronization.
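The synchronization step in the caption can be sketched in code. Below is a minimal PyTorch sketch of one two-stream layer; the module names, the averaging rule for hard synchronization, and the learned mixing weight `alpha` for soft synchronization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SyncTwoStreamLayer(nn.Module):
    """One diffusion-transformer layer with two token streams and a sync step."""

    def __init__(self, dim: int, num_heads: int = 8, hard_sync: bool = True):
        super().__init__()
        # Viewpoint stream: attention across viewpoints at a fixed timestep.
        self.view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal stream: attention across timesteps at a fixed viewpoint.
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.hard_sync = hard_sync
        # Learned mixing weight for the soft variant (assumed form).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x_v, x_t):
        # x_v, x_t: (batch, tokens, dim). For brevity we omit the reshaping
        # that restricts attention to one row (viewpoints) or one column
        # (timesteps) of the frame grid.
        h_v, _ = self.view_attn(x_v, x_v, x_v)  # viewpoint update
        h_t, _ = self.time_attn(x_t, x_t, x_t)  # temporal update
        if self.hard_sync:
            # Hard synchronization: both streams continue from one fused state.
            fused = 0.5 * (h_v + h_t)
            return fused, fused
        # Soft synchronization: each stream keeps its own state and mixes in
        # a learned fraction of the other stream's update.
        return h_v + self.alpha * h_t, h_t + self.alpha * h_v
```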
More Text-to-4D Video Generation Results
Each example below pairs fixed-view videos (left) with the corresponding freeze-time videos (right).
Comparison with MotionCtrl and SV4D
MotionCtrl struggles to generate temporally coherent videos because it produces freeze-time frames independently, neglecting temporal dependencies; it also often generates minimal camera motion even when a high camera speed is given as input. SV4D, in turn, produces poor visual quality because realistic scene generation falls outside its training domain. In contrast, our method generates realistic and temporally coherent frame grids, yielding high-quality video outputs.
The videos below compare, side by side: 4Real-Video (Ours), MotionCtrl, and SV4D.