4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion
Chaoyang Wang1,*, Peiye Zhuang1,*, Tuan Duc Ngo1,2,*, Willi Menapace1, Aliaksandr Siarohin1, Michael Vasilkovsky1, Ivan Skorokhodov1, Sergey Tulyakov1, Peter Wonka1,3, Hsin-Ying Lee1,†
1Snap Inc., 2UMass Amherst, 3KAUST
Author Contributions
Chaoyang: Architecture design, 4D video model implementation & training, video upsampler training, data processing, evaluation, paper writing, demo preparation.
Peiye: Technical discussion, base video model training, data processing, evaluation.
Tuan: 4D reconstruction, evaluation, demo preparation.
Willi & Aliaksandr: Technical discussion, base video model implementation.
Michael: Data preparation.
Ivan & Sergey: Technical discussion.
Peter: Technical advisory, paper writing.
Hsin-Ying: Technical advisory, paper writing, demo preparation.
*main contributor, †project lead
4Real-Video is a diffusion model that generates 4D video -- a grid of video frames with both time and viewpoint axes.
Each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint.
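To make the grid layout concrete, here is a minimal tensor sketch (PyTorch assumed; the shapes and variable names are illustrative, not taken from the paper):

```python
import torch

T, V = 8, 8            # timesteps (rows) and viewpoints (columns)
C, H, W = 3, 256, 256  # a hypothetical frame resolution

# grid[t, v] is the frame at timestep t seen from viewpoint v.
grid = torch.randn(T, V, C, H, W)

freeze_time_video = grid[0]    # row 0: one timestep, all viewpoints
fixed_view_video = grid[:, 0]  # column 0: one viewpoint, all timesteps
```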
Animate Real-World Scenes in 4D
Given a casually captured real-world scene, 4Real-Video can transform it into 4D animations driven by text prompts.
(Click on the images below to select input scenes and prompts.)
"a yellow toy building brick turn into a cat."
"move the truck to the right."
"a cat pillow dancing."
"a ghostly toy dancing."
"a ghostly toy dancing."
Animate 3D Assets
4Real-Video can also animate 3D assets seamlessly across multiple views.
(Click on the images below to select input 3D assets.)
Generate 4D Videos from Text
Finally, 4Real-Video can create 4D videos directly from text input.
"A panda eating ice-cream."
"A bulldog wearing a black pirate hat eating candy."
Interactive 4D Viewer
Click on the thumbnails below to explore reconstructed 4D scenes, represented as deformable 3D Gaussian Splats, directly in your browser, powered by Brush.
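For intuition, a deformable 3D Gaussian Splat scene can be thought of as static per-Gaussian parameters plus a time-conditioned deformation field. The sketch below (PyTorch; the names, parameterization, and small MLP are assumptions for illustration, not the reconstruction code used here) deforms only the Gaussian means over time.

```python
import torch
import torch.nn as nn

class DeformableGaussians(nn.Module):
    """Canonical 3D Gaussians plus a time-conditioned deformation MLP."""

    def __init__(self, num_gaussians: int):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_gaussians, 3))      # xyz
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.rotations = nn.Parameter(torch.randn(num_gaussians, 4))  # quaternion
        self.opacities = nn.Parameter(torch.zeros(num_gaussians, 1))
        self.colors = nn.Parameter(torch.rand(num_gaussians, 3))
        # Small MLP mapping (canonical position, time) -> positional offset.
        self.deform = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))

    def means_at(self, t: float) -> torch.Tensor:
        # Offset the canonical means to their positions at timestep t.
        time = torch.full((self.means.shape[0], 1), t)
        return self.means + self.deform(torch.cat([self.means, time], dim=-1))
```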
4Real-Video Overview: Left: we initialize the grid of frames with a (generated or real) fixed-viewpoint video in the first column and a freeze-time video in the first row. Middle: our architecture consists of two parallel token streams. The top part processes \(\mathbf{x}_l^\text{v}\) with viewpoint updates and the bottom part processes \(\mathbf{x}_l^\text{t}\) with temporal updates. A synchronization layer then computes the new tokens \(\mathbf{x}_{l+1}^\text{v}\) and \(\mathbf{x}_{l+1}^\text{t}\) for the next layer of the diffusion transformer. Right: we propose two implementations of the synchronization layer: hard and soft synchronization.
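The synchronization step in the caption can be sketched in code. Below is a minimal PyTorch sketch of one two-stream layer; the module names, the averaging rule for hard synchronization, and the learned mixing weight `alpha` for soft synchronization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SyncTwoStreamLayer(nn.Module):
    """One diffusion-transformer layer with two token streams and a sync step."""

    def __init__(self, dim: int, num_heads: int = 8, hard_sync: bool = True):
        super().__init__()
        # Viewpoint stream: attention across viewpoints at a fixed timestep.
        self.view_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal stream: attention across timesteps at a fixed viewpoint.
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.hard_sync = hard_sync
        # Learned mixing weight for the soft variant (assumed form).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x_v, x_t):
        # x_v, x_t: (batch, tokens, dim). For brevity we omit the reshaping
        # that restricts attention to one row (viewpoints) or one column
        # (timesteps) of the frame grid.
        h_v, _ = self.view_attn(x_v, x_v, x_v)  # viewpoint update
        h_t, _ = self.time_attn(x_t, x_t, x_t)  # temporal update
        if self.hard_sync:
            # Hard synchronization: both streams continue from one fused state.
            fused = 0.5 * (h_v + h_t)
            return fused, fused
        # Soft synchronization: each stream keeps its own state and mixes in
        # a learned fraction of the other stream's update.
        return h_v + self.alpha * h_t, h_t + self.alpha * h_v
```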
More Text-to-4D Video Generation Results
Each example below pairs fixed-view videos (left) with the corresponding freeze-time videos (right).
Comparison with MotionCtrl and SV4D
MotionCtrl struggles to generate temporally coherent videos because it produces freeze-time frames independently, neglecting temporal dependencies; it also often generates minimal camera motion even when a high camera speed is given as input. SV4D, in turn, produces poor visual quality because realistic scene generation falls outside its training domain. In contrast, our method generates realistic and temporally coherent frame grids, yielding high-quality video outputs.
The videos below compare, side by side: 4Real-Video (Ours), MotionCtrl, and SV4D.