Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Multimodal VoxCeleb Dataset

Text-to-video generation

Example videos generated by MMVID on the Multimodal VoxCeleb dataset for text-to-video generation. We show three synthesized videos for each input text condition.

Independent multimodal video generation

Example videos generated by MMVID on the Multimodal VoxCeleb dataset for independent multimodal video generation. The input control signals are text and a segmentation mask. We show two synthesized videos for each input multimodal condition.

Samples generated by MMVID conditioned on text and an artistic drawing.

Samples generated by MMVID conditioned on text and a partially observed image.

Dependent multimodal video generation

Example videos generated by MMVID on the Multimodal VoxCeleb dataset for dependent multimodal video generation. The input control signals are text, an image, and a segmentation mask. We show two synthesized videos for each input multimodal condition.

Samples generated by MMVID conditioned on text, an artistic drawing, and a segmentation mask.

Samples generated by MMVID conditioned on text, an image (used for appearance), and a video (used for motion guidance).

Textual Augmentation

Example videos generated by models trained with (w/ RoBERTa) and without (w/o RoBERTa) language embeddings from RoBERTa as text augmentation. Models are trained on the Multimodal VoxCeleb dataset for text-to-video generation. We show three synthesized videos for each input text condition.
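
This comparison hinges on conditioning the generator on sentence-level language embeddings from a pretrained RoBERTa model. The snippet below is a minimal sketch of how such features can be extracted with the Hugging Face transformers library; the "roberta-base" checkpoint and the mean-pooling step are illustrative assumptions, not necessarily the configuration used by MMVID.

```python
# Minimal sketch: extracting a RoBERTa language embedding for a caption.
# Assumptions: the "roberta-base" checkpoint and mean pooling over tokens
# are illustrative choices, not necessarily the paper's configuration.
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

caption = "She is a young woman with blond hair and she is smiling."
inputs = tokenizer(caption, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into a single sentence-level feature that
# can serve as an additional (augmenting) text representation.
text_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
print(text_embedding.shape)
```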

Moving Shapes Dataset

Text-to-video generation

Samples generated by our approach on the Moving Shapes dataset for text-to-video generation. We show three synthesized videos for each input text condition.

Independent multimodal video generation

Samples generated by our approach on the Moving Shapes dataset for independent multimodal generation. The input control signals are text and a partially observed image (with the center masked out, shown in white). We show two synthesized videos for each input multimodal condition.
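
The partially observed image here is simply a frame whose center region has been masked out. The snippet below is a small sketch of how such a conditioning image can be constructed; the 64x64 resolution and the 50% mask ratio are hypothetical values chosen only for illustration.

```python
# Minimal sketch: constructing a partially observed conditioning image
# whose center region is masked out (shown in white in the figures).
# The 64x64 resolution and 50% mask ratio are illustrative assumptions.
import numpy as np

def mask_center(image: np.ndarray, mask_ratio: float = 0.5) -> np.ndarray:
    """Return a copy of `image` with a centered square set to white."""
    h, w = image.shape[:2]
    mh, mw = int(h * mask_ratio), int(w * mask_ratio)
    top, left = (h - mh) // 2, (w - mw) // 2
    masked = image.copy()
    masked[top:top + mh, left:left + mw] = 255  # white = unobserved region
    return masked

# Example with a random RGB frame standing in for a Moving Shapes frame.
frame = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
partial = mask_center(frame)
```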

Dependent multimodal video generation

Samples generated by our approach on the Moving Shapes dataset for dependent multimodal generation. The input control signals are text and images. We show one synthesized video for each input multimodal condition.

iPER Dataset

Long sequence generation

Example videos generated by our approach on the iPER dataset for long sequence generation. The extrapolation process is repeated 100 times for each sequence, resulting in a 107-frame video. The textual input also controls the speed of the performed motion: "slow" indicates slow motion, while "fast" indicates fast motion. We show one synthesized video for each input text condition. The first video following the text input corresponds to the "slow" condition, the second to "normal", and the last to "fast".
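
As a rough illustration of the bookkeeping behind these long sequences: assuming an initial clip of 7 frames with one new frame appended per extrapolation step (an assumption consistent with the numbers above, since 7 + 100 = 107), the loop looks like the sketch below. The `generate_next_frame` helper is a dummy stand-in for the model's conditional generation step, not the actual MMVID interface.

```python
# Sketch of the long-sequence extrapolation bookkeeping.
# Assumptions: an initial clip of 7 frames, one new frame per step, and a
# dummy `generate_next_frame` standing in for the model's conditional
# generation. With 100 steps this yields 7 + 100 = 107 frames.
import numpy as np

def generate_next_frame(text, context_frames):
    """Dummy stand-in for the model: returns a blank frame shaped like the
    last context frame."""
    return np.zeros_like(context_frames[-1])

def extrapolate(text, initial_frames, num_steps=100, window=7):
    frames = list(initial_frames)              # starts with 7 frames
    for _ in range(num_steps):
        context = frames[-window:]             # condition on the most recent frames
        frames.append(generate_next_frame(text, context))
    return frames

initial = [np.zeros((128, 128, 3), dtype=np.uint8) for _ in range(7)]
video = extrapolate("a person performs a random pose, fast", initial)
print(len(video))  # 107
```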

Temporal Interpolation

Example videos of our approach for video interpolation on the iPER dataset.
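
One way to pose temporal interpolation in a conditional video generator is to treat a sparse set of observed frames as visual conditions and have the model fill in the masked frames between them. The sketch below only illustrates that index bookkeeping; the 16-frame target length and the observed indices are hypothetical choices for this example, and the actual conditioning interface used by MMVID may differ.

```python
# Illustrative sketch of the interpolation setup: given a few observed
# anchor frames, mark the remaining frame slots as "to be generated".
# The 16-frame length and observed indices {0, 15} are hypothetical values.

def interpolation_mask(num_frames=16, observed=(0, 15)):
    """Return a boolean list where True marks frames the model must fill in."""
    observed = set(observed)
    return [i not in observed for i in range(num_frames)]

mask = interpolation_mask()
print(mask.count(True))  # 14 in-between frames to synthesize
```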

Supplemental Materials

More supplemental videos can be found at this webpage.