Unsupervised Volumetric Animation
¹Snap Inc. ²University of Trento ³KAUST
*Work done while interning at Snap.
Abstract
We propose a novel approach for unsupervised 3D animation of non-rigid deformable objects. Our method learns the 3D structure and dynamics of objects solely from single-view RGB videos, and can decompose them into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework paired with a keypoint estimator via a differentiable PnP algorithm, our model learns the underlying object geometry and parts decomposition in an entirely unsupervised manner. This allows it to perform 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation. We primarily evaluate the framework on two video datasets: VoxCeleb 256² and TEDXPeople 256². In addition, on the Cats 256² image dataset, we show that it learns compelling 3D geometry even from still images. Finally, we show that our model can obtain animatable 3D objects from a single image or a few images.
Novel view synthesis:
Here we show a reconstruction made from a single image, rotated about the y-axis to display a wide range of novel views. We also show the corresponding depth, normals, and linear blend skinning (LBS) weights.
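For intuition on the two ingredients mentioned above, the sketch below shows a y-axis rotation matrix of the kind used to sweep novel views, and a minimal linear blend skinning warp, where each canonical 3D point is moved by a weighted combination of per-part rigid transforms. The function names, shapes, and API are our own illustrative choices and are not taken from the paper's implementation.

```python
import numpy as np

def rot_y(angle):
    """Rotation matrix about the y-axis (e.g. to sweep novel views)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def lbs_warp(points, weights, rotations, translations):
    """Linear blend skinning: warp each canonical point by a
    weighted sum of per-part rigid transforms.

    points: (N, 3) canonical 3D points
    weights: (N, P) skinning weights, each row summing to 1
    rotations: (P, 3, 3) per-part rotations
    translations: (P, 3) per-part translations
    """
    # Apply every part's transform to every point: (P, N, 3)
    per_part = np.einsum('pij,nj->pni', rotations, points) \
        + translations[:, None, :]
    # Blend the transformed copies with the skinning weights: (N, 3)
    return np.einsum('np,pni->ni', weights, per_part)
```

A point with all its weight on one part moves rigidly with that part; a point with weight split across parts lands between their two rigid destinations, which is what makes the weights visualizable as a soft part segmentation.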
Novel view synthesis on image dataset:
Here the model is trained on images only. Despite this, due to the 3D inductive bias provided by PnP, our method discovers meaningful geometry even in this challenging case.
Comparison of direct and PnP-based pose prediction:
We argue that the proposed framework, built around differentiable PnP, favors the discovery of correct 3D geometry. To show this, we provide qualitative samples from a model that uses PnP and from one that predicts the pose of each part directly with a neural network. In this experiment we use the result of the G-phase, where only a single part is learned. The Direct method learned flat geometry, while our PnP-based method produced plausible geometry with fine details, including hair and wrinkles.
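For intuition on what a PnP solve does, here is a minimal pure-NumPy sketch of the classical Direct Linear Transform (DLT) formulation, which recovers a projection matrix from 2D–3D keypoint correspondences. The paper uses a differentiable PnP layer inside training; this standalone solver and its function names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def dlt_pnp(pts3d, pts2d):
    """DLT PnP: recover a 3x4 projection matrix P (up to scale)
    such that [u, v, 1]^T ~ P @ [X, Y, Z, 1]^T, by taking the SVD
    of the stacked linear constraints. Needs >= 6 correspondences."""
    A = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        Xh = np.array([X, Y, Z, 1.0])
        # u = (p1 . Xh) / (p3 . Xh)  ->  p1 . Xh - u * (p3 . Xh) = 0
        A.append([*Xh, 0.0, 0.0, 0.0, 0.0, *(-u * Xh)])
        # v = (p2 . Xh) / (p3 . Xh)  ->  p2 . Xh - v * (p3 . Xh) = 0
        A.append([0.0, 0.0, 0.0, 0.0, *Xh, *(-v * Xh)])
    # Least-squares null vector = last right singular vector
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

def project(P, pts3d):
    """Project 3D points with P and divide by the homogeneous depth."""
    Xh = np.hstack([pts3d, np.ones((len(pts3d), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]
```

Because the pose comes out of a geometric solve rather than a free regression, the predicted keypoints must be consistent with a real 3D configuration to reproject correctly; this is the 3D inductive bias that the comparison above attributes to the PnP-based variant.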