Recurrent Encoder Multi-Decoder Networks, Semi-supervised Classification and Constrained Adversarial Generation

Recurrent semi-supervised classification and constrained adversarial generation with motion capture data

Image and Vision Computing

We explore recurrent encoder multi-decoder neural network architectures for semi-supervised sequence classification and reconstruction. We find that the use of multiple reconstruction modules helps models generalize in a classification task when only a small amount of labeled data is available, which is often the case in practice. Such models provide useful high-level representations of motions, allowing clustering, searching, and faster labeling of new sequences. We also propose a new, realistic partitioning of a well-known, high-quality motion-capture dataset for better evaluations. We further explore a novel formulation for future-predicting decoders based on conditional recurrent generative adversarial networks, for which we propose both soft and hard constraints for transition generation derived from desired physical properties of synthesized future movements and desired animation goals. We find that such constraints help stabilize the training of recurrent adversarial architectures for animation generation.
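
A minimal sketch of one way the "soft constraints" described above could enter a recurrent adversarial objective, assuming a velocity-continuity penalty at the seam between observed and generated frames; the weight `lambda_smooth` and the exact constraint are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def transition_smoothness_penalty(context, generated):
    """context:   (batch, T_ctx, joints*3) observed frames
       generated: (batch, T_gen, joints*3) frames from the generator"""
    v_last = context[:, -1] - context[:, -2]      # velocity of the last observed step
    v_first = generated[:, 0] - context[:, -1]    # velocity into the first generated frame
    # Penalize abrupt velocity changes across the transition boundary (assumed constraint).
    return torch.mean((v_first - v_last) ** 2)

def generator_loss(d_scores_fake, context, generated, lambda_smooth=10.0):
    # Non-saturating GAN loss plus the soft-constraint term (lambda_smooth is hypothetical).
    adv = F.binary_cross_entropy_with_logits(d_scores_fake, torch.ones_like(d_scores_fake))
    return adv + lambda_smooth * transition_smoothness_penalty(context, generated)
```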

Semi-supervised Learning with Encoder-Decoder Recurrent Neural Networks: Experiments with Motion Capture Sequences

Recent work on sequence-to-sequence translation using Recurrent Neural Networks (RNNs) based on Long Short-Term Memory (LSTM) architectures has shown great potential for learning useful representations of sequential data. A one-to-many encoder-decoder scheme allows a single encoder to provide representations serving multiple purposes. In our case, we present an LSTM encoder network able to produce representations used by two decoders: one that reconstructs, and one that classifies if the training sequence has an associated label. This allows the network to learn representations that are useful for both discriminative and reconstructive tasks at the same time. This paradigm is well suited to semi-supervised learning with sequences, and we test our proposed approach on an action recognition task using motion capture (MOCAP) sequences. We find that semi-supervised feature learning can improve state-of-the-art movement classification accuracy on the HDM05 action dataset. Further, we find that even when using only labeled data and a primarily discriminative objective, the addition of a reconstructive decoder can serve as a form of regularization that reduces over-fitting and improves test set accuracy.
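
A minimal sketch of the one-encoder / two-decoder scheme described above, assuming a single-layer LSTM encoder, a teacher-forced LSTM reconstruction decoder, and a linear classification head; layer sizes and the weight `alpha` are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class EncoderMultiDecoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)    # classifying decoder
        self.decoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, input_dim)          # reconstructing decoder

    def forward(self, x):                                        # x: (batch, T, input_dim)
        _, (h, c) = self.encoder(x)                              # shared representation
        logits = self.classifier(h[-1])
        # Reconstruction decoder initialized with the encoder state, teacher-forced on x.
        dec_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, (h, c))
        return logits, self.readout(dec_out)

def semi_supervised_loss(logits, labels, recon, x, labeled, alpha=1.0):
    # Reconstruction on every sequence; classification only where a label exists.
    loss = torch.mean((recon - x) ** 2)
    if labeled.any():
        loss = loss + alpha * nn.functional.cross_entropy(logits[labeled], labels[labeled])
    return loss
```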

Future Frame Prediction using Generative Adversarial Networks

Karbala International Journal of Modern Science, 2023

The ability of a human to anticipate what is going to happen in the near future, given the current situation, helps in making intelligent decisions about how to react. In this paper, we develop multiple deep neural network models intended to generate the next frame in a sequence given the previous frames. In recent years, Generative Adversarial Networks (GANs) have shown promising results in the field of image generation. Hence, we create and compare two generative adversarial models for future frame prediction by combining GANs with convolutional neural networks, Long Short-Term Memory (LSTM) networks, and convolutional LSTM networks. Building on the state of the art, we try to improve the results of our models both visually and numerically. We conclude by comparing the outputs of our two models with each other and with previously developed models, and by outlining future directions for research. Both models presented in this work perform well on certain aspects of future frame prediction. The results presented in this paper are relevant to future-prediction applications such as robotics, autonomous driving, and autonomous agent development.
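
A minimal sketch of a frame-prediction generator in the spirit of the models compared above. PyTorch has no built-in ConvLSTM cell, so this simplification encodes each frame with a small CNN, models time with an ordinary LSTM over flattened features, and decodes the last state into the next frame; resolutions and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, channels=1, feat=64 * 16 * 16, hidden=512):
        super().__init__()
        self.enc = nn.Sequential(                           # 64x64 -> 16x16
            nn.Conv2d(channels, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU())
        self.rnn = nn.LSTM(feat, hidden, batch_first=True)  # temporal dynamics
        self.fc = nn.Linear(hidden, feat)
        self.dec = nn.Sequential(                           # 16x16 -> 64x64
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 4, 2, 1), nn.Tanh())

    def forward(self, frames):                              # (B, T, C, 64, 64)
        b, t = frames.shape[:2]
        f = self.enc(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        _, (h, _) = self.rnn(f)
        return self.dec(self.fc(h[-1]).view(b, 64, 16, 16)) # predicted next frame
```

In the adversarial setting, this generator's output would then be scored against the ground-truth next frame by a convolutional discriminator.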

AnimGAN: A Spatiotemporally-Conditioned Generative Adversarial Network for Character Animation

2020

Producing realistic character animations is one of the essential tasks in human-AI interactions. Viewed as producing a sequence of poses of a humanoid, the task can be cast as a sequence generation problem with spatiotemporal smoothness and realism constraints. Additionally, we wish to control the behavior of AI agents by specifying what to do and, more specifically, how to do it. We propose a spatiotemporally-conditioned GAN that generates a sequence that is similar to a given sequence in terms of semantics and spatiotemporal dynamics. Using an LSTM-based generator and a graph ConvNet discriminator, the system is trained end-to-end on a large collected dataset of gestures, expressions, and actions. Experiments show that, compared to a traditional conditional GAN, our method creates plausible, realistic, and semantically relevant humanoid animation sequences that match user expectations.
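
A minimal sketch of the conditioning scheme described above, assuming an LSTM generator that maps a per-frame condition plus noise to poses, and a single-layer graph convolution over a fixed, normalized skeleton adjacency matrix as the discriminator; all sizes and the adjacency normalization are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CondLSTMGenerator(nn.Module):
    def __init__(self, cond_dim, noise_dim, pose_dim, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(cond_dim + noise_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, cond, noise):                 # both (B, T, *)
        h, _ = self.rnn(torch.cat([cond, noise], dim=-1))
        return self.out(h)                          # generated poses (B, T, pose_dim)

class GraphConvDiscriminator(nn.Module):
    def __init__(self, adjacency, in_feats=3, hidden=32):
        super().__init__()
        self.register_buffer("A", adjacency)        # (J, J) normalized skeleton graph (assumed given)
        self.w1 = nn.Linear(in_feats, hidden)
        self.w2 = nn.Linear(hidden, 1)

    def forward(self, poses):                       # (B, T, J, 3) joint positions
        x = torch.relu(self.A @ self.w1(poses))     # one graph-convolution step
        return (self.A @ self.w2(x)).mean(dim=(1, 2, 3))   # realism score per clip
```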

A Causal Convolutional Neural Network for Motion Modeling and Synthesis

2021

We propose a novel deep generative model based on causal convolutions for multi-subject motion modeling and synthesis, which is inspired by the success of WaveNet in multi-subject speech synthesis. However, it is nontrivial to adapt WaveNet to handle high-dimensional and physically constrained motion data. To this end, we add an encoder and a decoder to the WaveNet to translate the motion data into features and back to the predicted motions. We also add 1D convolution layers to take the skeleton configuration as an input to model skeleton variations across different subjects. As a result, our network can scale up well to large-scale motion data sets across multiple subjects and support various applications, such as random and controllable motion synthesis, motion denoising, and motion completion, in a unified way. Complex motions, such as punching, kicking, and kicking while punching, are also handled well. Moreover, our network can synthesize motions for novel skeletons not in the training data.
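
A minimal sketch of the causal-convolution building block the abstract refers to: left-padding by (kernel_size - 1) * dilation guarantees that the output at frame t depends only on frames up to t, and doubling the dilation per layer grows the receptive field exponentially, as in WaveNet. The encoder/decoder and skeleton-conditioning branches are omitted; depth and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                           # x: (batch, channels, frames)
        x = nn.functional.pad(x, (self.pad, 0))     # pad only on the left (the past)
        return self.conv(x)

# WaveNet-style stack: dilation doubles at each layer (illustrative depth and width).
stack = nn.Sequential(*[
    nn.Sequential(CausalConv1d(64, 64, dilation=2 ** i), nn.ReLU()) for i in range(6)
])
motion_features = torch.randn(8, 64, 120)           # a batch of 120-frame feature clips
out = stack(motion_features)                        # same length, strictly causal context
```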

Towards Image-to-Video Translation: A Structure-Aware Approach via Multi-stage Generative Adversarial Networks

International Journal of Computer Vision, 2020

In this paper, we consider the problem of image-to-video translation, where one or a set of input images are translated into an output video which contains motions of a single object. In particular, we focus on predicting motions conditioned by high-level structures, such as facial expression and human pose. Recent approaches are either condition-driven or temporal-based. Condition-driven approaches typically train transformation networks to generate future frames conditioned on the predicted structural sequence. Temporal-based approaches, on the other hand, have shown that short high-quality motions can be generated using 3D convolutional networks with temporal knowledge learned from massive training data. In this work, we combine the benefits of both approaches and propose a two-stage generative framework where videos are forecast from the structural sequence and then refined by temporal signals. To model motions more efficiently in the forecasting stage, we train networks with dense connections to learn residual motions between the current and future frames, which avoids learning motion-irrelevant details. To ensure temporal consistency in the refining stage, we adopt a ranking loss for adversarial training. We conduct extensive experiments on two image-to-video translation tasks: facial expression retargeting and human pose forecasting. Superior results over the state of the art on both tasks demonstrate the effectiveness of our approach.
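
A minimal sketch of one plausible reading of the ranking loss mentioned above for the refining stage: the discriminator is trained to rank real clips above refined (second-stage) clips and refined clips above coarse (first-stage) clips by a margin. The scores are assumed to come from any video discriminator; the paper's exact formulation may differ.

```python
import torch

margin_rank = torch.nn.MarginRankingLoss(margin=1.0)

def discriminator_ranking_loss(score_real, score_refined, score_coarse):
    ones = torch.ones_like(score_real)
    # Enforce the ordering real > refined > coarse (assumed interpretation).
    return (margin_rank(score_real, score_refined, ones) +
            margin_rank(score_refined, score_coarse, ones))
```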

Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks

2022

In the deep learning era, generating long, high-quality videos remains challenging due to the spatio-temporal complexity and continuity of videos. Prior works have attempted to model video distributions by representing videos as 3D grids of RGB values, which impedes the scale of generated videos and neglects continuous dynamics. In this paper, we find that the recently emerging paradigm of implicit neural representations (INRs), which encodes a continuous signal into a parameterized neural network, effectively mitigates the issue. By utilizing INRs of video, we propose the dynamics-aware implicit generative adversarial network (DIGAN), a novel generative adversarial network for video generation. Specifically, we introduce (a) an INR-based video generator that improves the motion dynamics by manipulating the space and time coordinates differently and (b) a motion discriminator that efficiently identifies unnatural motions without observing the entire long frame sequences. We demonstrate the superiority of DIGAN on various datasets, along with multiple intriguing properties, e.g., long video synthesis, video extrapolation, and non-autoregressive video generation. For example, DIGAN improves the previous state-of-the-art FVD score on UCF-101 by 30.7% and can be trained on 128-frame videos of 128×128 resolution, 80 frames longer than the 48 frames of the previous state-of-the-art method.
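
A minimal sketch of the implicit-neural-representation idea behind the generator described above: a video is a function of continuous (x, y, t) coordinates, so frames can be rendered at arbitrary times and resolutions by querying a coordinate grid. This toy MLP omits DIGAN's latent conditioning, positional encodings, and separate treatment of space and time; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())     # RGB in [0, 1]

    def forward(self, coords):                      # (N, 3) rows of (x, y, t)
        return self.net(coords)

# Render one 64x64 frame at time t = 0.5 by querying a grid of coordinates.
xs, ys = torch.meshgrid(torch.linspace(0, 1, 64), torch.linspace(0, 1, 64), indexing="ij")
coords = torch.stack([xs, ys, torch.full_like(xs, 0.5)], dim=-1).reshape(-1, 3)
frame = VideoINR()(coords).reshape(64, 64, 3)
```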

Learning to Forecast and Refine Residual Motion for Image-to-Video Generation

Lecture Notes in Computer Science, 2018

We consider the problem of image-to-video translation, where an input image is translated into an output video containing motions of a single object. Recent methods for such problems typically train transformation networks to generate future frames conditioned on the structure sequence. Parallel work has shown that short high-quality motions can be generated by spatiotemporal generative networks that leverage temporal knowledge from the training data. We combine the benefits of both approaches and propose a two-stage generation framework where videos are generated from structures and then refined by temporal signals. To model motions more efficiently, we train networks to learn residual motion between the current and future frames, which avoids learning motion-irrelevant details. We conduct extensive experiments on two image-to-video translation tasks: facial expression retargeting and human pose forecasting. Superior results over the state-of-the-art methods on both tasks demonstrate the effectiveness of our approach.
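
A minimal sketch of the residual-motion idea described above: the network predicts only the change relative to the current frame, given the frame and a structure map (e.g. a pose heatmap), so it does not have to re-learn static, motion-irrelevant content. The tiny network and its inputs are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ResidualMotionPredictor(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # Input: current frame concatenated with a one-channel structure map (assumed).
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, frame, structure):            # (B, C, H, W) and (B, 1, H, W)
        residual = self.net(torch.cat([frame, structure], dim=1))
        return frame + residual                     # future frame = current + predicted change
```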

MoDi: Unconditional Motion Synthesis from Diverse Data

arXiv, 2022

Figure 1. Our generative model is learned in an unsupervised setting from a diverse, unstructured and unlabeled motion dataset and yields a highly semantic, clustered, latent space that facilitates synthesis operations. An encoder and a mapping network enable the employment of real and generated motions, respectively.

The Pose Knows: Video Forecasting by Generating Pose Futures

2017 IEEE International Conference on Computer Vision (ICCV), 2017

Current approaches in video forecasting attempt to generate videos directly in pixel space using Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). However, since these approaches try to model all the structure and scene dynamics at once, in unconstrained settings they often generate uninterpretable results. Our insight is to model the forecasting problem at a higher level of abstraction. Specifically, we exploit human pose detectors as a free source of supervision and break the video forecasting problem into two discrete steps. First, we explicitly model the high-level structure of the active objects in the scene, namely humans, and use a VAE to model the possible future movements of humans in the pose space. We then use the generated future poses as conditioning information for a GAN that predicts the future frames of the video in pixel space. By using the structured space of pose as an intermediate representation, we sidestep the problems that GANs have in generating video pixels directly. We show through quantitative and qualitative evaluation that our method outperforms state-of-the-art methods for video prediction.
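
A minimal sketch of the first stage described above, assuming a VAE over future pose trajectories: past poses are encoded into a context, a latent is inferred from the observed future during training, and the decoder generates future poses from (context, latent); at test time the latent is sampled from the prior. Layer sizes are illustrative, and the second-stage GAN that renders pixels from poses is omitted.

```python
import torch
import torch.nn as nn

class PoseVAE(nn.Module):
    def __init__(self, pose_dim, hidden=128, z_dim=32):
        super().__init__()
        self.past_enc = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.future_enc = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(2 * hidden, z_dim)
        self.to_logvar = nn.Linear(2 * hidden, z_dim)
        self.decoder = nn.LSTM(z_dim + hidden, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, pose_dim)

    def forward(self, past, future):                 # (B, T_past, D), (B, T_future, D)
        _, (hp, _) = self.past_enc(past)
        _, (hf, _) = self.future_enc(future)
        h = torch.cat([hp[-1], hf[-1]], dim=-1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)      # reparameterization
        t = future.size(1)
        dec_in = torch.cat([z, hp[-1]], dim=-1).unsqueeze(1).expand(-1, t, -1)
        dec_out, _ = self.decoder(dec_in)
        return self.readout(dec_out), mu, logvar     # future poses + terms for the KL loss
```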