An Adaptive Viewpoint Transformation Network for 3D Human Pose Estimation (original) (raw)

Learning Monocular 3D Human Pose Estimation from Multi-view Images

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Accurate 3D human pose estimation from single images is possible with sophisticated deep-net architectures that have been trained on very large datasets. However, this still leaves open the problem of capturing motions for which no such database exists. Manual annotation is tedious, slow, and error-prone. In this paper, we propose to replace most of the annotations by the use of multiple views, at training time only. Specifically, we train the system to predict the same pose in all views. Such a consistency constraint is necessary but not sufficient to predict accurate poses. We therefore complement it with a supervised loss aiming to predict the correct pose in a small set of labeled images, and with a regularization term that penalizes drift from initial predictions. Furthermore, we propose a method to estimate camera pose jointly with human pose, which lets us utilize multiview footage where calibration is difficult, e.g., for pan-tilt or moving handheld cameras. We demonstrate the effectiveness of our approach on established benchmarks, as well as on a new Ski dataset with rotating cameras and expert ski motion, for which annotations are truly hard to obtain.

A Simple Yet Effective Baseline for 3d Human Pose Estimation

2017 IEEE International Conference on Computer Vision (ICCV), 2017

Following the success of deep convolutional networks, state-of-the-art methods for 3d human pose estimation have focused on deep end-to-end systems that predict 3d joint locations given raw image pixels. Despite their excellent performance, it is often not easy to understand whether their remaining error stems from a limited 2d pose (visual) understanding, or from a failure to map 2d poses into 3dimensional positions. With the goal of understanding these sources of error, we set out to build a system that given 2d joint locations predicts 3d positions. Much to our surprise, we have found that, with current technology, "lifting" ground truth 2d joint locations to 3d space is a task that can be solved with a remarkably low error rate: a relatively simple deep feedforward network outperforms the best reported result by about 30% on Human3.6M, the largest publicly available 3d pose estimation benchmark. Furthermore, training our system on the output of an off-the-shelf state-of-the-art 2d detector (i.e., using images as input) yields state of the art results-this includes an array of systems that have been trained end-to-end specifically for this task. Our results indicate that a large portion of the error of modern deep 3d pose estimation systems stems from their visual analysis, and suggests directions to further advance the state of the art in 3d human pose estimation.

Learning Camera Viewpoint Using CNN to Improve 3D Body Pose Estimation

2016 Fourth International Conference on 3D Vision (3DV), 2016

The objective of this work is to estimate 3D human pose from a single RGB image. Extracting image representations which incorporate both spatial relation of body parts and their relative depth plays an essential role in accurate 3D pose reconstruction. In this paper, for the first time, we show that camera viewpoint in combination to 2D joint locations significantly improves 3D pose accuracy without the explicit use of perspective geometry mathematical models. To this end, we train a deep Convolutional Neural Network (CNN) to learn categorical camera viewpoint. To make the network robust against clothing and body shape of the subject in the image, we utilized 3D computer rendering to synthesize additional training images. We test our framework on the largest 3D pose estimation benchmark, Human3.6m, and achieve up to 20% error reduction compared to the state-of-the-art approaches that do not use body part segmentation 1 .

Learning Latent Representations of 3D Human Pose with Deep Neural Networks

Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images or 2D joint location heatmaps that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and accounts for joint dependencies. We further propose an efficient Long Short-Term Memory (LSTM) network to enforce temporal consistency on 3D pose predictions. We demonstrate that our approach achieves state-of-the-art performance both in terms of structure preservation and prediction accuracy on standard 3D human pose estimation benchmarks.

Three stage deep network for 3D human pose reconstruction by exploiting spatial and temporal data via its 2D pose

Journal of Visual Communication and Image Representation, 2020

3D Human Pose Reconstruction (HPR) is a challenging task due to less availability of 3D ground truth data and projection ambiguity. To address these limitations, we propose a three-stage deep network having the workflow of 2D Human Pose Estimation (HPE) followed by 3D HPR; which utilizes the proposed Frame Specific Pose Estimation (FSPE), Multi-Stage Cascaded Feature Connection (MSCFC) and Feature Residual Connection (FRC) Sub-level Strategies. In the first stage, the FSPE concept with the MSCFC strategy has been used for 2D HPE. In the second stage, the basic deep learning concepts like convolution, batch normalization, ReLU, and dropout have been utilized with the FRC Strategy for spatial 3D reconstruction. In the last stage, LSTM deep architecture has been used for temporal refinement. The effectiveness of the technique has been demonstrated on MPII, Human3.6M, and HumanEva-I datasets. From the experiments, it has been observed that the proposed method gives competitive results to the recent state-of-the-art techniques. ✩ This paper has been recommended for acceptance by Zicheng Liu.

Adapted Human Pose: Monocular 3D Human Pose Estimation with Zero Real 3D Pose Data

2021

The ultimate goal for an inference model is to be robust and functional in real life applications. However, training vs. test data domain gaps often negatively affect model performance. This issue is especially critical for the monocular 3D human pose estimation problem, in which 3D human data is often collected in a controlled lab setting. In this paper, we focus on alleviating the negative effect of domain shift by presenting our adapted human pose (AHuP) approach that addresses adaptation problems in both appearance and pose spaces. AHuP is built around a practical assumption that in real applications, data from target domain could be inaccessible or only limited information can be acquired. We illustrate the 3D pose estimation performance of AHuP in two scenarios. First, when source and target data differ significantly in both appearance and pose spaces, in which we learn from synthetic 3D human data (with zero real 3D human data) and show comparable performance with the state-o...

3D human pose estimation from depth maps using a deep combination of poses

Journal of Visual Communication and Image Representation

Many real-world applications require the estimation of human body joints for higher-level tasks as, for example, human behaviour understanding. In recent years, depth sensors have become a popular approach to obtain three-dimensional information. The depth maps generated by these sensors provide information that can be employed to disambiguate the poses observed in two-dimensional images. This work addresses the problem of 3D human pose estimation from depth maps employing a Deep Learning approach. We propose a model, named Deep Depth Pose (DDP), which receives a depth map containing a person and a set of predefined 3D prototype poses and returns the 3D position of the body joints of the person. In particular, DDP is defined as a ConvNet that computes the specific weights needed to linearly combine the prototypes for the given input. We have thoroughly evaluated DDP on the challenging 'ITOP' and 'UBC3V' datasets, which respectively depict realistic and synthetic samples, defining a new state-of-the-art on them.

Fast Single-Person 2D Human Pose Estimation Using Multi-Task Convolutional Neural Networks

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper presents a novel neural module for enhancing existing fast and lightweight 2D human pose estimation CNNs, in order to increase their accuracy. A baseline stem CNN is augmented by a collateral module, which is tasked to encode global spatial and semantic information and provide it to the stem network during inference. The latter one outputs the final 2D human pose estimations. Since global information encoding is an inherent subtask of 2D human pose estimation, this particular setup allows the stem network to better focus on the local details of the input image and on precisely localizing each body joint, thus increasing overall 2D human pose estimation accuracy. Furthermore, the collateral module is designed to be lightweight, adding negligible runtime computational cost, so that the unified architecture retains the fast execution property of the stem network. Evaluation of the proposed method on public 2D human pose estimation datasets shows that it increases the accuracy of different baseline stem CNNs, while outperforming all competing fast 2D human pose estimation methods.

Deep, Robust and Single Shot 3D Multi-Person Human Pose Estimation from Monocular Images

2019 IEEE International Conference on Image Processing (ICIP), 2019

In this paper, we propose a new single shot method for multi-person 3D human pose estimation in complex images. The model jointly learns to locate the human joints in the image, to estimate their 3D coordinates and to group these predictions into full human skeletons. The proposed method deals with a variable number of people and does not need bounding boxes to estimate the 3D poses. It leverages and extends the Stacked Hourglass Network and its multiscale feature learning to manage multi-person situations. Thus, we exploit a robust 3D human pose formulation to fully describe several 3D human poses even in case of strong occlusions or crops. Then, joint grouping and human pose estimation for an arbitrary number of people are performed using the associative embedding method. Our approach significantly outperforms the state of the art on the challenging CMU Panoptic. Furthermore, it leads to good results on the complex and synthetic images from the newly proposed JTA Dataset.

Structured Prediction of 3D Human Pose with Deep Neural Networks

Procedings of the British Machine Vision Conference 2016, 2016

Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from image to 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images that relies on an overcomplete auto-encoder to learn a high-dimensional latent pose representation and account for joint dependencies. We demonstrate that our approach outperforms state-of-the-art ones both in terms of structure preservation and prediction accuracy.