DeepHPS: End-to-end Estimation of 3D Hand Pose and Shape by Learning from Synthetic Depth

Simple and effective deep hand shape and pose regression from a single depth image

Computers & Graphics, 2019

Simultaneously estimating the 3D shape and pose of a hand in real time is a new and challenging computer graphics problem, which is important for animation and for interacting with 3D objects in virtual environments with personalized hand shapes. CNN-based direct hand pose estimation methods are the state-of-the-art approaches, but they can only regress a 3D hand pose from a single depth image. In this study, we developed a simple and effective real-time CNN-based direct regression approach for simultaneously estimating the 3D hand shape and pose, as well as structure constraints for both egocentric and third-person viewpoints, by learning from synthetic depth. In addition, we produced the first million-scale egocentric synthetic dataset, called SynHandEgo, which contains egocentric depth images with accurate shape and pose annotations, as well as color segmentations of the hand parts. Our network is trained on combined real and synthetic datasets with full supervision of the hand pose and structure constraints, and semi-supervision of the hand mesh. Our approach performed better than the state-of-the-art methods on the SynHand5M synthetic dataset in terms of both 3D shape and pose recovery. By learning simultaneously from real and synthetic data, we demonstrated the feasibility of hand mesh recovery on two real hand pose datasets, i.e., BigHand2.2M and NYU. Moreover, our method obtained more accurate estimates of 3D hand poses on the NYU dataset compared with existing methods that output more than the joint positions. The SynHandEgo dataset has been made publicly available to promote further research in the emerging domain of hand shape and pose recovery from egocentric viewpoints (https://bit.ly/2WMWM5u).
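
The abstract does not spell out how the structure constraints enter training; as a rough illustration of a bone-length term of the kind described (finger lengths and joint distances along the kinematic chain), here is a minimal PyTorch sketch. The 21-joint ordering, the bone list, and the L1 penalty are assumptions, not the paper's exact formulation.

```python
import torch

# Hypothetical parent-child bone pairs over 21 joints (wrist + 4 joints
# per finger); the actual joint ordering used in the paper may differ.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4),        # thumb
         (0, 5), (5, 6), (6, 7), (7, 8),        # index
         (0, 9), (9, 10), (10, 11), (11, 12),   # middle
         (0, 13), (13, 14), (14, 15), (15, 16), # ring
         (0, 17), (17, 18), (18, 19), (19, 20)] # little

def bone_length_loss(pred_joints, gt_joints):
    """Penalize deviation of predicted bone lengths from ground truth.

    pred_joints, gt_joints: (B, 21, 3) tensors of 3D joint positions.
    """
    parents = torch.tensor([p for p, _ in BONES])
    children = torch.tensor([c for _, c in BONES])
    pred_len = (pred_joints[:, children] - pred_joints[:, parents]).norm(dim=-1)
    gt_len = (gt_joints[:, children] - gt_joints[:, parents]).norm(dim=-1)
    return (pred_len - gt_len).abs().mean()
```

In a combined objective, a term like this would be weighted against the direct joint-position loss; the weighting is a training hyperparameter not given in the abstract.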

HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose Estimation From a Single Depth Map

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. The state-of-the-art methods directly regress 3D hand meshes from 2D depth images via 2D convolutional neural networks, which leads to artefacts in the estimations due to perspective distortions in the images. In contrast, we propose a novel architecture with 3D convolutions trained in a weakly-supervised manner. The input to our method is a 3D voxelized depth map, and we rely on two hand shape representations. The first one is the 3D voxelized grid of the shape, which is accurate but does not preserve the mesh topology or the number of mesh vertices. The second representation is the 3D hand surface, which is less accurate but does not suffer from the limitations of the first representation. We combine the advantages of these two representations by registering the hand surface to the voxelized hand shape. In extensive experiments, the proposed approach improves over the state of the art by 47.8% on the SynHand5M dataset. Moreover, our augmentation policy for voxelized depth maps further enhances the accuracy of 3D hand pose estimation on real data. Our method produces visually more reasonable and realistic hand shapes on the NYU and BigHand2.2M datasets compared to the existing approaches.
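
As background for the voxelized input the paper describes, the sketch below back-projects a depth map into a point cloud and bins it into a binary occupancy grid. The grid resolution, cube size, and centering on the point-cloud centroid are illustrative assumptions; the paper's preprocessing may differ.

```python
import numpy as np

def voxelize_depth(depth, fx, fy, cx, cy, grid=88, cube_mm=300.0):
    """Back-project a depth map (in mm) with pinhole intrinsics and bin the
    resulting points into a cubic occupancy grid centered on the hand."""
    v, u = np.nonzero(depth)               # pixels with valid (non-zero) depth
    z = depth[v, u].astype(np.float32)
    x = (u - cx) * z / fx                  # pinhole back-projection
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    center = pts.mean(axis=0)              # crude hand center: point centroid
    idx = np.floor((pts - center + cube_mm / 2) / cube_mm * grid).astype(int)
    keep = np.all((idx >= 0) & (idx < grid), axis=1)
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    vox[tuple(idx[keep].T)] = 1.0          # mark occupied voxels
    return vox
```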

HandVoxNet++: 3D Hand Shape and Pose Estimation using Voxel-Based Neural Networks

IEEE Transactions on Pattern Analysis and Machine Intelligence

3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. Existing methods addressing it directly regress hand meshes via 2D convolutional neural networks, which leads to artifacts due to perspective distortions in the images. To address the limitations of the existing methods, we develop HandVoxNet++, i.e., a voxel-based deep network with 3D and graph convolutions trained in a fully supervised manner. The input to our network is a 3D voxelized depth map based on the truncated signed distance function (TSDF). HandVoxNet++ relies on two hand shape representations. The first one is the 3D voxelized grid of the hand shape, which does not preserve the mesh topology and is the most accurate representation. The second representation is the hand surface, which preserves the mesh topology. We combine the advantages of both representations by aligning the hand surface to the voxelized hand shape either with a new neural Graph-Convolutions-based Mesh Registration (GCN-MeshReg) or with a classical segment-wise Non-Rigid Gravitational Approach (NRGA++), which does not rely on training data. In extensive evaluations on three public benchmarks, i.e., SynHand5M, the depth-based HANDS19 challenge and HO-3D, the proposed HandVoxNet++ achieves state-of-the-art performance. In this journal extension of our previous approach presented at CVPR 2020, we gain 41.09% and 13.7% higher shape alignment accuracy on the SynHand5M and HANDS19 datasets, respectively. Our method ranked first on the HANDS19 challenge dataset (Task 1: Depth-Based 3D Hand Pose Estimation) at the moment of the submission of our results to the portal in August 2020.
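
To make the TSDF-based input concrete, the following sketch computes a projective TSDF volume from a single depth map: each voxel center is projected into the image and assigned the truncated signed distance to the observed surface along the ray. The grid resolution, cube size, and truncation band are assumptions for illustration only, not the paper's settings.

```python
import numpy as np

def projective_tsdf(depth, fx, fy, cx, cy, grid=64, cube_mm=300.0, trunc_mm=30.0):
    """Projective TSDF of a depth map (in mm) over a cube centered on the
    hand: values in [-1, 1], positive in front of the surface."""
    # Estimate the hand center from the valid depth pixels.
    v, u = np.nonzero(depth)
    z = depth[v, u].astype(np.float32)
    center = np.array([((u - cx) * z / fx).mean(),
                       ((v - cy) * z / fy).mean(), z.mean()])
    # Voxel centers in camera space.
    t = (np.arange(grid) + 0.5) / grid * cube_mm - cube_mm / 2
    X, Y, Z = np.meshgrid(t + center[0], t + center[1], t + center[2],
                          indexing='ij')
    # Project each voxel center into the depth image.
    px = np.clip(np.round(X * fx / Z + cx).astype(int), 0, depth.shape[1] - 1)
    py = np.clip(np.round(Y * fy / Z + cy).astype(int), 0, depth.shape[0] - 1)
    d = depth[py, px].astype(np.float32)
    sdf = d - Z                    # signed distance along the viewing ray
    sdf[d == 0] = trunc_mm         # no depth measurement: treat as free space
    return np.clip(sdf / trunc_mm, -1.0, 1.0)
```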

Structure-Aware 3D Hand Pose Regression from a Single Depth Image

Virtual Reality and Augmented Reality, 2018

Hand pose tracking in 3D is an essential task for many virtual reality (VR) applications such as games and manipulating virtual objects with bare hands. CNN-based learning methods achieve state-of-the-art accuracy by directly regressing the 3D pose from a single depth image. However, the 3D pose estimated by these methods is coarse and kinematically unstable due to independent learning of sparse joint positions. In this paper, we propose a novel structure-aware CNN-based algorithm which learns to automatically segment the hand from a raw depth image and estimate the 3D hand pose jointly with new structural constraints. The constraints include finger lengths, distances between joints along the kinematic chain, and inter-finger distances. Learning these constraints helps maintain a structural relation between the estimated joint keypoints. Also, we convert the sparse representation of the hand skeleton to a dense one by performing n-point interpolation between pairs of parent and child joints. Through comprehensive evaluation, we show the effectiveness of our approach and demonstrate competitive performance against the state-of-the-art methods on the public NYU hand pose dataset.
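
A minimal sketch of the n-point interpolation step described above, assuming `parents[i]` gives the parent index of joint i (-1 for the root); the value of n and the joint layout are placeholders, not the paper's configuration.

```python
import numpy as np

def densify_skeleton(joints, parents, n=5):
    """Insert n evenly spaced points between each parent-child joint pair,
    turning the sparse skeleton into a dense one.

    joints:  (J, 3) array of 3D joint positions.
    parents: length-J list; parents[i] is the parent of joint i, -1 for root.
    """
    dense = [joints]
    for child, parent in enumerate(parents):
        if parent < 0:
            continue
        # Interpolation fractions strictly between 0 and 1.
        alphas = np.linspace(0.0, 1.0, n + 2)[1:-1, None]
        dense.append(joints[parent] + alphas * (joints[child] - joints[parent]))
    return np.concatenate(dense, axis=0)
```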

Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future Goals

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

In this paper, we strive to answer two questions: What is the current state of 3D hand pose estimation from depth images? And, what are the next challenges that need to be tackled? Following the successful Hands In the Million Challenge (HIM2017), we investigate the top 10 state-of-the-art methods on three tasks: single-frame 3D pose estimation, 3D hand tracking, and hand pose estimation during object interaction. We analyze the performance of different CNN structures with regard to hand shape, joint visibility, viewpoint and articulation distributions. Our findings include: (1) isolated 3D hand pose estimation achieves low mean errors (10 mm) in the viewpoint range of [70, 120] degrees, but it is far from being solved for extreme viewpoints; (2) 3D volumetric representations outperform 2D CNNs, better capturing the spatial structure of the depth data; (3) discriminative methods still generalize poorly to unseen hand shapes; (4) while joint occlusions pose a challenge for most methods, explicit modeling of structure constraints can significantly narrow the gap between errors on visible and occluded joints.

How to Refine 3D Hand Pose Estimation from Unlabelled Depth Data?

2017 International Conference on 3D Vision (3DV)

Data-driven approaches for hand pose estimation from depth images usually require a substantial amount of labelled training data, which is quite hard to obtain. In this work, we show how a simple convolutional neural network, pre-trained only on synthetic depth images generated from a single 3D hand model, can be trained to adapt to unlabelled depth images from a real user's hand. We validate our method both quantitatively and qualitatively on two existing datasets and a new dataset that we captured, demonstrating that we compare strongly to state-of-the-art methods. Additionally, this method can be seen as an extension to existing methods trained on limited datasets, which helps boost their performance on new ones.
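
The abstract does not state the adaptation mechanism, so the sketch below shows a generic pseudo-label self-training step as one plausible illustration, not the paper's method; the assumed (pose, confidence) network output and the confidence threshold are hypothetical.

```python
import torch

def self_training_step(model, optimizer, unlabeled_depth, conf_thresh=0.9):
    """One pseudo-labelling step for adapting a synthetically pre-trained
    pose regressor to unlabelled real depth images. Generic recipe shown
    for illustration only; assumes the model returns (pose, confidence).
    """
    model.eval()
    with torch.no_grad():
        pseudo, conf = model(unlabeled_depth)   # fixed pseudo-targets
    model.train()
    pred, _ = model(unlabeled_depth)
    mask = (conf > conf_thresh).float()         # keep only confident targets
    loss = (mask * (pred - pseudo).pow(2).sum(-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```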

3D Hand-Object Pose Estimation from Depth with Convolutional Neural Networks

2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), 2017

Estimating the 3D pose of a hand interacting with an object is a challenging task, harder than hand-only pose estimation, as the object can cause heavy occlusion of the hand. We present a two-stage discriminative approach using convolutional neural networks (CNNs). The first stage classifies and segments the object pixels from a depth image containing the hand and object. This processed image is used to aid the second stage in estimating the hand-object pose, as it contains information regarding the object location and object occlusion. To the best of our knowledge, this is the first attempt at discriminative one-shot hand-object pose estimation. We show that this approach outperforms the current state-of-the-art and that adding a segmentation stage to learned discriminative single-stage systems improves their performance.
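
A schematic of the two-stage design described above, where the stage-one segmentation mask is concatenated with the depth map before pose regression. The layer sizes and joint count are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoStageHandObjectNet(nn.Module):
    """Stage 1 predicts a per-pixel object mask from the depth image;
    stage 2 regresses joint positions from depth + mask."""

    def __init__(self, num_joints=21):
        super().__init__()
        self.num_joints = num_joints
        self.seg = nn.Sequential(                       # stage 1: object mask
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
        self.pose = nn.Sequential(                      # stage 2: pose regression
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_joints * 3))

    def forward(self, depth):                           # depth: (B, 1, H, W)
        mask = self.seg(depth)
        joints = self.pose(torch.cat([depth, mask], dim=1))
        return mask, joints.view(depth.size(0), self.num_joints, 3)
```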

Back To RGB: Deep articulated hand pose estimation from a single camera image

2017 International Conference on Image and Vision Computing New Zealand (IVCNZ), 2017

In this work, we demonstrate a method called the Deep Hand Pose Machine (DHPM) that effectively detects the anatomical joints of the human hand based on single RGB images. Current state-of-the-art methods are able to robustly infer hand poses from RGB-D images. However, the infrared depth sensors these methods rely on do not operate well under direct sunlight: performing hand tracking outdoors with depth sensors results in unreliable depth information and inaccurate poses. This motivated us to create a method that utilizes only an ordinary RGB image, without additional depth information. Our approach adapts the pose machine algorithm, which has been used in the past to detect human body joints. We train the pose machine on synthetic data to accurately predict the positions of the joints in a real hand image.

MM-Hand: 3D-Aware Multi-Modal Guided Hand Generative Network for 3D Hand Pose Synthesis

2020

Estimating the 3D hand pose from a monocular RGB image is important but challenging. One solution is to train on large-scale RGB hand images with accurate 3D hand keypoint annotations; however, this is too expensive in practice. Instead, we have developed a learning-based approach to synthesize realistic, diverse, and 3D-pose-preserving hand images under the guidance of 3D pose information. We propose a 3D-aware multi-modal guided hand generative network (MM-Hand), together with a novel geometry-based curriculum learning strategy. Our extensive experimental results demonstrate that the 3D-annotated images generated by MM-Hand qualitatively and quantitatively outperform existing options. Moreover, the augmented data can consistently improve the quantitative performance of state-of-the-art 3D hand pose estimators on two benchmark datasets. The code will be available at this https URL.

Hand pose estimation from depth data with Convolutional Neural Networks

2017

The estimation of hand position and orientation (pose) is of special interest in many applications related to Human Robot Interaction, such as human activity recognition, sign language interpretation, or as a human-computer interface in virtual reality systems, advanced entertainment games, gesture-driven interfaces, and in teleoperated or autonomous robotic systems. This project focuses on the problem of hand pose estimation from depth data using convolutional neural networks (CNNs). Recently, different CNN architectures have been proposed in order to find an efficient and reliable methodology for resolving the complexity involved in the variability of a hand's appearance, with its gestures, changes of orientation, occlusions and so on. The use of CNNs opens new opportunities for improvements in this research by providing the capability of learning from many samples. This work aims to advance a step further on the hand pose estimation problem. With this aim, the hand pose estimation using ...