Distribution-Aware Coordinate Representation for Human Pose Estimation
Related papers
Cornell University - arXiv, 2020
In this paper, we focus on the coordinate representation in human pose estimation. While being the standard choice, heatmap-based representation has not been systematically investigated. We find that the process of coordinate decoding (i.e. transforming the predicted heatmaps to coordinates) is surprisingly significant for human pose estimation performance, which nevertheless had not been recognised before. In light of this discovered importance, we further probe the design limitations of the standard coordinate decoding method and propose a principled distribution-aware decoding method. Meanwhile, we improve the standard coordinate encoding process (i.e. transforming ground-truth coordinates to heatmaps) by generating accurate heatmap distributions for unbiased model training. Taking them together, we formulate a novel Distribution-Aware coordinate Representation for Keypoint (DARK) method. Serving as a model-agnostic plug-in, DARK significantly improves the performance of a variety of state-of-the-art human pose estimation models. Extensive experiments show that DARK yields the best results on the COCO keypoint detection challenge, validating the usefulness and effectiveness of our novel coordinate representation idea. The project page containing more details is at https://ilovepose.github.io/coco/
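The decoding step the abstract highlights can be made concrete. Below is a minimal NumPy sketch of Taylor-expansion sub-pixel decoding in the spirit of DARK: refine the heatmap argmax with one Newton step on the log-heatmap. Function and variable names are ours, and the paper's preliminary Gaussian modulation of the heatmap is omitted; this is an illustration, not the authors' released code.

```python
import numpy as np

def dark_style_decode(heatmap, eps=1e-10):
    """Sub-pixel keypoint decoding: take the argmax m, then refine it with
    a second-order Taylor (Newton) step on the log-heatmap,
    mu = m - H^{-1} g, using central finite differences."""
    h = np.maximum(heatmap, eps)                     # avoid log(0)
    y, x = np.unravel_index(np.argmax(h), h.shape)
    if 1 <= x < h.shape[1] - 1 and 1 <= y < h.shape[0] - 1:
        L = np.log(h)
        g = np.array([0.5 * (L[y, x + 1] - L[y, x - 1]),     # dL/dx
                      0.5 * (L[y + 1, x] - L[y - 1, x])])    # dL/dy
        dxx = L[y, x + 1] - 2 * L[y, x] + L[y, x - 1]
        dyy = L[y + 1, x] - 2 * L[y, x] + L[y - 1, x]
        dxy = 0.25 * (L[y + 1, x + 1] - L[y + 1, x - 1]
                      - L[y - 1, x + 1] + L[y - 1, x - 1])
        H = np.array([[dxx, dxy], [dxy, dyy]])
        if abs(np.linalg.det(H)) > 1e-12:
            offset = -np.linalg.solve(H, g)
            if np.all(np.abs(offset) <= 1.0):        # reject unstable steps
                return np.array([x, y], dtype=float) + offset
    return np.array([x, y], dtype=float)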
3D Human Pose Estimation With 2D Marginal Heatmaps
2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019
Automatically determining three-dimensional human pose from monocular RGB image data is a challenging problem. The two-dimensional nature of the input results in intrinsic ambiguities which make inferring depth particularly difficult. Recently, researchers have demonstrated that the flexible statistical modelling capabilities of deep neural networks are sufficient to make such inferences with reasonable accuracy. However, many of these models use coordinate output techniques which are memory-intensive, not differentiable, and/or do not spatially generalise well. We propose improvements to 3D coordinate prediction which avoid the aforementioned undesirable traits by predicting 2D marginal heatmaps under an augmented soft-argmax scheme. Our resulting model, MargiPose, produces visually coherent heatmaps whilst maintaining differentiability. We are also able to achieve state-of-the-art accuracy on publicly available 3D human pose estimation data.
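As a rough illustration of the soft-argmax decoding this abstract builds on, here is a plain 2D soft-argmax in NumPy: normalise the heatmap with a softmax, marginalise each axis, and take the expected coordinate. MargiPose's augmented scheme and its per-axis marginal heatmaps add detail beyond this minimal sketch; unlike a hard argmax, the expectation below is differentiable, which is what permits end-to-end training.

```python
import numpy as np

def soft_argmax_2d(heatmap):
    """Differentiable coordinate decoding: softmax over the map,
    then the expected (x, y) under the resulting distribution."""
    p = np.exp(heatmap - heatmap.max())   # numerically stable softmax
    p /= p.sum()
    h, w = p.shape
    x = float((p.sum(axis=0) * np.arange(w)).sum())  # x-marginal expectation
    y = float((p.sum(axis=1) * np.arange(h)).sum())  # y-marginal expectation
    return x, y
```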
HEMlets Pose: Learning Part-Centric Heatmap Triplets for Accurate 3D Human Pose Estimation
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019
Estimating 3D human pose from a single image is a challenging task. This work attempts to address the uncertainty of lifting the detected 2D joints to the 3D space by introducing an intermediate state, Part-Centric Heatmap Triplets (HEMlets), which shortens the gap between the 2D observation and the 3D interpretation. The HEMlets utilize three joint-heatmaps to represent the relative depth information of the end-joints for each skeletal body part. In our approach, a Convolutional Network (ConvNet) is first trained to predict HEMlets from the input image, followed by a volumetric joint-heatmap regression. We leverage the integral operation to extract the joint locations from the volumetric heatmaps, guaranteeing end-to-end learning. Despite the simplicity of the network design, quantitative comparisons show a significant performance improvement over the best-of-grade method (about 20% on Human3.6M). The proposed method naturally supports training with "in-the-wild" images, where only weakly-annotated relative depth information of skeletal joints is available. This further improves the generalization ability of our model, as validated by qualitative comparisons on outdoor images.
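The integral operation mentioned for the volumetric joint-heatmaps can be sketched the same way in 3D. The snippet below is a generic integral-regression sketch, assuming the volume is ordered (depth, height, width); it is not the HEMlets reference implementation.

```python
import numpy as np

def integral_3d(volume):
    """Integral (soft-argmax) regression over a volumetric heatmap:
    normalise to a probability volume, then take the expected (x, y, z)."""
    p = np.exp(volume - volume.max())
    p /= p.sum()
    d, h, w = p.shape
    z = float((p.sum(axis=(1, 2)) * np.arange(d)).sum())
    y = float((p.sum(axis=(0, 2)) * np.arange(h)).sum())
    x = float((p.sum(axis=(0, 1)) * np.arange(w)).sum())
    return np.array([x, y, z])
```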
Multi-Scale Supervised Network for Human Pose Estimation
2018 25th IEEE International Conference on Image Processing (ICIP), 2018
Human pose estimation is an important topic in computer vision with many applications including gesture and activity recognition. However, pose estimation from images is challenging due to appearance variations, occlusions, cluttered backgrounds, and complex activities. To alleviate these problems, we develop a robust pose estimation method based on recent deep conv-deconv modules with two improvements: (1) multi-scale supervision of body keypoints, and (2) a global regression to improve the structural consistency of keypoints. We refine keypoint detection heatmaps using layer-wise multi-scale supervision to better capture local contexts. Pose inference via keypoint association is optimized globally using a regression network at the end. Our method can effectively disambiguate keypoint matches in close proximity, including mismatches of left and right body parts, and better infer occluded parts. Experimental results show that our method achieves competitive performance among state-of-the-art methods on the MPII and FLIC datasets.
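A minimal sketch of what layer-wise multi-scale supervision can look like: each intermediate prediction is penalised against the ground-truth heatmap pooled to its scale. The average-pooling choice and equal per-scale weighting are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def pool_to(hm, factor):
    """Average-pool a heatmap by an integer factor (divisibility assumed)."""
    h, w = hm.shape
    return hm.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def multi_scale_loss(preds, gt):
    """Sum of per-scale MSE losses between intermediate predictions
    (coarse to fine) and the ground truth resized to each scale."""
    total = 0.0
    for p in preds:
        factor = gt.shape[0] // p.shape[0]   # integer scale gap assumed
        total += np.mean((p - pool_to(gt, factor)) ** 2)
    return total
```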
Unveiling the Landscape of Human Pose Estimation
Research Square (Research Square), 2024
This paper presents a comprehensive survey and methodology for deep learning-based solutions in articulated human pose estimation (HPE). Recent advances in deep learning have revolutionized the HPE field, with capturing systems transitioning from multi-modal sensors to a regular color camera and from multiple views to a monocular view, opening up numerous applications. However, the increasing variety of deep network architectures has resulted in a vast literature on the topic, making it challenging to identify commonalities and differences among diverse HPE approaches. Therefore, this paper serves two objectives: firstly, it provides a thorough survey of over 100 research papers published since 2015, focusing on deep learning-based solutions for monocular HPE; secondly, it develops a comprehensive methodology that systematically combines existing works and summarizes a unified framework for the HPE problem and its modular components. Unlike previous surveys, this study places emphasis on methodology development in order to provide better insights and learning opportunities for researchers in the field of computer vision. The paper also summarizes and discusses the quantitative performance of the reviewed methods on popular datasets, while highlighting the challenges involved, such as occlusion and viewpoint variation. Finally, future research directions, such as incorporating temporal information and 3D pose estimation, along with potential solutions to address the remaining challenges in HPE, are presented.
AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild
Occlusion is probably the biggest challenge for human pose estimation in the wild. Typical solutions often rely on intrusive sensors such as IMUs to detect occluded joints. To make the task truly unconstrained, we present AdaFuse, an adaptive multiview fusion method which can enhance the features in occluded views by leveraging those in visible views. The core of AdaFuse is to determine the point-point correspondence between two views, which we solve effectively by exploiting the sparsity of the heatmap representation. We also learn an adaptive fusion weight for each camera view to reflect its feature quality, in order to reduce the chance that good features are undesirably corrupted by "bad" views. The fusion model is trained end-to-end with the pose estimation network and can be directly applied to new camera configurations without additional adaptation. We extensively evaluate the approach on three public datasets: Human3.6M, Total Capture and CMU Panoptic. It outperforms the state of the art on all of them. We also create a large-scale synthetic dataset, Occlusion-Person, which allows us to perform numerical evaluation on the occluded joints, as it provides occlusion labels for every joint in the images. The dataset and code are released at https://github.com/zhezh/adafuse-3d-human-pose.
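To make the fusion idea concrete, here is a deliberately simplified sketch: given per-view heatmaps for one joint that have already been warped into a common reference view (the epipolar-geometry correspondence AdaFuse solves is assumed done upstream) and a quality score per view, fuse them with softmax weights. The function names and the softmax choice are ours, not the released code.

```python
import numpy as np

def adaptive_fuse(heatmaps, quality):
    """Weighted fusion of per-view heatmaps for a single joint.
    heatmaps: (V, H, W) maps in a common reference view; quality: (V,)
    per-view scores. Higher-quality views dominate the fused map."""
    w = np.exp(quality - quality.max())
    w /= w.sum()                              # softmax over views
    return np.tensordot(w, heatmaps, axes=1)  # (H, W) fused heatmap
```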
Learnable Triangulation of Human Pose
We present two novel solutions for multi-view 3D human pose estimation based on new learnable triangulation methods that combine 3D information from multiple 2D views. The first (baseline) solution is a basic differentiable algebraic triangulation with the addition of confidence weights estimated from the input images. The second solution is based on a novel method of volumetric aggregation from intermediate 2D backbone feature maps. The aggregated volume is then refined via 3D convolutions that produce final 3D joint heatmaps and allow modelling a human pose prior. Crucially, both approaches are end-to-end differentiable, which allows us to directly optimize the target metric. We demonstrate transferability of the solutions across datasets and considerably improve the multi-view state of the art on the Human3.6M dataset. Video demonstration, annotations and additional materials will be posted on our project page.
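The baseline solution has a compact classical core: confidence-weighted algebraic (DLT) triangulation, which is differentiable in both the 2D points and the weights. Below is a sketch under standard pinhole-camera assumptions; the paper wraps this operation in a learned pipeline.

```python
import numpy as np

def weighted_dlt(proj_mats, points2d, conf):
    """Confidence-weighted DLT triangulation.
    proj_mats: (V, 3, 4) camera projection matrices; points2d: (V, 2)
    detected joint positions; conf: (V,) per-view confidence weights.
    Each view contributes two rows, scaled by its confidence; the 3D
    point is the right singular vector with the smallest singular value."""
    rows = []
    for P, (x, y), w in zip(proj_mats, points2d, conf):
        rows.append(w * (x * P[2] - P[0]))
        rows.append(w * (y * P[2] - P[1]))
    A = np.stack(rows)                  # (2V, 4) homogeneous system
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]                 # dehomogenise
```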
PanopTOP: a framework for generating viewpoint-invariant human pose estimation datasets
2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021
Human pose estimation (HPE) from RGB and depth images has recently experienced a push for viewpoint-invariant and scale-invariant pose retrieval methods. Current methods fail to generalize to unconventional viewpoints due to the lack of viewpoint-invariant data at training time. Existing datasets do not provide multiple-viewpoint observations and mostly focus on frontal views. In this work, we introduce PanopTOP, a fully automatic framework for the generation of semi-synthetic RGB and depth samples with 2D and 3D ground truth of pedestrian poses from multiple arbitrary viewpoints. Starting from the Panoptic
A Survey on Human Pose Estimation
IRJET, 2022
Human pose estimation (HPE) depicts the posture of an individual using semantic key points on the human body. In recent times, deep learning methods for HPE have come to dominate the traditional computer vision techniques that were extensively used in the past. HPE has a wide range of applications including virtual fitness trainers, surveillance, motion-sensing gaming consoles (Xbox Kinect), action recognition, tracking and many more. This survey intends to fill in the gaps left by previous surveys as well as provide an update on recent developments in the field. An introduction to HPE is given first, followed by a brief overview of previous surveys. We then look into various classifications of HPE (single pose, multiple poses, 2D, 3D, top-down, bottom-up, etc.) and datasets that are commonly used in this field. While both 2D and 3D HPE categories are mentioned in this survey, the main focus lies on pose estimation in 2D space. Moving on, various HPE approaches based on deep learning are presented, focusing largely on those optimised for inference on edge devices. Finally, we conclude with the challenges and obstacles faced in this field as well as some potential research opportunities.
FasterPose: A Faster Simple Baseline for Human Pose Estimation
ACM Transactions on Multimedia Computing, Communications, and Applications, 2022
The performance of human pose estimation depends on the spatial accuracy of keypoint localization. Most existing methods pursue spatial accuracy by learning a high-resolution (HR) representation from input images. Through experimental analysis, we find that the HR representation leads to a sharp increase in computational cost, while the accuracy improvement remains marginal compared with the low-resolution (LR) representation. In this article, we propose a design paradigm for a cost-effective network with an LR representation for efficient pose estimation, named FasterPose. Whereas the LR design largely shrinks the model complexity, how to effectively train the network with respect to spatial accuracy is a concomitant challenge. We study the training behavior of FasterPose and formulate a novel regressive cross-entropy (RCE) loss function for accelerating convergence and promoting accuracy. The RCE loss generalizes the ordinary cross-entropy loss from the binary sup...