Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining

Zhumei Wang1,2 Zechen Hu3,* Ruoxi Guo4 Huaijin Pi5 Ziyong Feng3
Liang Zhang6 ✉ Mingtao Pei1 Siyuan Huang2 ✉
1Beijing Institute of Technology 2State Key Laboratory of General Artificial Intelligence, BIGAI
3Deep Glint 4Zhejiang University 5The University of Hong Kong 6Shandong Agricultural University
* Equal contribution ✉ Corresponding authors Project page: https://wangzhumei.github.io/mocap-2-to-3/

Abstract

Human motion recovery for real-world interaction demands both precise action details and metric-scale trajectories. Recovering absolute human pose from monocular input presents a viable solution, but faces two main challenges: (1) models’ reliance on 3D training data from constrained environments limits their out-of-distribution generalization; and (2) the inherent difficulty of estimating metric-scale poses from monocular observations. This paper introduces Mocap-2-to-3, a novel framework that differs from prior HMR methods by recovering absolute poses from monocular input and leveraging abundant 2D data to enhance 3D motion recovery. To effectively utilize the action priors and diversity in large-scale 2D datasets, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: a single-view diffusion model is first pre-trained on extensive 2D data, followed by multi-view fine-tuning on 3D data, thus achieving a combination of strong priors and geometric constraints. Furthermore, to recover absolute poses, we introduce a novel human motion representation that decouples the learning of local pose and global movements, while encoding ground geometric priors to accelerate convergence, thereby yielding more precise positioning in the physical world. Experiments on in-the-wild benchmarks show that our method outperforms state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting strong generalization capability.

Figure 1: (a) Traditional framework for direct 3D motion regression. (b) Mocap-2-to-3: our multi-view lifting framework from monocular input which leverages 2D pretraining to enhance 3D motion capture. (c) The model outputs SMPL-format global motions with absolute position from monocular 2D pose input while maintaining out-of-distribution generalization capability. (d) Our model also supports outputs in the COCO-format keypoint.

1 Introduction

Markerless motion capture for downstream tasks involving interaction with the physical world requires motion recovery in absolute coordinates, a requirement that our work addresses by reconstructing absolute positions in world coordinates from monocular input. Such markerless mocap enables a wide range of applications, including gaming, sports analysis, multi-person interactions, and embodied intelligence. Compared to multi-view systems, monocular reconstruction uses less hardware, imposes fewer constraints, and is more practical for downstream tasks [28].

Current state-of-the-art methods [27, 26, 47, 31, 32, 30] heavily rely on precise 3D motion capture data [22, 1, 8, 23] for training, which is costly and requires specialized equipment and controlled environments, limiting accessibility for many research institutions. Moreover, the complex procedures hinder timely acquisition of out-of-distribution 3D data for downstream tasks. Since interaction with the physical world demands higher fidelity in motion details, models often require fine-tuning on domain-specific or homogeneous data to ensure accuracy. Unlike 3D data, 2D data is more accessible, easily obtained from internet videos with diverse real-world actions [24] or through annotated or estimated 2D skeletons in specific scenarios [47, 42].

While monocular motion estimation has shown strong performance in academic settings, prior methods [27, 26] recover only relative positions in the world coordinate system, typically via alignment with the ground truth's initial frame, which limits practical deployment. In contrast, absolute positioning supports broader applications that require environmental awareness and spatial reasoning to infer complex interactions, including human-human, human-object, and human-scene relations. Motivated by the need for accurate interaction with the physical world in markerless motion capture, we focus on recovering global motions with absolute depth from monocular inputs (Fig. 1(c)).

We propose Mocap-2-to-3, a diffusion-based model that leverages 2D data to enhance 3D motion capture. As shown in Fig. 1(a), unlike traditional 2D-to-3D lifting methods that directly regress 3D motion from monocular input, we draw inspiration from [24] and propose a multi-view lifting framework that synthesizes 3D motion from monocular input (Fig. 1(b)). This framework leverages the diversity of 2D data to overcome the poor generalization caused by training solely on limited 3D data. Our model is trained in two stages: (1) a single-view pretraining, which enriches the model’s priors through homologous 2D data, improving its generalization to novel scenarios and reducing motion errors in the physical world; and (2) a view-consistent multi-view fine-tuning, where a View Attention Layer is inserted to enforce multi-view consistency during the generation of other-view 2D poses from an input 2D sequence.

Within this framework, we aim to obtain metric-scale human poses rather than merely recovering aligned global trajectories. Given calibrated camera poses, we estimate novel-view global motion from a monocular input to synthesize 3D motion with absolute positioning. Direct per-view global motion generation is suboptimal, however, as loss dominance by position impedes learning of subtle variations. To address this, we introduce a novel motion representation that decouples local pose and global movement for independent learning, enabling the network to capture more fine-grained motion details. Nevertheless, monocular 2D-to-3D lifting remains inherently ill-posed, as depth (Z-axis) cannot be directly inferred from 2D inputs and requires additional geometric constraints or priors. As a result, learning view-consistent global movement converges slowly. To overcome this, we encode the camera pose into explicit ground-plane constraints, allowing the network to learn geometric priors through cross-attention and thereby accelerate convergence. Through this process, we obtain accurate absolute human poses in the physical world.

Our framework supports lifting any 2D pose format (e.g., SMPL [7], COCO [20], H36M [8]) to 3D by retraining on the desired format (Fig. 1(c)(d)). This work focuses on enhancing the 2D-to-3D lifting process. Errors caused by inaccurate 2D detectors whose outputs deviate significantly from ground truth are beyond our scope, as we do not process raw images or apply secondary corrections. By decoupling image input from 3D motion estimation, we use limited 3D data to synthesize large-scale virtual training samples, improving generalization.

Our main contributions are as follows: (1) We propose a multi-view lifting framework from monocular input that leverages 2D pretraining to learn strong motion priors and is fine-tuned on limited 3D data to enable view-consistent generation, effectively lifting 2D knowledge to enhance 3D motion reconstruction. (2) We propose a novel human motion representation that separates local motion from global movement, enabling accurate absolute pose recovery while preserving fine-grained motion details. (3) We compute the ground-plane equation from camera poses and encode it into the network, thereby explicitly constraining human positions in the physical space and accelerating the convergence of multi-view global trajectory learning. (4) Extensive experiments demonstrate the effectiveness and generalization of our method, outperforming prior approaches in both motion accuracy and global positioning, even without scene-specific data.

Table 1: Comparison with related methods. Unlike methods limited to canonical/root-aligned trajectories, Mocap-2-to-3 recovers metric-scale trajectories from monocular 2D input and can further leverage 2D data to enhance 3D results.

2 Related Work

2.1 Monocular human motion recovery

Early monocular methods such as HMR [9] pioneered end-to-end regression of SMPL [21] parameters directly from images. Subsequent approaches improved accuracy through optimization-in-the-loop refinement (SPIN [15]) and temporal modeling on videos (HMMR [10], VIBE [12], DSD [40]). Later methods enhanced robustness to occlusion (PARE [13]) and incorporated global camera cues (CLIFF [19]). To move beyond camera space, recent “world-grounded” methods decouple human and camera motion. WHAM [27] reconstructs temporally coherent motion trajectories by integrating video-based cues, while GVHMR [26] introduces a gravity-view coordinate system to stabilize long-term orientation. SLAHMR [43] and PACE [14] combine SLAM and learned motion priors for joint optimization of camera and human motion, albeit with heavy computational cost. ROMP [31] and PromptHMR [38] achieve efficient multi-person estimation, and HumanMM [46] leverages multi-camera data for large-scale human motion modeling.

While these approaches successfully recover global trajectories (often from video input), they generally lack metrically accurate absolute positioning in the physical world. By contrast, our goal is to retain high-fidelity action recovery and estimate absolute positions with metric scale, which is crucial for downstream tasks requiring precise interaction with the real world.

2.2 Monocular 3D absolute pose estimation

Several methods aim to recover absolute human poses in a world coordinate system from monocular input. SMPLify [2] jointly optimizes camera and body parameters, while Ray3D [44] projects 2D keypoints into 3D ray space under geometric constraints to achieve accurate localization. SA-HMR [25] infers absolute mesh positions from a single image by leveraging a pre-scanned scene to resolve scale ambiguities. In contrast, TRAM [37] and MetricHMR [45] integrate SLAM-based camera pose estimation for absolute motion recovery, but such estimated poses often introduce bias and accumulate drift, leading to unreliable positional accuracy for interaction.

Similar to [44, 25], which recover absolute poses using calibrated fixed cameras, our framework operates under comparable conditions. By leveraging calibrated camera poses, we minimize system bias and achieve more accurate metric-scale estimations, aligning with our objective of recovering physically grounded, metrically precise poses that support downstream interaction tasks. Unlike previous methods, we propose a novel multi-view 2D-to-3D lifting framework that further decouples coordinates to recover precise absolute poses and leverages 2D pretraining to learn richer motion priors, enhancing generalization.

2.3 2D-driven Motion Generation and Recovery

Recent studies have explored leveraging 2D data for 3D human motion generation and recovery. MotionBERT [47] and ElePose [35] lift 2D poses to 3D but often lack reliable global trajectories. Diffusion-based approaches, such as MDM [33], model temporal dynamics through iterative denoising, while MAS [11] extends 2D diffusion into a multi-view setting to enhance spatial consistency. Motion-2-to-3 [24] utilizes large-scale internet videos during pretraining to improve motion diversity and realism, a strategy that inspires our framework design. Similarly, MVLift [17] demonstrates that global motion can be recovered using 2D-only training; however, the motion quality still lags behind 3D-supervised methods due to the inherent advantages of 3D data—accurate absolute positioning, coordinated dynamics, and consistent skeletal proportions. Building upon these insights, our framework integrates structured 3D data with diverse 2D data, achieving both higher performance and stronger generalization.

In summary, unlike the aforementioned approaches, our method lifts monocular 2D poses to 3D motion, not only leveraging 2D data for pretraining to improve adaptability to out-of-distribution scenarios, but also recovering metrically accurate poses in the physical world. A detailed comparison is provided in Tab. 1.

3 Method

Figure 2: Pipeline overview. During training: (a) We first train an arbitrary single-view 2D Motion Diffusion Model. (b) Its weights are then used to initialize a Multi-view Diffusion Model, conditioned on 2D pose sequences from $\mathcal{V}_{0}$ and pointmaps. During inference, the Multi-view Model generates motions for other views. (c) We compute local poses and global movement to recover global coordinates for each view. (d) Multi-view triangulation is then used to synthesize 3D absolute poses, (e) resulting in full-body global human motion.

We propose Mocap-2-to-3, a multi-view lifting framework for markerless motion capture that lifts monocular 2D pose sequences to globally consistent 3D motion, as shown in Fig. 2. We first pre-train a single-view 2D Motion Diffusion Model using 2D data (Sec. 3.1), then fine-tune a Multi-view Diffusion Model with public 3D data for multi-view consistency (Sec. 3.2). To recover human absolute positions in the world coordinate system, we extend the learning of local poses by additionally estimating global trajectory and scale across multiple views (Sec. 3.3). For enhanced capture of global information, we encode the ground-plane equation into the network to accelerate convergence (Sec. 3.4). Finally, we describe the inference pipeline that lifts 2D pose inputs to 3D motion and recovers the absolute positions of the human body (Sec. 3.5). Our method retains the monocular input but leverages a multi-view diffusion model that captures cross-view motion priors, enabling more consistent and accurate 3D motion lifting.

3.1 2D motion pretraining

Existing methods for 3D motion estimation typically require 3D data as labels for training. We explore how to better generalize the model to out-of-distribution scenarios. To this end, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: single-view 2D pretraining and multi-view fine-tuning. In the first stage, we train a 2D motion generator for arbitrary camera viewpoints, termed the 2D Motion Diffusion Model. This stage establishes a strong motion prior by leveraging diverse 2D data, including real-world or publicly available videos during pretraining.

Following [33, 24], we employ a transformer-based [34] diffusion model [6] to implement the 2D Motion Diffusion model $\mathcal{D}_{2D}$. Diffusion-based architectures excel at modeling complex feature distributions and produce diverse yet coherent samples across viewpoints, offering clear advantages over deterministic regression backbones. The network takes random noise $\epsilon$ as input and outputs a 2D motion sequence $\mathcal{M}\in\mathbb{R}^{T\times J\times 2}$, where $T$ is the number of video frames and $J$ is the keypoint count. In this stage, the model learns geometric relations to generate 2D motions from arbitrary camera views. By learning single-view generation, the pre-trained model can leverage large-scale 2D data to acquire motion priors across diverse viewpoints, which in turn accelerates convergence during fine-tuning.
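
To make the tensor shapes concrete, the following is a minimal sketch of the forward noising step of a standard denoising diffusion model applied to a 2D motion tensor of shape $T\times J\times 2$. The linear noise schedule, the number of steps, the values of `T` and `J`, and all function names are illustrative assumptions; the text above does not specify them.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not taken from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(motion, step, alphas_cumprod, rng):
    """Draw a noised motion x_n ~ q(x_n | x_0) for a clean motion x_0 of shape (T, J, 2)."""
    noise = rng.standard_normal(motion.shape)
    a_bar = alphas_cumprod[step]
    noised = np.sqrt(a_bar) * motion + np.sqrt(1.0 - a_bar) * noise
    return noised, noise

def predict_x0(noised, noise, step, alphas_cumprod):
    """Invert q_sample when the noise is known (what a perfect denoiser would recover)."""
    a_bar = alphas_cumprod[step]
    return (noised - np.sqrt(1.0 - a_bar) * noise) / np.sqrt(a_bar)

rng = np.random.default_rng(0)
T, J = 16, 24  # hypothetical frame and keypoint counts
clean = rng.uniform(-1.0, 1.0, size=(T, J, 2))
alphas_cumprod = np.cumprod(1.0 - linear_beta_schedule(1000))
noised, noise = q_sample(clean, 500, alphas_cumprod, rng)
recovered = predict_x0(noised, noise, 500, alphas_cumprod)
```

In training, a network would be asked to predict either `noise` or `clean` from `noised` and the step index; which target is used here is not stated in the text.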

3.2 Multi-view fine-tuning

Even with the motion prior established, the standalone 2D Motion Diffusion model cannot ensure geometric consistency across views. Therefore, in the second stage, we fine-tune the 2D Motion Diffusion model with multi-view 2D supervision derived from 3D motion to enforce cross-view consistency for coherent 3D motion reconstruction. This stage further enables the model to learn canonical representations of human motion.

During fine-tuning, the number of viewpoints $V$ is set to 4, including a primary camera $\mathcal{V}_{0}$ (used for inference) and three virtual cameras whose poses are randomly sampled from the camera poses seen during the pretraining stage. Camera configuration details are provided in the supplementary material. To train the Multi-view Diffusion model $\mathcal{D}_{mv}$, we project 3D motion into each camera view to obtain geometrically consistent 2D motion ground truth. As no image pairs are required as input, we can apply random augmentations to existing 3D motion, including rotation, translation, and camera viewpoint augmentation (e.g., modifying pitch, yaw, roll, and distance). This enables large-scale virtual data generation from limited samples, enhancing model generalization.
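
The projection and augmentation described above can be sketched with a pinhole camera model. This is an illustration only: the z-up world frame, the look-at camera construction, the intrinsics, and the sampling ranges are assumptions, and the actual camera configuration is given in the supplementary material.

```python
import numpy as np

def look_at(cam_pos, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation R and translation t for a camera looking at `target`."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)          # image y axis points down
    R = np.stack([right, down, forward])     # rows are the camera axes in world coordinates
    return R, -R @ cam_pos

def project(points, K, R, t):
    """Project world points (..., 3) to pixel coordinates (..., 2)."""
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[..., :2] / uvw[..., 2:3]

def augment(motion, rng):
    """Random yaw about the vertical axis plus a random shift on the ground plane."""
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    shift = np.append(rng.uniform(-1.0, 1.0, size=2), 0.0)
    return motion @ Rz.T + shift

rng = np.random.default_rng(0)
T, J, V = 8, 24, 4  # V = 4 views as in the text; T and J are hypothetical
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
target = np.array([0.0, 0.0, 1.0])

# Stand-in for a 3D motion clip: random joints inside a person-sized box.
motion = rng.uniform([-0.5, -0.5, 0.0], [0.5, 0.5, 1.8], size=(T, J, 3))
motion = augment(motion, rng)

cameras = []
for _ in range(V):
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    cam_pos = np.array([5.0 * np.cos(azimuth), 5.0 * np.sin(azimuth), 1.5])
    cameras.append(look_at(cam_pos, target))

# Geometrically consistent 2D ground truth for every view: shape (V, T, J, 2).
views = np.stack([project(motion, K, R, t) for R, t in cameras])
```

Because every view is rendered from the same 3D clip, the resulting 2D sequences are consistent by construction, which is what allows a small 3D dataset to yield many training samples.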

The multi-view generation model [11] generates motions independently across views but lacks explicit consistency constraints. In our framework, we initialize $\mathcal{D}_{mv}$ with the pre-trained weights of $\mathcal{D}_{2D}$ and incorporate View Attention layers to enforce multi-view consistency. For motion capture, the model takes as input Gaussian noise $\epsilon$, the primary view's 2D motion embedding $\mathcal{M}_{0}\in\mathbb{R}^{T\times J\times 2}$, and camera embeddings (camera intrinsics $\mathcal{K}\in\mathbb{R}^{V\times 4}$ and extrinsics $\mathcal{RT}\in\mathbb{R}^{V\times 3}$). Geometrically consistent virtual-view 2D motions (see Sec. 3.3 for motion representation details) are generated from $\mathcal{M}_{0}$ and subsequently triangulated into 3D motions.

3.3 Decomposed Motion Representation

With the multi-view motion generation framework established, we turn to the other core challenge: recovering the absolute 3D position of motions in the world coordinate system. To support the reconstruction from 2D poses to absolute 3D poses, we represent each 2D motion as the projection of a 3D motion in global coordinates under a specific camera viewpoint (Fig. 3(a)). A straightforward approach would be to directly predict projected global coordinates from the given view; however, since position has a much greater influence on the loss than skeletal structure, such prediction makes the network focus on positional cues rather than motion, resulting in degraded motion detail reconstruction (Fig. 3(b)). The main challenge is to achieve globally consistent yet detail-preserving motion reconstruction.

To address this, we propose a novel human motion representation that decouples local pose and global movement, enabling independent optimization of action and trajectory. As shown in Fig. 3(c), the local pose $\mathcal{M}^{l}\in\mathbb{R}^{T\times(J-1)\times 2}$ without root position is obtained by cropping the 2D pose within bounding boxes, normalizing it to $[-1,1]$, and centering the root joint to remove root position influence. The $j$-th keypoint at frame $t$ is represented as $(x_{t,j},y_{t,j})$. The global movement $\mathcal{M}^{\tau}=[\tau,s]\in\mathbb{R}^{T\times 2\times 2}$ consists of the root trajectory $\tau\in\mathbb{R}^{T\times 2}$ and the motion scale $s\in\mathbb{R}^{T\times 2}$, corresponding to the pixel coordinates of the bounding box center and scale along the horizontal and vertical axes, respectively. Our multi-view model predicts $\mathcal{M}_{v}\in\mathbb{R}^{V\times T\times(J+1)\times 2}$, comprising (1) root-centered local poses $\mathcal{M}_{v}^{l}$ ($J-1$ entries along the keypoint axis) and (2) global movement $\mathcal{M}_{v}^{\tau}$ (2 entries along the keypoint axis).

During inference, given a monocular input, the model generates virtual-view outputs $\mathcal{M}_{v}^{l}$ and $\mathcal{M}_{v}^{\tau}$ for each additional viewpoint. The transformation from multi-view local to global coordinates $\mathcal{M}_{v}^{g}\in\mathbb{R}^{V\times T\times J\times 2}$ is then computed as follows:

$$
\begin{aligned}
\mathcal{M}_{v,\{1:J\}}^{g} &= \mathcal{M}_{v}^{l}\cdot s_{v}+\tau_{v}, \\
\mathcal{M}_{v}^{g} &= [\tau_{v},\ \mathcal{M}_{v,\{1:J\}}^{g}].
\end{aligned}
\tag{1}
$$

Here, $\mathcal{M}_{v}^{l}$ is used to compute the global coordinates of all joints except the root using $s_{v}$ and $\tau_{v}$, and the result is then concatenated with the root coordinate $\tau_{v}$. Finally, the multi-view $\mathcal{M}_{v}^{g}$ is used to reconstruct absolute 3D poses through camera parameters and triangulation [39].
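
The decomposition and its inverse (Eq. 1) can be sketched for a single view as follows. For the round trip to be exact, this sketch makes two simplifying assumptions that are not stated in the text: the root is stored at joint index 0, and the scale $s$ is taken as the per-axis maximum offset of any joint from the root, so that the local pose lies in $[-1,1]$. The function names are hypothetical.

```python
import numpy as np

def decompose(pose_2d, eps=1e-8):
    """Split a pixel-space pose (T, J, 2) into local pose, root trajectory, and scale.

    Assumes joint 0 is the root (an illustrative convention).
    """
    tau = pose_2d[:, 0]                          # (T, 2) root trajectory
    offsets = pose_2d[:, 1:] - tau[:, None]      # (T, J-1, 2) root-centered joints
    s = np.abs(offsets).max(axis=1) + eps        # (T, 2) per-axis scale
    local = offsets / s[:, None]                 # (T, J-1, 2), values in [-1, 1]
    return local, tau, s

def recompose(local, tau, s):
    """Eq. 1: scale and shift the local pose, then prepend the root coordinate."""
    joints = local * s[:, None] + tau[:, None]               # first line of Eq. 1
    return np.concatenate([tau[:, None], joints], axis=1)    # second line of Eq. 1

rng = np.random.default_rng(0)
T, J = 4, 24  # hypothetical sizes
pose = rng.uniform(0.0, 1000.0, size=(T, J, 2))

local, tau, s = decompose(pose)
movement = np.stack([tau, s], axis=1)                  # global movement, (T, 2, 2)
packed = np.concatenate([local, movement], axis=1)     # per-view target, (T, J + 1, 2)
restored = recompose(local, tau, s)
```

Keeping `local` in a fixed range regardless of where the person stands is what prevents large position values from dominating the loss.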

3.4 Ground Constraint Encoding

Figure 3: (a) 2D projection coordinates, (b) direct prediction results (failure case). (c) Our decoupled representation separating local pose and global movement.

Figure 4: (a) Pointmaps representing pixel-to-world coordinate $(u,v)\leftrightarrow(x_{w},y_{w},z_{w})$ mappings. (b) Multi-view pointmaps in world coordinate system.

In monocular-to-multi-view generation, pose learning is relatively easy because local poses are normalized across viewpoints, whereas global movement varies significantly in scale. Moreover, due to depth ambiguity in monocular settings, learning 2D motion locations for other views from a source view $\mathcal{V}_{0}$ converges slowly, even when camera embeddings are provided as input. Our idea is to leverage physical-world constraints to improve both localization accuracy and training efficiency. To this end, we design a geometric encoding scheme: we introduce explicit geometric constraints by using the known camera poses to compute ground planes, which are then represented as intuitive pointmaps [16, 36]. These constraints substantially accelerate the convergence of network training for position learning.

In our work, pointmaps $\mathcal{P}\in\mathbb{R}^{W\times H\times 3}$ represent the mapping from each pixel $(u,v)$ in an image $I$ of resolution $W\times H$ to its corresponding 3D point $(x_{w},y_{w},z_{w})$ in world coordinates, as shown in Fig. 4(a), i.e., $I_{u,v}\leftrightarrow P_{x_{w},y_{w},z_{w}}$. For any dataset with a world coordinate system grounded on the ground plane, this mapping can be directly computed given the camera intrinsics and extrinsics. Each point is the intersection of a ray from the camera center with the ground plane, forming a view-specific ground point cloud. The detailed computation is provided in the supplementary material. It is important to note that we only include the ground plane rather than full environmental point clouds, as pointmaps can be computed directly from camera poses without additional sensors or ground-truth scans. This avoids extra cost and facilitates real-world deployment. Representing the ground plane as pointmaps allows the network to learn more intuitively, providing a natural 2D-to-3D correspondence across views (Fig. 4(b)).
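
The ray-plane intersection behind a pointmap can be sketched as below. The exact computation is given in the supplementary material; this version assumes that the ground is the plane $z_{w}=0$, that extrinsics map world to camera coordinates as $x_{c}=Rx_{w}+t$, and that pixels whose rays never reach the ground are marked invalid. The array is stored row-major as $H\times W\times 3$, and the helper names are hypothetical.

```python
import numpy as np

def look_at(cam_pos, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation R and translation t for a camera looking at `target`."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])
    return R, -R @ cam_pos

def ground_pointmap(K, R, t, width, height):
    """Intersect every pixel ray with the assumed ground plane z_w = 0.

    Returns an (H, W, 3) array of world points and an (H, W) validity mask.
    """
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    rays = pixels @ np.linalg.inv(K).T @ R      # ray directions R^T K^{-1} [u, v, 1]^T
    center = -R.T @ t                           # camera center in world coordinates
    with np.errstate(divide="ignore", invalid="ignore"):
        depth = -center[2] / rays[..., 2]       # ray parameter at which z_w = 0
        points = center + depth[..., None] * rays
    valid = np.isfinite(depth) & (depth > 0)    # drop rays at or above the horizon
    points[~valid] = np.nan
    return points, valid

# Hypothetical low-resolution camera, 1.5 m above the ground, looking at the origin.
K = np.array([[60.0, 0.0, 32.0], [0.0, 60.0, 24.0], [0.0, 0.0, 1.0]])
R, t = look_at(np.array([0.0, -4.0, 1.5]), np.zeros(3))
pointmap, valid = ground_pointmap(K, R, t, width=64, height=48)
```

Only the camera parameters enter this computation, which is consistent with the point made above that no scene scan or extra sensor is needed.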

As shown in Fig. 2(b), pointmaps are incorporated as a conditioning input. They are first compressed into feature representations by a ResNet-18 [5] encoder and then integrated into $\mathcal{D}_{mv}$ through two attention layers: a View Attention Layer that learns cross-view correlations and a Cross Attention Layer that guides the generation of motion $\mathcal{M}_{v}$. Pointmaps accelerate the convergence of global movement learning and serve as a plug-and-play module for introducing explicit geometric constraints in any multi-view global estimation task.

3.5 Inference

During inference, the denoising process comprises $N$ steps. For each timestep $n$, the model $\mathcal{D}_{mv}$ takes $[\epsilon,\mathcal{M}_{0},\mathcal{K},\mathcal{RT},\mathcal{P}]$ as input and predicts the 2D motion sequence $\mathcal{M}_{v}^{n}$. For any viewpoint, $\mathcal{M}_{v}^{n}$ is transformed to $\mathcal{M}_{v}^{gn}$ using Eq. 1, and $\mathcal{M}_{v}^{gn}$ is triangulated [39] to obtain the 3D absolute pose $\mathcal{W}_{3d}^{n}\in\mathbb{R}^{T\times J\times 3}$ in the world coordinate system. To enforce multi-view consistency, we project $\mathcal{M}_{v}^{gn}$ into each view to recompute $\mathcal{M}_{v}^{ln}$ and $\mathcal{M}_{v}^{\tau n}$, then update $\mathcal{M}_{v}^{n-1}$ from the previous step. At the final timestep $N$, we obtain the 3D motion $\mathcal{W}_{3d}^{0}$ with global position. Following [24, 11], if SMPL [21] parameters are required on top of the reconstructed 3D poses, SMPLify [2] can be applied as a post-hoc fitting step to estimate the parameters from the recovered joints. Refer to the supplementary material for pseudocode.
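
The triangulation step can be illustrated with the linear Direct Linear Transform (DLT), a common choice for recovering a 3D point from its projections in several calibrated views. Whether [39] uses this exact formulation is not stated here, so the sketch below should be read as one plausible instance; the camera layout and all helper names are assumptions.

```python
import numpy as np

def look_at(cam_pos, target, up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation R and translation t for a camera looking at `target`."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    R = np.stack([right, down, forward])
    return R, -R @ cam_pos

def project(points, P):
    """Project world points (N, 3) with a 3x4 projection matrix to pixels (N, 2)."""
    homog = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    uvw = homog @ P.T
    return uvw[:, :2] / uvw[:, 2:]

def triangulate(points_2d, proj_mats):
    """Linear DLT: points_2d is (V, N, 2), proj_mats is (V, 3, 4); returns (N, 3)."""
    u = points_2d[..., 0:1]
    v = points_2d[..., 1:2]
    P0, P1, P2 = (proj_mats[:, None, i] for i in range(3))    # each (V, 1, 4)
    A = np.concatenate([u * P2 - P0, v * P2 - P1], axis=0)    # (2V, N, 4)
    A = A.transpose(1, 0, 2)                                  # (N, 2V, 4)
    _, _, vt = np.linalg.svd(A)
    X = vt[:, -1]                      # right singular vector of the smallest singular value
    return X[:, :3] / X[:, 3:]

rng = np.random.default_rng(0)
T, J, V = 6, 24, 4  # V = 4 views as in the text; T and J are hypothetical
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
joints = rng.uniform([-0.5, -0.5, 0.0], [0.5, 0.5, 1.8], size=(T * J, 3))

proj_mats = []
for azimuth in np.linspace(0.0, 2.0 * np.pi, V, endpoint=False):
    cam_pos = np.array([5.0 * np.cos(azimuth), 5.0 * np.sin(azimuth), 1.5])
    R, t = look_at(cam_pos, np.array([0.0, 0.0, 1.0]))
    proj_mats.append(K @ np.hstack([R, t[:, None]]))
proj_mats = np.stack(proj_mats)

views = np.stack([project(joints, P) for P in proj_mats])     # (V, T*J, 2)
restored = triangulate(views, proj_mats).reshape(T, J, 3)
```

With noise-free, consistent 2D inputs the reconstruction is exact up to numerical precision; with generated views that disagree slightly, the same solve returns a least-squares compromise in the algebraic sense.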

4 Experiment

Table 2: Quantitative results on RICH in: (1) Root-aligned in Camera Coordinates, (2) World Coordinates with initial-frame alignment, (3) World Coordinates without any alignment. The symbols ∗, ‡, and † denote the inclusion of images, scene scans, and calibrated camera poses as inputs, respectively. The best and second-best results are highlighted in green and yellow.

4.1 Datasets and Metrics

Training datasets. To pre-train the 2D diffusion model, we use two types of data: (1) projected 2D joints from HumanML3D [4], training each batch on a single random view; and (2) 2D data from the same source as the test set (e.g., the RICH training set). We then fine-tune the multi-view diffusion model on HumanML3D [4], BEDLAM [1], and Human3.6M [8], where HumanML3D includes HumanAct12 [3] and AMASS [22].

Evaluation datasets. Following [26, 17], we evaluate our model on RICH [7] and AIST++ [18], two real-world datasets covering both outdoor and indoor scenes. They include actions like sitting, lying down, and handstands, which are less represented in the training set and offer a more comprehensive test of the model's generalization.

Metrics. We follow the evaluation protocol of [27, 26] and use standard metrics. In the camera coordinate system, we compute per-frame root-aligned Mean Per-Joint Position Error (MPJPE) and Procrustes-Aligned MPJPE (PA-MPJPE) to evaluate pose accuracy. For world coordinates, we use W-MPJPE (aligned to the first two frames) and WA-MPJPE (with full-sequence alignment) to assess global trajectories. Since our method predicts absolute world positions, we also compute Abs-MPJPE (without any alignment). All position errors are reported in millimeters (mm). Additionally, we evaluate root translation error ($T_{root}$), motion smoothness (Accel/Jitter), and foot sliding (FS).
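
As a reference for how the pose metrics differ, the sketch below computes MPJPE without alignment (the Abs-MPJPE setting), with root alignment, and with per-frame Procrustes alignment. It uses the usual closed-form similarity alignment via SVD; the root index and function names are assumptions, and the evaluation code of [27, 26] may differ in details such as units or joint subsets.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint; with no alignment this is the Abs-MPJPE setting."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def root_aligned_mpjpe(pred, gt, root=0):
    """MPJPE after subtracting each frame's root joint (assumed at index `root`)."""
    return mpjpe(pred - pred[:, root:root + 1], gt - gt[:, root:root + 1])

def procrustes_align(pred, gt):
    """Best similarity transform (scale, rotation, translation) of one frame onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE, aligning every frame independently."""
    aligned = np.stack([procrustes_align(p, g) for p, g in zip(pred, gt)])
    return mpjpe(aligned, gt)

rng = np.random.default_rng(0)
gt = rng.normal(size=(5, 17, 3))               # hypothetical (T, J, 3) ground truth
c, s = np.cos(0.7), np.sin(0.7)
rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
shift = np.array([0.2, -0.1, 2.0])

pred = 1.3 * gt @ rot.T + shift                # gt under a known similarity transform
abs_err = mpjpe(pred, gt)
root_err = root_aligned_mpjpe(gt + shift, gt)
pa_err = pa_mpjpe(pred, gt)
```

In this synthetic case the prediction differs from the ground truth only by a similarity transform, so the Procrustes-aligned error vanishes while the unaligned error does not, which is why Abs-MPJPE is the stricter test of metric-scale positioning.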

Figure 5: Qualitative comparison on RICH. Global motions are compared after first-frame alignment. Our method generates more realistic OOD motions with accurate body orientation and positioning, while red circles mark unnatural baseline poses.

Figure 6: Qualitative comparison on RICH. Unaligned absolute pose comparison in shared world coordinates. Unlike baseline methods that exhibit positional drift, our solution maintains accurate localization without requiring additional equipment.

4.2 Lifting SMPL keypoints with ground truth

Our method performs 2D-to-3D lifting and can process different keypoint formats. In this section, we analyze the most widely-used SMPL [21] format. Existing 3D motion prediction methods typically involve two stages: (1) detecting 2D keypoints [42, 29], and (2) predicting 3D poses from 2D keypoints [47, 41]. We focus on the second stage: estimating 3D motion from 2D keypoints, and use ground-truth 2D keypoints for intuitive performance comparison, replacing all baselines’ 2D keypoint inputs with ground truth for fairer comparison.

We compare our method against five categories of baselines: (1) optimization-based methods represented by SMPLify [2]; (2) 2D-to-3D lifting models exemplified by MotionBERT [47]; (3) environment-aware models that predict absolute world-coordinate poses, such as SA-HMR [25]; and (4) state-of-the-art 3D data-dependent methods, including WHAM [27], GVHMR [26], and TRAM [37]. (5) To ensure a fair comparison under identical input settings, we also evaluate two strong hybrid baselines: WHAM+SMPLify [27, 2] and GVHMR+SMPLify [26, 2]. For these hybrid baselines, we initialize SMPLify [2] with the outputs of WHAM [27] or GVHMR [26], and perform optimization using ground-truth 2D keypoints and calibrated camera poses. This integration of state-of-the-art learning-based methods with optimization-based refinement represents the current best-performing paradigm for accurate 3D human motion recovery.

The quantitative results on RICH [7] are shown in Tab. 2. The symbols ∗, ‡, and † denote the inclusion of images, scene scans, and calibrated camera poses as inputs, respectively. In the camera coordinate system, PA-MPJPE and MPJPE assess motion quality after root alignment, removing positional effects. Compared to GVHMR+SMPLify [26, 2], our method reduces PA-MPJPE by 4.5 mm, demonstrating stronger expressive power in reconstructing motion details. In world coordinate evaluation with temporal alignment, our method achieves superior global trajectory estimation. Furthermore, when compared with methods that also take calibrated camera poses as input (marked with †), our approach achieves lower errors in Abs-MPJPE, indicating more accurate estimation of absolute positions. Notably, this is achieved without requiring scanned environmental parameters as in SA-HMR [25], making our spatial constraints both more practical and easier to deploy in real-world settings. Additionally, our results show smoother motion. Our foot-sliding error is slightly higher than that of GVHMR [26] because, unlike GVHMR [26], we do not apply foot-sliding reduction as a post-processing optimization; we plan to add this in future work.

Qualitative comparisons in Fig. 5 show global trajectories after first-frame alignment. For out-of-distribution actions like squatting and bending, our method generates more realistic poses. Regression-based methods, limited by their 3D training data, often fail to recover reasonable poses when initial estimates are poor, despite 2D keypoint optimization. In contrast, our pretraining effectively learns motion priors that improve generalization to unseen actions. Fig. 6 compares absolute poses in world coordinates (without alignment) between our method and SA-HMR [25] in a shared coordinate system, highlighting global positioning differences. SA-HMR [25] shows notable errors in global position and body scale, while our results align more closely with ground truth.

4.3 Lifting COCO keypoints with detector

To demonstrate the effectiveness of our method across keypoint formats, we trained a COCO version of the lifting model and present results using the output of the 2D detector ViTPose [42] as input. We selected AIST++ [18] as the test set, which provides 3D ground truth in COCO format. Following [17], we compare our method with baselines including ElePose [35], MAS [11], SMPLify [2], MotionBERT [47], WHAM [27], GVHMR [26], and MVLift [17].

Table 3: Quantitative results on AIST++. Symbols ∗ or † indicate the use of images or calibrated camera poses as inputs.

Figure 7: Qualitative comparison on AIST++. Our method generalizes well to COCO-format skeletons as well.

Quantitative results are presented in Tab. 3. Both MVLift [17] and our method are 2D-to-3D reconstruction approaches. Our method outperforms MVLift [17] and GVHMR+SMPLify [26, 2] in both motion accuracy evaluation (PA-MPJPE) and global trajectory assessment ($T_{root}$). Qualitatively, as demonstrated in Fig. 7, our approach maintains robust performance even for challenging dance motions, with neither global trajectory nor foot-ground contact exhibiting unrealistic deviations.

4.4 Ablation study

We conduct ablation studies of Mocap-2-to-3 on the RICH [7] dataset, with results in Tab. 4. Comparing rows 1–2 shows that decoupling local pose and global movement representations significantly improves action recovery and trajectory estimation. Rows 2–4 show that without pointmaps, convergence is slower at the same epoch. Pointmaps are not essential for model learning: training without them to 8k epochs yields comparable performance, which shows that pointmaps reduce training time by over 50%. The final rows show that adding just 175 in-domain RICH 2D sequences during pretraining substantially improves motion quality, as reflected in PA-MPJPE and MPJPE. Even without such data, our method outperforms GVHMR+SMPLify [26, 2], confirming the robustness of our lifting framework. These findings suggest that simply using 2D data can effectively enhance 3D motion estimation performance, opening new possibilities for improving the generalization capability of human 3D motion recovery models.

Table 4: Ablation study on RICH: Pointmaps boost convergence; 2D pretraining increases motion accuracy.

5 Limitation and Future work

While our approach demonstrates competitive performance, certain limitations remain. Our framework is capable of lifting arbitrary-format 2D skeletons to high-quality 3D motion. However, we observe that inaccurate 2D skeletons obtained from raw videos can degrade the quality of the reconstructed 3D motion. Importantly, this is not a limitation of our framework itself: when provided with more reliable 2D poses (e.g., manually annotated SMPL-format poses or ViTPose [42]-generated COCO-format poses), our method works effectively and maintains strong performance. In future work, we aim to further improve the 2D SMPL-format motion prediction component and explore incorporating detection confidence during training to enhance robustness. We also plan to integrate additional geometric constraints, such as foot-sliding reduction, to further enhance model fidelity. Furthermore, we are exploring interaction tasks built upon this work, targeting embodied and gaming scenarios, aiming to further validate the method’s potential in complex real-world applications.

6 Conclusion

We present Mocap-2-to-3, a novel framework that leverages accessible 2D data to enhance 3D estimation and recover metrically accurate poses from monocular input. Overall, our approach bridges the gap between conventional human motion recovery and markerless motion capture, enabling accurate and physically consistent motion estimation in real-world scenes, and offering a promising direction for future research in this domain.

Acknowledgments. This work is supported in part by the Opening Project of the State Key Laboratory of General Artificial Intelligence, BIGAI/Peking University, Beijing, China (Project No. SKLAGI2025OP17), and Deep Glint.

References