RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction

Corresponding author: zheyuanh@andrew.cmu.edu. ⋆Co-advising.

Robyn Wu Carnegie Mellon University Naveen Enock Carnegie Mellon University Jasmine Li Carnegie Mellon University Riya Kadakia Carnegie Mellon University Zackory Erickson⋆ Carnegie Mellon University Aviral Kumar⋆ Carnegie Mellon University

Abstract

Modern paradigms for robot imitation train expressive policy architectures on large amounts of human demonstration data. Yet performance on contact-rich, deformable-object, and long-horizon tasks plateaus far below perfect execution, even with thousands of expert demonstrations. This is due to the inefficiency of existing “expert” data collection procedures based on human teleoperation. To address this issue, we introduce RaC, a new phase of training on human-in-the-loop rollouts after imitation learning pre-training. In RaC, we fine-tune a robotic policy on human intervention trajectories that illustrate recovery and correction behaviors. Specifically, during a policy rollout, human operators intervene when failure appears imminent, first rewinding the robot back to a familiar, in-distribution state and then providing a corrective segment that completes the current sub-task. Training on this data composition expands the robotic skill repertoire to include retry and adaptation behaviors, which we show are crucial for boosting both efficiency and robustness on long-horizon tasks. Across three real-world bimanual control tasks (shirt hanging, airtight container lid sealing, and takeout box packing) and a simulated assembly task, RaC outperforms the prior state-of-the-art using 10× less data collection time and samples. We also show that RaC enables test-time scaling: the performance of the trained RaC policy scales linearly in the number of recovery maneuvers it exhibits. Videos of the learned policy are available at https://rac-scaling-robot.github.io/.

1 Introduction

Running imitation learning with expressive models on human teleoperation data powers a large chunk of modern robotic learning. In fact, a number of recent academic and industrial bets have been on massively scaling up imitation learning as a form of pre-training for robots [2, 42, 6, 34, 4, 5, 41, 50]. However, results increasingly suggest that this paradigm is approaching a performance ceiling well below perfect task completion. For example, even with over 5000 human demonstrations, state-of-the-art task-specific models can only place a single t-shirt on a hanger with bimanual manipulators at roughly 75% success. While one might hope that more data or alternative learning frameworks could close this gap, in practice these methods still struggle to overcome compounding errors and stochasticity in long-horizon tasks.

We argue that this limitation of imitation is fundamental: while mimicking expert actions can imbue the policy with “basic” useful skills, doing so is inherently suboptimal when the robot faces task variations or new initial states, the environment is stochastic or noisy, or the task is inherently long-horizon, where failing at one stage inhibits success in the rest (i.e., when “compounding errors” can be catastrophic) [15]. As a result, policies trained via imitation learning often fail to generalize to real-world stochasticity and dynamism, and exhibit diminishing returns with additional data, leading to a performance plateau. Crucially, this failure stems not from the learning algorithm or the model but from the data distribution itself: demonstrations are biased toward clean, successful trajectories, but do not imbue the policy with behaviors needed to tackle compounding errors stemming from stochasticity in long-horizon tasks.

In this work, we propose an alternative paradigm for training robot policies that directly addresses the limitations of success-only imitation learning. We introduce a new phase of learning that is run subsequent to basic imitation learning on clean teleoperation data (“pre-training”), which we refer to as RaC. The central idea of RaC is to train on trajectories that interleave successful task executions with segments that demonstrate recovery, retries, and adaptation: behaviors that are essential for robustness in complex or novel situations. While standard human teleoperation data may already contain some incidental recovery behavior (for instance, in a study of the DROID dataset, we find that only 3.68% of the episodes contain recovery behavior), RaC explicitly encourages and amplifies such behaviors. Conceptually, this phase is analogous to “mid-training” for large language model (LLM) reasoning [45], which aims to illustrate how to best combine basic knowledge with algorithmic behavior (e.g., backtracking, trial-and-error, self-verification) to solve complex reasoning problems by producing much longer responses.

Figure 1: Illustrating RaC. Our approach enables imitation learning policies to robustly execute long-horizon tasks by explicitly learning skills such as recovery and correction to handle mistakes and failures. Doing so substantially improves data efficiency and results in effective performance scaling at test time by executing rollouts with more recovery maneuvers.

Concretely, we introduce a lightweight human-in-the-loop data collection protocol: human teleoperators intervene to take control from the running policy as soon as it begins to deviate from the expected course. As shown in Figure 3, these interventions naturally fall into two categories: a) error correction segments, where human experts guide the robot toward solving the task (similar in spirit to DAgger-style supervision), and b) recovery segments, where the human rewinds or repositions the robot to a previously successful state. To scale up recovery and correction for imitation learning, RaC standardizes interventions with two rules. Rule 1 (recover then correct) structures every human takeover into a reset back to in-distribution states followed by a corrective segment that completes the current sub-task. Rule 2 (termination after intervention) ends the episode immediately once the intervention segment finishes, which avoids collecting data on later sub-tasks under state distributions induced by a mixture of the learned policy and the human expert. Crucially, RaC keeps the imitation objective unchanged; performance gains come purely from improved data composition. Applied to three challenging, long-horizon real-world bimanual control tasks (shirt hanging, airtight-lid sealing, and clamshell takeout-box packing), RaC achieves higher success rates and steeper scaling trends than both batched full-demonstration and HG-DAgger-style human-in-the-loop data collection, improving data efficiency by up to an order of magnitude. We further show that, analogous to long chain-of-thought (CoT) reasoning in language models [16, 35], policies trained with RaC exhibit test-time scaling (Figure 10): as the deployed policy executes more recovery maneuvers at test time, its overall task success rate improves because it can try multiple times.

Contributions. We introduce RaC, a framework for scaling imitation learning in long-horizon manipulation by leveraging recovery and correction. Real-robot experiments show that conventional data pipelines lack the ingredients to learn diverse skills needed to handle out-of-distribution states, which arise frequently due to compounding errors. Our approach delivers test-time scaling benefits akin to “o1-style” LLMs, absent in prior work. On three challenging real-world bimanual tasks, RaC achieves higher success rates and steeper scaling trends than full demonstrations or HG-DAgger-style interventions.

2 Related Work

Scaling data in robotic learning. Recent work shows that scaling real-robot data across tasks, embodiments, and environments enables generalization. Large robotic datasets [23, 43, 11, 6], paired with highly expressive neural network architectures [4, 5, 34, 24, 2, 30, 42], have produced generalist policies that achieve strong performance on many atomic skills (e.g., grasping an object, folding cloth). In parallel, another line of work [50, 10] demonstrates that a similar data-driven recipe can also produce specialist policies that perform very well on substantially more complex dexterous bimanual tasks. However, these approaches require collecting thousands of high-quality expert demonstrations per skill [50].

Scaling studies in robot imitation learning. Inspired by work in LLMs [20, 18], several works aim to build scaling laws for robotic imitation [48, 14, 27]. Aimed at evaluating generalization across variations in the task, some of these works analyze the performance of policies on short-horizon tasks as a function of the environmental diversity present in the training data. However, in all such studies, the demonstrations themselves are collected via human “expert” teleoperation and exhibit little variation within the sorts of skills shown in the data. In contrast, instead of environment diversity, we focus on the data collection strategy within a trajectory for long-horizon tasks: specifically, the kinds of maneuvers, recovery behaviors, and variations within. As we show in our experiments, carefully designing a trajectory-level collection strategy can improve efficiency by more than 10× compared to previous work [50] with similar tasks.

Human-in-the-loop imitation learning. Our approach collects intervention data by emphasizing recovery and correction behaviors, which connects it to the broad literature on human-in-the-loop imitation learning. Classical approaches are rooted in DAgger [37], which alternates between 1) running on-policy rollouts from the learner, 2) querying the expert on visited states, and 3) retraining on the aggregated dataset. This framework assumes access to a high-quality expert policy. To adapt DAgger to human operators, HG-DAgger [22] enables teleoperators to provide interventions when the policy visits undesirable states, while more recent systems such as RoboCopilot [46] extend these ideas to bimanual mobile manipulation by developing improved interfaces for teleoperation and intervention. Other works [33, 29] explore objectives that combine on-policy rollouts, intervention data, and full human demonstrations. Although our learning objective bears similarities to HG-DAgger [22], we depart from its formulation in a crucial way: prior works largely treat human intervention as an optimal expert solution to be imitated, whereas we show that collecting recovery segments, which by themselves are not task-optimal (and may even undo progress on a subtask), yields substantially better scaling. This challenges the conventional wisdom that only “expert” interventions are useful and highlights the role of trajectory-level data collection.

Shared autonomy. Effectively collecting intervention data requires responsive and intuitive teleoperation interfaces. Prior human-in-the-loop systems have typically relied on a 6-DoF SpaceMouse [29, 31, 32] or smartphone software with on-screen buttons and IMU sensing [33]. While functional, these devices come with steep learning curves [25] and are difficult to use for dexterous skills, particularly those requiring wrist rotation. As a result, they are mostly limited to single-arm settings or relatively simple manipulation tasks where end-effector poses remain constrained. More recent work [46] has explored combining VR joysticks with exoskeleton hardware to provide force feedback and richer intervention options, but this demands specialized equipment and additional cost. In contrast, we adopt widely available off-the-shelf VR joysticks as our teleoperation and intervention interface. With a lightweight software modification that we describe in Section 4.3, our design enables users to take over control and provide interventions instantly, without the need to align the VR joystick poses with the robot end-effector poses.

Recovery and correction in imitation learning. Several works also study employing recoveries and corrections for training via imitation learning. Wang et al. [44] propose a “rewind-and-refine” data collection system that detects failures, returns the robot to a previous pose by replaying the trajectory, and then lets the teleoperator collect corrective trajectories. Similarly, [1, 19] study generating and filtering recovery trajectories automatically in simulation to augment dataset coverage. However, these works are restricted to pure simulation tasks or limited sim2real settings. Sun and Song [40] train a base diffusion policy on expert data and a learned latent dynamics model that performs test-time steering, encouraging the policy to stay on the expert demonstration manifold. Ke et al. [21] learn a locally Lipschitz dynamics model from expert demonstrations and synthesize corrective labels near the demo manifold to mitigate compounding errors. Xu et al. [47] combine a compliant intervention interface for providing corrections with a learned residual policy to improve performance on contact-rich tasks. Instead of engineering the return to in-distribution states through a rewind mechanism or modifications to the base imitation learning policy, our approach RaC treats recovery as yet another “skill” to learn from human demonstrations and scales it explicitly alongside full demonstration and correction skills. Hence, without modifying existing imitation learning objectives or adding complexity to the robot system, we improve the robustness and performance of the policy by directly scaling human demonstration data. Brandfonbrener et al. [3] propose a somewhat similar data collection protocol to RaC, in which operators deliberately collect sequences of visually similar failures, recoveries, and successes by backtracking to earlier visual states. However, Brandfonbrener et al. [3] study the benefit of such a data collection strategy through the lens of offline reinforcement learning, enabling efficient learning of accurate value functions from small datasets. We instead focus on the scaling properties of such recovery skills in dataset composition and their impact on the imitation learning policy.

3 Background and Robot Setup

Figure 2: Bimanual manipulation robot system. An illustration of our bimanual robot setup showing camera placements and workspace setup.

Robot setup. Our robot setup (Figure 2) consists of two 7-DoF xArm-7 manipulators with a scaled-down version of soft grippers [10, 51] to facilitate contact-rich and dexterous tasks. To obtain reactive control, a central server synchronizes and publishes RGB image streams from a top-view camera and two wrist cameras, robot state, and action commands at 60 Hz. Our system utilizes RMPFlow [8] as the inverse-kinematic motion generator, enabling real-time collision avoidance and smooth arm motions.

From a purely machine learning standpoint, our work is situated in the setting of iterative imitation learning with evolving robotic datasets. Each trajectory $\tau$ in this dataset consists of an action $a_t$ for every observation $s_t$. In this paper, we develop an approach to collect data for imitation learning that results in better scaling by incorporating human interventions on a previous generation of the learned policy. Formally, our goal is to develop an iterative human data collection strategy that improves scaling of task performance as a function of data-collection budget. In other words, we aim to improve the scaling behavior, i.e., the slope of task success rate vs. data size. To study data compositions, our data consists of three types: (i) full, successful expert demonstrations $\mathcal{D}^{\mathrm{full}}$; (ii) recovery segments that begin in failure or out-of-distribution regions and return to in-distribution regions; and (iii) correction segments that directly complete the current subtask.

Data collection protocol. Our data collection begins with collecting one round of full demonstration data using an initial budget size $R_0$, in terms of hours or the number of frames/timesteps. We then train an initial policy $\pi_0$ using this “Round 0” full-demonstration data and evaluate its performance. Round 0 data could also come from an off-the-shelf imitation learning dataset already available in the community. We can then scale data by following two protocols. In the batched data collection protocol, we allocate an additional budget of $K\times R_0$ frames, yielding a single batch of expert data of size $(K+1)\times R_0$. In the iterative human intervention protocol [22, 46, 33, 29], experts instead perform $K$ alternating rounds of intervention and training: in each round $k$, they provide interventions during rollouts of $\pi_{k-1}$, aggregate these intervention segments with existing data (in different ways), and retrain a new policy $\pi_k$. We will study the nature of interventions that improve data scaling of imitation learning the most.

Figure 3: Illustrating the core concept behind RaC. Data collected via human interventions prescribed by RaC and a sample policy rollout when training on only correction data (“HG-DAgger”) vs. recovery and correction data (RaC). Typical intervention approaches (left) simply collect correction data that pushes the task forward from out-of-distribution states. In contrast, RaC (right) collects a recovery segment that places the robot into a prior familiar state followed by a correction segment that pushes the task forward from this state. This “densifies” coverage over familiar states and teaches the robot to also recover to a broader region of initial states. Since learning to recover is easier than pushing the sub-task forward, RaC exploits a form of “verification-generation gap” [38]: it allows the policy to succeed more by recovering multiple times.

4 RaC: Scaling Recovery and Correction for Imitation Learning

Our goal is to design an iterative data collection strategy for scaling imitation learning. Unlike prior iterative approaches that collect corrective segments [22, 46, 33, 29], our approach deliberately guides human interventions to include a substantial proportion of “recovery” segments alongside “corrective” segments. While recovery segments are suboptimal for completing any sub-task within the long-horizon task, they bring the policy back into an in-distribution state preemptively, giving it a chance to re-attempt the task (Figure 3). In contrast, corrective segments illustrate how to complete the task. Our main insight is that the ability to retry multiple times gives the policy a generic recipe to attenuate the compounding errors that often bottleneck imitation, by trading off acting longer for lower error. We formalize this notion of recovery and correction and develop an approach to collect data naturally rich in these behaviors.

4.1 Understanding the Role of Recovery and Correction Segments in Imitation

Consider a robot policy $\pi$ that executes a trajectory $\tau=(s_0,a_0,s_1,a_1,\ldots,s_t)$, where $s_t$ denotes the state at which a human expert intervenes. A sequence of human actions $(a^{\text{h}}_{t+1},a^{\text{h}}_{t+2},\ldots,a^{\text{h}}_{t+k})$ starting from $s_t$ constitutes a recovery segment if the resulting state $s^{\text{h}}_{t+k}$ that the robot reaches after the intervention lies within the distribution of states visited in the prefix of human demonstrations $\mathcal{D}^{\text{full}}[0:t]$. Conversely, this sequence of actions constitutes a corrective segment if the resulting state $s^{\text{h}}_{t+k}$ lies within the distribution of states visited after timestep $t$ in demonstrations $\mathcal{D}^{\text{full}}[t+1:H]$. We illustrate this concept in Figure 3.

How can recovery segments improve performance? Intuitively, recovery segments return the policy to familiar previous states, giving it another chance to attempt the task, whereas corrective segments show how to push the trajectory forward. This raises a question: can a policy actually learn to “reset” itself by imitating recovery segments, and why would this improve performance? Our key intuition is that in tasks where the set of valid initial states is broad (e.g., for the task of inserting a t-shirt on a hanger, any configuration where a shirt lies on a table and a hanger is in one of the robot’s arms somewhere above the shirt is an initial state) but the set of valid goal states is narrow (e.g., only when the shirt is correctly inserted on the hanger resting on the rack), resetting to a previously encountered state is generally far easier than executing a sub-task correctly (e.g., inserting the collar of the shirt onto the hanger). Because there are multiple familiar past states to reset to, recovery requires less precision and can be more sample-efficient to learn than solving the task. This is akin to the presence of a verification-generation gap (VG-gap) [38], where learning one skill (recovery) is more sample-efficient than the other. (For most long-horizon tasks in the real world, this structure naturally arises: progress on earlier sub-tasks is often essential for success on later ones. Furthermore, because large-scale imitation learning systems typically aggregate demonstrations from multiple teleoperators, the resulting data introduces substantial diversity, especially early on in the attempt to solve the task.)

This means that training via imitation learning on a mixture of recovery and corrective behavior should equip a policy with two complementary ways to improve performance: (1) by mimicking corrective segments (and full demonstration) to make progress in the first shot, with at least some probability, and (2) by resetting to a previous familiar state and retrying. When the setting exhibits the structure above, this ability to recover can be acquired with relatively little data. Once the policy can reliably recover from an anticipated failure, repeated retries would then naturally amplify the overall probability of producing at least one attempt that correctly executes the sub-task. In fact, the probability of never succeeding on a sub-task decays exponentially with the number of retries. This means that total suboptimality in imitation learning performance should decrease. This mechanism is akin to sequential test-time scaling [39] in large language models (LLMs): just as long chain-of-thought (CoT) models [16] improve performance and generalization by spending more tokens on backtracking and recovery before re-attempting a question, we expect RaC to achieve similar gains by performing backtracking and retrying directly in action space.
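The exponential decay described above can be made concrete with a short calculation. The sketch below is illustrative only: it assumes that attempts at a sub-task are independent, that each attempt succeeds with probability `p`, and that each recovery back to an in-distribution state succeeds with probability `r`. Neither the independence assumption nor these parameter names come from the paper.

```python
def success_within_retries(p: float, r: float, max_retries: int) -> float:
    """Probability that a sub-task succeeds within `max_retries` retries.

    Illustrative assumptions: attempts are independent, each attempt
    succeeds with probability p, and after a failed attempt the policy
    recovers to an in-distribution state with probability r (otherwise
    the episode is counted as a failure).
    """
    total = 0.0
    reach = 1.0  # probability of reaching the current attempt at all
    for _ in range(max_retries + 1):
        total += reach * p          # succeed on this attempt
        reach *= (1.0 - p) * r      # fail, then recover and try again
    return total


# With perfect recovery (r = 1), the failure probability is (1 - p)^(n + 1).
for n in range(4):
    print(n, success_within_retries(0.5, 1.0, n))
# prints 0.5, 0.75, 0.875, 0.9375 for n = 0, 1, 2, 3
```

Under these assumptions, a policy that succeeds only half the time per attempt exceeds 93% success after three retries, provided recovery is reliable. When `r` is below 1, the gain from each extra retry shrinks accordingly.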

How can recovery segments improve data scaling relative to HG-DAgger style methods? Recovery segments improve data efficiency because returning to familiar in-distribution states requires less data than mastering corrective sub-tasks in many cases. From in-distribution states, the policy already has strong supervision from existing data and the newly added corrective segments amplify this supervision. In contrast, methods like HG-DAgger demonstrate an entirely new behavior from an unfamiliar out-of-distribution state and require the policy to master it. As a result, performance as a function of data scale is expected to be lower for HG-DAgger since it does not necessarily amplify coverage over either in-distribution states or new unfamiliar states within limited intervention budgets.

Summary: Recovery and Correction Behaviors for Imitation
• Simply scaling full demonstrations from experts prioritizes optimal trajectories but leaves out-of-distribution (OOD) and failure states under-covered, leading to compounding error.
• Training on trajectories that recover from failure or OOD states and complete the task after recovery promotes recovery-then-retry behaviors, boosting performance and data efficiency.

4.2 Scaling Recovery and Correction Segments in Human Teleoperation

Next, we turn to the question of how to collect imitation data that contains a substantial proportion of both recovery and correction segments. In principle, one could simply instruct human teleoperators to artificially stage possible failure states, and demonstrate recovery and corrective behaviors [3]. However, such behaviors produced by humans from contrived or “fake” states may not reflect the out-of-distribution errors that a learned policy would actually encounter. Since policy mistakes are tightly coupled with the policy itself, a purely offline approach is unlikely to be effective (akin to LLMs [38]). A more effective alternative is to collect this data through human-in-the-loop interventions. Analyzing human intervention data in Section 5.3, we find that it is difficult to achieve a good balance between recovery and correction with no standardization of data collection protocol. Thus, we prescribe two simple rules for intervention:

Rule 1: Pair each recovery segment with a correction segment. Each intervention is structured to contain two phases. First, the human operator performs recovery behavior by executing a sequence of actions that bring the robot system back into a familiar in-distribution region of states. Then, the operator provides corrective behavior, attempting to push the current sub-task forward (see Figure 3 for an illustration). This simple structure ensures that every intervention teaches the policy both how to reset itself and how to make progress, rather than overemphasizing one or the other.

Rule 2: Terminate after intervention. After an intervention concludes, we terminate the entire episode. In long-horizon tasks, later sub-tasks depend on the correct execution of earlier ones. Allowing the rollout to continue after human intervention would contaminate later sub-tasks with a distribution of states induced by a combination of the learned policy and the human teleoperator. While not problematic in itself, learning on this distribution of states might not improve the policy under its own induced distribution of states when it attempts the later sub-tasks, which can be fairly different from the joint human and policy distribution in a particular intervention rollout. Intervention data for later sub-tasks collected after an intervention is therefore unlikely to help much while costing additional samples. Terminating early instead allocates more of the total data collection budget to improving early sub-tasks.
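The two rules can be sketched as a small state machine over a single rollout. This is a toy illustration and not the authors' implementation: the function `rac_episode`, its callback arguments, and the one-dimensional environment are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

Transition = Tuple[float, float, str]  # (state, action, segment label)


def rac_episode(
    s0: float,
    policy: Callable[[float], float],
    human: Callable[[float, str], float],
    step: Callable[[float, float], float],
    failing: Callable[[float], bool],
    in_distribution: Callable[[float], bool],
    subtask_done: Callable[[float], bool],
    max_steps: int = 100,
) -> List[Transition]:
    """Roll out one episode, labeling each transition by who produced it."""
    traj: List[Transition] = []
    s, phase = s0, "policy"
    for _ in range(max_steps):
        if phase == "policy" and failing(s):
            phase = "recovery"       # the human takes over
        if phase == "recovery" and in_distribution(s):
            phase = "correction"     # Rule 1: recover first, then correct
        if phase == "correction" and subtask_done(s):
            break                    # Rule 2: terminate after the intervention
        a = policy(s) if phase == "policy" else human(s, phase)
        traj.append((s, a, phase))
        s = step(s, a)
    return traj


# Hypothetical 1-D toy: the policy drifts away from the goal at x >= 3.
traj = rac_episode(
    s0=0.0,
    policy=lambda s: -1.0,
    human=lambda s, phase: 1.0,
    step=lambda s, a: s + a,
    failing=lambda s: s <= -2.0,
    in_distribution=lambda s: 0.0 <= s <= 1.0,
    subtask_done=lambda s: s >= 3.0,
)
labels = [label for _, _, label in traj]
print(labels)
```

In this toy run the episode contains two policy steps, two recovery steps, and three correction steps, and it ends as soon as the sub-task is completed by the human.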

Summary: Balanced composition of recovery and correction
For the widely-used DROID dataset [23], an analysis of its 1% sub-sample reveals that only 3.68% of episodes contain ≥ 1 recovery and 16.58% contain ≥ 1 correction. Similarly, in Section 5.3, our HG-DAgger data skews heavily toward corrections with scarce recovery segments. RaC standardizes data collection for interventions: pair a recovery with a correction, then terminate, to produce a balanced mixture of skills, improving robustness and data efficiency.

4.3 Shared Autonomy Interface and Guidance for Teleoperators

Figure 4: VR handset interface for shared autonomy in RaC. We design and implement a “clutch” design that enables smooth handover from the robot policy to the human teleoperator.

To enable effective interventions for RaC, we design a lightweight shared-autonomy interface using Oculus Quest VR controllers. Our design uses a “clutch” mechanism that unifies policy execution and human takeover: when the side button is pressed, controller motions are mapped directly to the end effector, enabling the human to take over control and intervene; when the side button is released, the robot follows the learned policy. To reduce operator effort, we adopt a local-frame registration scheme with relative pose deltas. Let $v$ denote the fixed VR headset coordinate frame, and let $c_t$ denote the hand-controller frame at time $t$. At clutch engagement ($t=0$), we define the controller’s pose relative to the headset frame, $T^{v}_{c_0}$, as the local base frame. Subsequent poses are then expressed in this local frame as $T^{c_0}_{c}(t)=(T^{v}_{c_0})^{-1}T^{v}_{c_t}$, with translational $\Delta p_k=p_k-p_{k-1}$ and rotational $\Delta R_k=R_{k-1}^{\top}R_k$ offsets used to parameterize end-effector commands. This design eliminates the need for global posture alignment, allowing operators to take over and intervene with minimal friction. The interface is shown in Figure 4.
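The local-frame registration above can be sketched with 4×4 homogeneous transforms. This is a minimal sketch, assuming poses are given as NumPy matrices; the helper names (`rot_z`, `make_T`, `local_pose`, `pose_delta`) are hypothetical, and how the deltas are finally mapped to robot commands is not specified here.

```python
import numpy as np


def rot_z(theta: float) -> np.ndarray:
    """Rotation about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def make_T(R: np.ndarray, p) -> np.ndarray:
    """Build a 4x4 homogeneous transform from rotation R and translation p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T


def local_pose(T_v_c0: np.ndarray, T_v_ct: np.ndarray) -> np.ndarray:
    """Controller pose in the clutch frame: (T^v_{c0})^{-1} T^v_{ct}."""
    return np.linalg.inv(T_v_c0) @ T_v_ct


def pose_delta(T_prev: np.ndarray, T_curr: np.ndarray):
    """Offsets between consecutive local poses: dp = p_k - p_{k-1}, dR = R_{k-1}^T R_k."""
    dp = T_curr[:3, 3] - T_prev[:3, 3]
    dR = T_prev[:3, :3].T @ T_curr[:3, :3]
    return dp, dR


# The controller starts at an arbitrary pose in the headset frame, then moves
# 5 cm along its own x-axis while rotating 0.1 rad about its own z-axis.
T_v_c0 = make_T(rot_z(np.pi / 2), [1.0, 2.0, 3.0])
T_v_c1 = T_v_c0 @ make_T(rot_z(0.1), [0.05, 0.0, 0.0])

local0 = local_pose(T_v_c0, T_v_c0)   # identity at clutch engagement
local1 = local_pose(T_v_c0, T_v_c1)
dp, dR = pose_delta(local0, local1)
print(dp)
```

Because every pose is expressed relative to the controller pose at clutch engagement, the recovered offsets depend only on the hand motion, not on where or how the controller was held when the button was pressed.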

Figure 5: Visual aid for guiding intervention data collection. We overlay a heatmap of the grippers’ visitation frequency to illustrate the in-distribution regions that a teleoperator should recover to. In the clamshell-takeout-box-packing task, when the policy fails to scoop the burger using the spatula in sub-task 3 as shown above, the teleoperator recovers to a position inside the bounding box marked with sub-task 3 before retrying.

Guidance for intervention data collection. To facilitate operators in demonstrating trajectories that adhere to the recovery then correction rule, we build a lightweight software tool using the image segmentation model SAM2 [36]. This tool renders a robot end effector visitation frequency heatmap by tracking robot grippers across all RGB frames recorded by the overhead camera in the initial round of full demonstration data collection. As shown in Figure 5, during data collection, we overlay this heatmap onto the overhead camera’s display window to provide visual aid, showing in-distribution regions where the robot grippers should recover back to upon intervention. Our approach is one way that can be used to guide recovery demonstrations towards in-distribution regions.
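A visitation heatmap of this kind can be sketched as a 2-D histogram over tracked gripper pixel positions. The sketch below assumes the gripper positions have already been extracted as (x, y) pixel coordinates by some tracker; it does not call SAM2, and the function names, bin counts, and red-channel blending are illustrative choices rather than details from the paper.

```python
import numpy as np


def visitation_heatmap(points_xy, height: int, width: int, bins=(12, 16)) -> np.ndarray:
    """Normalized 2-D histogram of gripper pixel positions, shape `bins`."""
    pts = np.asarray(points_xy, dtype=float)
    hist, _, _ = np.histogram2d(
        pts[:, 1], pts[:, 0],                 # rows (y) first, then columns (x)
        bins=bins, range=[[0, height], [0, width]],
    )
    if hist.max() > 0:
        hist = hist / hist.max()              # scale to [0, 1]
    return hist


def overlay(image: np.ndarray, heat: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend the heatmap into the red channel of an RGB uint8 image."""
    h, w = image.shape[:2]
    # Nearest-neighbour upsampling of the coarse heatmap to image resolution.
    rows = np.arange(h) * heat.shape[0] // h
    cols = np.arange(w) * heat.shape[1] // w
    heat_full = heat[rows[:, None], cols[None, :]]
    out = image.astype(float).copy()
    weight = alpha * heat_full
    out[..., 0] = (1.0 - weight) * out[..., 0] + weight * 255.0
    return out.astype(np.uint8)


# Hypothetical tracked gripper positions (x, y) in a 160x120 overhead image.
points = [(15, 25), (15, 25), (105, 55)]
heat = visitation_heatmap(points, height=120, width=160)
frame = np.zeros((120, 160, 3), dtype=np.uint8)
shown = overlay(frame, heat)
print(heat.shape, shown.shape)
```

Frequently visited cells appear brighter in the overlay, which gives the operator a rough target region for the recovery segment.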

4.4 Policy Architecture and Training via Imitation Learning

We now run imitation learning from a dataset containing multi-modal, long-horizon behaviors of various types: 1) full demonstrations, 2) the policy’s own full successes from online rollouts, and 3) human intervention segments with recoveries and corrections. Fitting various sources of data demands a high-capacity policy architecture, with sufficiently expressive output heads [9, 2, 30]. Therefore, we utilize a flow-matching [28] policy to fit an action chunk [49], $A_t=[a_t,a_{t+1},\ldots,a_{t+H-1}]$, conditioned on observation $o_t=[I^{1}_{t},I^{2}_{t},I^{3}_{t},q_t]$, where $I^{i}_{t}$ is the $i$-th RGB camera image and $q_t$ is a vector of robot states containing end-effector velocities and relative distance from each other at timestep $t$. For all tasks, we use $H=60$, equivalent to predicting one second of actions into the future.

Figure 6: Policy architecture. We train all imitation learning policies using a multi-modal diffusion transformer (MM-DiT) architecture [12] via a flow matching objective.

Our policy is a 300 million parameter, multimodal diffusion transformer (MM-DiT) architecture [12]. We use separate ResNet-50 [17] vision encoders for all 3 camera views (1 overhead and 2 wrist cameras) in our real-world experiments and ResNet-18 encoders in simulation. We optimize a conditional flow matching loss [28] for training:

$$\mathcal{L}_{\text{Flow}}(\theta)=\mathbb{E}_{\begin{subarray}{c}o_{t},\,A_{t}\sim\mathcal{D},\\ x^{0}\sim\mathcal{N}(0,I_{d}),\\ \tau\sim\text{Unif}([0,1])\end{subarray}}\left[\left\|v_{\theta}(\tau,o_{t},x^{\tau})-\big(A_{t}-x^{0}\big)\right\|_{2}^{2}\right], \tag{4.1}$$

where $x^{\tau}$ denotes an interpolant computed at time $\tau$ of the flow, $v_{\theta}(\tau,o_{t},x^{\tau}):[0,1]\times S\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is the velocity at $x^{\tau}$, and $d$ is the total dimensionality of the action chunks we use. Importantly, when sampling training data from $\mathcal{D}$, we do not include any transitions from the robot’s own rollouts unless the trajectory reaches full task completion without any human intervention. This design choice is consistent with HG-DAgger [22, 46], but differs from other methods such as IWR and its follow-ups [33, 29], which filter segments based on human knowledge. Additional details of policy training are in Appendix A.
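As a minimal sketch of a single-sample estimate of Equation 4.1, the snippet below assumes the common linear interpolant $x^{\tau}=(1-\tau)x^{0}+\tau A_{t}$. The text does not spell out the interpolant; this choice is only an assumption that happens to be consistent with the regression target $A_{t}-x^{0}$. The names `interpolate`, `flow_matching_loss`, and `oracle_velocity` are hypothetical, and the "oracle" stands in for the learned network $v_{\theta}$ purely for illustration.

```python
import random

def interpolate(x0, action_chunk, tau):
    """Linear interpolant x^tau = (1 - tau) * x^0 + tau * A_t (an assumption)."""
    return [(1.0 - tau) * n + tau * a for n, a in zip(x0, action_chunk)]

def flow_matching_loss(velocity_fn, obs, action_chunk, rng):
    """Single-sample estimate of the squared error inside Equation 4.1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in action_chunk]    # x^0 ~ N(0, I_d)
    tau = rng.random()                                   # tau ~ Unif([0, 1))
    x_tau = interpolate(x0, action_chunk, tau)
    target = [a - n for a, n in zip(action_chunk, x0)]   # A_t - x^0
    pred = velocity_fn(tau, obs, x_tau)
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def oracle_velocity(action_chunk):
    """Hypothetical stand-in for v_theta: exact velocity for one known chunk."""
    def v(tau, obs, x_tau):
        # On the linear path, A_t - x^0 = (A_t - x^tau) / (1 - tau).
        return [(a - x) / (1.0 - tau) for a, x in zip(action_chunk, x_tau)]
    return v
```

With the oracle velocity the loss is zero up to floating-point error, which is a quick sanity check that the target and the interpolant agree; a real policy would replace `oracle_velocity` with a network conditioned on `obs`.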

During inference, we generate actions by taking 10 Euler integration steps along the learned vector field from $\tau=0$ to $\tau=1$, starting from random noise $A^{0}_{t}\sim\mathcal{N}(0,I_{d})$. Following [9, 10, 2], we run policy inference once every 0.5 seconds, i.e., we execute the first half of each action chunk and then replan. Complete pseudocode of the procedure is shown in Algorithm 1.
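The inference procedure can be sketched as follows, under stated assumptions: `velocity_fn` is a hypothetical callable standing in for the learned vector field, the action chunk is stored as a flat list, and the helper names are illustrative rather than taken from the authors' code.

```python
import random

def sample_action_chunk(velocity_fn, obs, dim, num_steps=10, rng=random):
    """Integrate dx/dtau = v(tau, obs, x) from tau=0 to tau=1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt
        v = velocity_fn(tau, obs, x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def actions_to_execute(chunk, action_dim, horizon=60):
    """Split a flat chunk into per-step actions; keep the first half, then replan."""
    steps = [chunk[i * action_dim:(i + 1) * action_dim] for i in range(horizon)]
    return steps[: horizon // 2]
```

With $H=60$ covering one second of actions, executing `horizon // 2` steps corresponds to replanning every 0.5 seconds.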

Algorithm 1 RaC Data Collection Protocol

1: Given per-round human data collection budget $\mathcal{B}$, measured in number of frames; total number of human intervention data collection rounds $K$.

2: Initialize flow-matching policy $\pi_{\theta}^{k=0}$; dataset $\mathcal{D}_{0:K}\leftarrow\varnothing$

3: Collect $\mathcal{B}$ frames of expert demonstrations in $\Delta\mathcal{D}_{0}$; $\mathcal{D}_{0:K}\leftarrow\Delta\mathcal{D}_{0}$; $\pi_{\theta}^{k=0}\leftarrow\textsc{Train}(\mathcal{D}_{0:K})$ via Equation 4.1

Human Intervention Data Collection Rounds

1: for $k=1$ to $K$ do

2:   initialize human policy $\pi_{H}$, intervention function $I$

3:   $\Delta\mathcal{D}_{k}\leftarrow\varnothing$; $b\leftarrow 0$ $\triangleright$ $b$: data collection budget used in this round

4:   while $b<\mathcal{B}_{k}$ do

5:     $s_{0}\leftarrow\texttt{env.reset()}$; $\texttt{traj}\leftarrow[\,]$; $\texttt{intervened}\leftarrow\textbf{false}$; $t\leftarrow 0$

6:     while not $\texttt{env.done()}$ do

7:       if $I(s_{t})=0$ then $a_{t}\sim\pi_{\theta}^{k-1}(\cdot\mid s_{t})$; $\texttt{is\_human}\leftarrow 0$

8:       else $a_{t}\sim\pi_{H}(\cdot\mid s_{t})$; $\texttt{is\_human}\leftarrow 1$; $\texttt{intervened}\leftarrow\textbf{true}$ $\triangleright$ Rule 1: Pair each recovery with a correction

9:       $s_{t+1}\leftarrow\texttt{env.step}(a_{t})$; $\texttt{traj.push}(s_{t},a_{t},\texttt{is\_human})$; $t\leftarrow t+1$

10:       if $\texttt{is\_human}=0$ and $\textsc{InterventionDone}()$ then break $\triangleright$ Rule 2: Terminate after intervention concludes

11:     if $\texttt{intervened}=\textbf{false}$ then $\triangleright$ If an entire trajectory has no human intervention $\Rightarrow$

12:       $\Delta\mathcal{D}_{k}\cup\!=\texttt{traj}$ $\triangleright$ add full trajectory into dataset, with no human budget counted

13:     else

14:       $\Delta\mathcal{D}_{k}\cup\!=\{(s,a)\in\texttt{traj}:\texttt{is\_human}=1\}$ $\triangleright$ add only human intervention transitions into dataset

15:       $b\leftarrow b+|\texttt{traj}|$ $\triangleright$ charge full episode length to budget

16:   $\mathcal{D}_{0:K}\cup\!=\Delta\mathcal{D}_{k}$; $\pi_{\theta}^{k}\leftarrow\textsc{Train}(\mathcal{D}_{0:K})$ $\triangleright$ Aggregate datasets, then train policy via the flow-matching loss in Equation 4.1
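The per-episode bookkeeping in steps 11 to 15 of Algorithm 1 can be illustrated with a short sketch. The function name `process_episode` and the tuple layout of `traj` are assumptions made for illustration; the environment, the policies, and the intervention signal are left out.

```python
def process_episode(traj):
    """Decide what one rollout contributes to the dataset and to the budget.

    traj: list of (state, action, is_human) tuples recorded during the rollout.
    Returns (transitions_to_add, frames_charged_to_human_budget).
    """
    intervened = any(is_human for _, _, is_human in traj)
    if not intervened:
        # No human intervention: keep the full trajectory, charge no human budget.
        return [(s, a) for s, a, _ in traj], 0
    # Otherwise keep only human-controlled transitions and charge the full
    # episode length, as in steps 14 and 15.
    kept = [(s, a) for s, a, is_human in traj if is_human]
    return kept, len(traj)
```

Note that the policy's own transitions from an intervened episode are discarded even though the whole episode counts against the budget, which keeps the accounting conservative with respect to human time.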

5 Experimental Evaluation of RaC

Our goal is to evaluate the data efficiency and scaling of RaC on bimanual, long-horizon manipulation tasks. Concretely, we aim to answer the following questions: (1) Does RaC improve data scaling compared to standard collection of full human demonstrations, including existing state-of-the-art results? (2) How does RaC compare to human-in-the-loop imitation learning methods such as HG-DAgger [22]? (3) Is increasing the proportion of recovery behaviors critical for effective performance? (4) How do policies learned by RaC differ from traditional imitation learning policies? We answer these questions through experiments on three real-world long-horizon tasks. We also use a combination of real and simulated experiments to establish the role of recovery behaviors in enabling test-time scaling on long-horizon tasks, and present additional ablations on the design choices of RaC in Section 5.4.

5.1 Evaluation Domains and Task Setups

We design four bimanual dexterous manipulation tasks, three in the real world and one in simulation (see Figure 7). All tasks require controlling the robot over extended horizons and completing interdependent sub-tasks in sequence. Our real-world tasks are inspired by some of the most difficult challenges explored in prior work [50, 26]. We briefly describe these tasks below, refer readers to the detailed task definitions in the Appendix, and show videos on our website.

Figure 7: Long-horizon robot tasks. We study 3 real-world tasks, shirt-hanging, airtight-container-lid-sealing, clamshell-takeout-box-packing, and a simulated bimanual-assembly task.

Comparisons and evaluation protocol. We compare the scaling characteristics, performance, and learned behaviors of RaC against two approaches for imitation learning: (1) scaling up batched full expert data collection, and (2) performing human-in-the-loop interventions as per HG-DAgger [22]. For each task, we allocate a total budget of $K\times N$ demonstrations for the batched setting, where $N$ is a base number of demonstrations chosen in advance. To match this budget, we run $K$ rounds of human-in-the-loop data collection, each with an equivalent per-round budget, and train the policy in each round using the corresponding intervention data. We conduct evaluations with 60 trials for the real-world tasks and 100 trials for the simulation task, with varying initial configurations (details in Appendix C and videos on our website). When rolling out the trained policy during evaluation, we record the performance on each sub-task up to an irrecoverable failure, at which point we terminate the episode. We measure sub-task performance with a binary success-or-failure indicator, without assigning partial credit. We report two performance metrics: task success rates and task progress scores. The task success rate is the percentage of trials that complete all sub-tasks, while the task progress score is the number of sub-tasks a trial is able to complete.
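The two metrics can be computed from per-trial sub-task counts as in the sketch below. The function name `evaluate_trials` and the input format (the number of sub-tasks completed in each trial before termination) are assumptions for illustration; averaging the progress score over trials is likewise an assumption about how per-trial scores are aggregated.

```python
def evaluate_trials(completed_subtasks, num_subtasks):
    """Compute (task success rate, mean task progress score) over a set of trials.

    completed_subtasks: one integer per trial, the number of sub-tasks
    completed before the episode ended (no partial credit within a sub-task).
    """
    n = len(completed_subtasks)
    # A trial counts as a success only if every sub-task was completed.
    success_rate = sum(1 for c in completed_subtasks if c == num_subtasks) / n
    mean_progress = sum(completed_subtasks) / n
    return success_rate, mean_progress
```

For example, four hypothetical trials completing 4, 4, 2, and 0 of 4 sub-tasks give a success rate of 0.5 and a mean progress score of 2.5.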

5.2 Main Results: One Order of Magnitude Improvement in Data Efficiency

Figure 8: Performance scaling for RaC as a function of human-collected frames on real-world tasks. Note that within $K=6$ rounds for shirt-hanging, $K=10$ rounds for airtight-lid-sealing, and $K=9$ rounds for takeout-box-packing, we observe the best-known results for tasks of a similar difficulty from prior work. The top row shows average progress over various sub-tasks, the bottom row shows full long-horizon task success rate. On the right, we compare RaC to various other baseline approaches based on HG-DAgger and cloning full demonstration data, and observe a substantial improvement in data efficiency.

Despite the challenges associated with coherent long-horizon execution, deformable object handling, and contact-rich manipulation, our policies reach high success rates and task progress scores with only modest data requirements. Strikingly, just 5 hours of training data suffice to surpass a full-task success rate of 75%. To highlight the data efficiency gains, consider the shirt-hanging task: prior works [50, 7] report needing thousands of expert demonstrations or more than one hundred hours of teleoperation data to achieve a success rate comparable to that of RaC. RaC achieves similar or better results with an order of magnitude less data, illustrating its efficacy in scaling imitation learning (Table 1 and Figure 8).

| Name | Policy Architecture | Model Size | Training Data Size | SR |
| --- | --- | --- | --- | --- |
| ALOHA Unleashed [50] | Diffusion Transformer policy | 217M | $\sim$89 hours (5345 shirt-hanging expert demos) | 75.0% |
| Seed GR-3 [7] | Vision-Language-Action model | 4B | 116 hours of shirt-hanging expert demos and vision-language data | $\sim$63.6% |
| Ours (RaC) | Flow-matching Transformer policy | 368M | 5 hours (RaC data: expert, recovery, and correction) | 78.3% |

Table 1: Comparison to similar shirt-hanging tasks in prior work. Under similar task setups and difficulty, the full task success rate (“SR”) of the RaC policy is higher than that of other methods while using an order of magnitude less data. See Appendix D for details.

Comparisons on real-world tasks. Since scaling up batched data collection across all real-world tasks was infeasible due to the prohibitive cost of collecting such large expert datasets, we instead scaled the batched data collection baseline on one representative task, shirt-hanging. As shown in Figure 8, RaC not only achieves substantially higher absolute performance and task progress, but also delivers at least a 2$\times$ improvement in data efficiency compared to the batched data collection approach. RaC also consistently outperforms HG-DAgger. This result does not arise from a subpar baseline: our HG-DAgger implementation exhibits performance trends consistent with prior work, in that it outperforms batched data collection under the same amount of human-collected data. Finally, we note that RaC exhibits a markedly steeper scaling curve (“higher slope”) than either baseline in Figure 8.

5.3 Examining the Properties of Policies Learned via RaC

Figure 9: Performance profiles for RaC and the batched data collection approach. For both real-world shirt-hanging (left two plots) and simulation assembly (right two plots), RaC rapidly reduces the fraction of rollouts that make little progress and steadily shifts probability mass toward later sub-task completions and full success. This trend, however, is neither consistent nor strong enough across sub-tasks in the case of batched data collection, which trains on full demonstrations.

Result 1: Robustness of intermediate RaC policies. Having established the efficacy of RaC, we next analyze the properties of the learned policies in a more systematic manner. To this end, Figure 9 visualizes the distribution of sub-tasks completed by intermediate policy checkpoints produced during successive rounds of human intervention (for RaC) and as we scale data (for batched data collection). We observe that the fraction of on-policy rollouts making little progress rapidly decreases with more rounds when using RaC. In other words, RaC systematically shrinks the long tail of rollouts that fail or stall early. In contrast, training on increasing amounts of batched full demonstration data does not exhibit the same kind of progress on all sub-tasks, especially in simulation (Figure 9, right). Because our evaluations begin from a broad set of initial configurations, this experiment also highlights the robustness of RaC. To summarize, by explicitly scaling recovery, RaC drives progress even in the difficult “tail” cases, a persistent failure mode that is common lore in imitation learning.


Figure 10: Test-time scaling of RaC policies with the number of recovery segments. We observe a strong linear scaling relationship between the number of recovery segments upon policy deployment and success rate of policies produced by later rounds of RaC. This is a form of test-time scaling analogous to that in LLMs [35].

Figure 11: Distribution of lengths of successful rollouts for various approaches. Note that RaC policies produce the longest rollouts on average due to the presence of recovery behavior. HG-DAgger produces the second-highest median rollout length.

Result 2: “o1-style” test-time scaling for robotic policies. Next, we study whether performance scales with more recovery behavior at deployment. To do so, we analyze the subset of evaluation rollouts that successfully solve all sub-tasks across different rounds, and annotate each rollout with the number of recovery attempts it contains. In Figure 10, we plot the average number of recovery segments observed against the task success rate. The correlation coefficients $r$ indicate a linear relationship between task success and recovery frequency. In other words, as the policy learns to demonstrate more recovery behaviors, its overall performance improves. To readers familiar with LLMs, this pattern resembles favorable test-time scaling [35]: just as reasoning LLMs [35, 16] perform better when they produce longer CoTs that illustrate backtracking and error correction, robot policies that scale the number of recovery segments directly in the space of action sequences are more likely to succeed.
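For readers who want to reproduce this kind of analysis on their own rollouts, the sketch below computes a Pearson correlation coefficient between the average number of recovery segments per round and the corresponding success rate. The text does not state which correlation coefficient is used, so Pearson's $r$ is an assumption here, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

Here `xs` would hold the per-round average number of recovery segments and `ys` the per-round success rates; values of $r$ close to 1 indicate a strong positive linear relationship.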

Result 3: Rollouts from RaC policies are generally longer and more successful. On the simulation task, we analyzed the wall-clock duration of successful evaluation rollouts across methods (Figure 11). Successful rollouts from RaC are skewed towards longer lengths, reflecting recovery behaviors that keep the task on track. For RaC, longer length is also correlated with better average performance and more successful rollouts. Successful HG-DAgger rollouts attain the second-highest median length, since the robot is still trained to use corrective segments to succeed from out-of-distribution states. Policies trained on full demonstration data can likely only succeed when they stay within distribution, resulting in the shortest median successful rollout length. Interestingly, though, this approach also produces one outlier rollout that is longer than those of HG-DAgger, in which the robot arms kept applying excessive force until the insertion succeeded, without explicitly recovering from or correcting failure states.

5.4 Ablation Studies for the RaC Data Collection Protocol

Finally, we present ablation studies to better understand the properties of the human intervention data collected by RaC across training rounds. In Figure 12, we visualize the composition of intervention data over 4 rounds in the simulation task. We compare data collected using the full RaC approach (“Ours”) and RaC without enforcing ‘recover-then-correct’ (“Ours w/o Rule 1”, i.e., HG-DAgger with only Rule 2). Recall that these rules were prescribed in Section 4.2.


Figure 12: Ablation studies on the simulation task. Left: Assessing the composition of human intervention data collected in each round. Note that data collected via RaC maintains a high proportion of recovery segments along with corrective frames. On the other hand, the intervention data collected by HG-DAgger skews heavily towards corrective frames. Right: Utilizing “Rule 2” and terminating the intervention episode early yields better data scaling of performance than continuing the policy rollout after the recover-then-correct intervention is complete. This showcases the importance of both rules prescribed by RaC.

We classify each intervention frame as belonging to either a recovery segment or a corrective segment. Observe in Figure 12 (left) that while RaC maintains a roughly balanced ratio of recovery to corrective frames (between 1:1 and 1:2), conventional intervention data exhibits a highly skewed distribution dominated by corrective frames, with the proportion of recovery frames decreasing sharply in later rounds. Specifically, conventional intervention data contains recovery-to-correction ratios between 1:3 and 1:10. The total number of intervention frames naturally decreases as policies improve and require fewer interventions.

Next, we study the effect of Rule 2 in RaC: truncating an episode after human intervention concludes. In the simulation task (Figure 12), we observe that terminating early after an intervention alone (“Ours w/o Rule 1”) yields more effective performance scaling than continuing policy rollouts after human intervention (“Ours w/o Rule 1&2”, i.e., HG-DAgger). We hypothesize that this effect arises because allowing the rollout to continue after human intervention completes contaminates later parts of the trajectory with states influenced by both the human and the policy, producing data that are out-of-distribution for a learned policy. By terminating right after the intervention, we ensure that the collected data cleanly reflect recovery–correction behavior, while subsequent sub-tasks are reached only with the policy’s own distribution in future rounds, leading to more efficient data scaling.

6 Discussion, Conclusion, and Future Work

We presented an approach, RaC, for scaling imitation learning in the real world. Our core idea is to scale not just the quantity of data, but the type of data, explicitly pairing recovery and correction behaviors collected through human interventions. By doing so, we enabled policies to mitigate compounding errors, retry from failures, and achieve substantially higher data efficiency than standard teleoperation or correction-only approaches. Our experiments demonstrated that this paradigm yields robust policies on long-horizon, contact-rich tasks with an order of magnitude less data than prior work and much better data efficiency than our comparisons. We also illustrated a form of “test-time scaling” by showing that more recovery segments and longer rollouts correlate with higher performance.

Future work. We believe that there are quite a few avenues for future work. First, analogous to how autonomous RL began performing substantially better on top of properly mid-trained initializations for LLMs [45], we believe that policies trained via RaC bear the potential to serve as good initializations for online RL fine-tuning on a real robot. Unlike typical imitation pre-trained policies that attempt to perform “optimal” behavior (and typically lose track upon failing to accomplish the task), we hypothesize that policies from RaC would naturally provide more structured exploration and coverage during online RL due to the presence of recovery behavior. Recovery provides natural “stitching” points [13] which might also be amenable to value-based training. Another interesting direction for future work is to apply RaC on top of generalist vision-language-action (VLA) models [2, 24, 41]. Finally, while prior results do show some examples of recovery behaviors in VLA models, it is unclear if such behaviors systematically emerge in most settings or not, and studying this aspect rigorously (for example, by plotting test-time scaling curves analogous to Figure 10) is also useful for the community.

Acknowledgements

We thank Yuxiao Qu, Bhavya Agrawalla, Lehong Wu, Max Sobol Mark, Anikait Singh, Yufei Wang, Divyam Goel, and Yiran Tao for feedback on an earlier version of this paper. We thank Jason Jingzhou Liu for help with RMPflow infrastructure. We thank the members of CMU AIRe and RCHI labs for support and feedback. AK thanks Abhishek Gupta, Dhruv Shah, Amrith Setlur, and Max Simchowitz for informative discussions and feedback. This work was supported in part by an Apple seed grant, the Office of Naval Research under N00014-24-12206, and the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under award number 1R01EB036842-01. We thank the Babel compute cluster at CMU, the TRC program of Google Cloud, and the National Center for Supercomputing Applications for providing computational resources that supported this work.

References

Appendices

Appendix A Policy Architecture and Training Details

We train all RaC imitation learning policies with the same model architecture and training configuration, detailed below. With the multi-modal DiT (MM-DiT) architecture [12], we use two separate modalities, i.e., two sets of transformer weights, to model action generation conditioned on robot observations. The first set of transformer weights processes robot observations, including image tokens from the three camera views after the ResNet encoders and a robot proprioceptive state token after an MLP encoder. The second set of transformer weights processes noised action tokens. MM-DiT joins the sequences of the two modalities for the attention operation, such that both representations can work in their own spaces while taking the other one into account. This design is similar to the action expert in [2].

The ResNet encoders used in this work are fine-tuned from weights pre-trained on ImageNet.

All models are trained on servers with either 4 RTX 6000 Ada GPUs or 8 L40S GPUs.

Table 2: Model training configurations. Training hyperparameters used for all experiments.

| Config | Value |
| --- | --- |
| Optimizer | AdamW (default) |
| Learning Rate | $1\times 10^{-4}$ (const.) |
| Global Batch Size | 512 |
| Training Length | 200 epochs |
| State Dimension | 40 |
| Action Dimension | 14 |
| Action Horizon | 60 |

Table 3: Flow-matching policy details. Architecture specifications of our policy model.

| Detail | Value |
| --- | --- |
| MM-DiT modalities [12] | 2 |
| Flow Matching Steps | 10 |
| MM-DiT Hidden Size | 768 |
| MM-DiT Depth | 12 |
| MM-DiT Heads | 12 |
| Vision Encoder | ResNet-50 (real) / ResNet-18 (sim) |
| Total Parameters | 367.865M |

Appendix B Example Rollouts on Various Tasks

Figure 13: RaC rollout on the shirt-hanging task. In this task, recovery corresponds to driving the gripper and hanger backwards and correction corresponds to reinserting the hanger again.

Figure 14: RaC rollout on the airtight-container-lid-sealing task. In this task, recovery corresponds to driving the gripper and hanger backwards and correction corresponds to reinserting the hanger again.

Figure 15: RaC rollout on the takeout-box-packing task. In this task, recovery corresponds to driving the gripper backwards and correction corresponds to regrasping the spatula again.

Appendix C Bimanual Manipulation Tasks Evaluation Protocols

When reporting results and producing the scaling curves, we compute the $95\%$ confidence interval for the task progress scores, where the maximum score equals the number of sub-tasks within each task. For the full task success rates, where each trial receives a binary score indicating whether the robot completed the entire task successfully, we compute the $95\%$ Wilson score interval, a confidence interval for a binomial proportion.
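As a sketch of the Wilson score interval mentioned above, the function below uses the standard closed form with $z=1.96$ for a $95\%$ interval. The function name `wilson_interval` is hypothetical, and the example call assumes that a 78.3% success rate over 60 real-world trials corresponds to 47 successes, which is an inference from the reported numbers rather than a stated count.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
    ) / denom
    return center - half_width, center + half_width

# Example: 47 successes out of 60 trials (assumed count, see the note above).
low, high = wilson_interval(47, 60)
```

Unlike the normal-approximation interval, the Wilson interval stays within $[0, 1]$ and behaves sensibly when the success rate is close to 0 or 1, which matters for early-round policies with few successes.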

Appendix D Comparison to Prior Works on the Shirt-Hanging Task

ALOHA Unleashed. In ALOHA Unleashed [50], the shirt-hanging task is performed with a bimanual ALOHA robot [49] at two difficulty levels: ShirtEasy and ShirtMessy. ShirtEasy uses 5345 full trajectories and ShirtMessy uses 3313 full trajectories, collected with a fleet of robots and expert teleoperators. In our work, the shirt-hanging task is designed to be as close to ShirtEasy as possible. They report a full task success rate of $75\%$ on the ShirtEasy task with a Diffusion Policy trained on both the ShirtEasy and ShirtMessy data. To standardize the comparison of data size across works, we approximate the length of the ShirtEasy dataset from ALOHA Unleashed in hours by assuming an average of 1 minute (60 seconds) per trajectory. Thus, we estimate a total of $5345\times 60/3600\approx 89$ hours for the ShirtEasy dataset.

Seed GR-3. In Seed GR-3 [7], the shirt-hanging task is performed on a custom-designed bimanual mobile manipulation platform. The task differs from ours and from [50] in the final step, where the robot “needs to rotate its mobile base from the table to the drying rack to hang the clothes”, while the other sub-tasks remain largely consistent. Importantly, Seed GR-3 reports performance as average task progress, where a full success corresponds to $1.0$ or $100\%$ and successful completion of each sub-task contributes a fractional score towards the overall task progress. This differs from the success rate metric (Table 1), where only fully successful trials receive a score of $1.0$ and other trials receive no partial credit. To standardize the evaluation metrics, since ALOHA Unleashed [50] does not report task progress scores, we estimate the full task success rate for GR-3 [7] from the Sankey diagram in Figure 10 of their paper, by dividing the vertical height of the bar representing the last sub-task by the vertical height of the region representing the start. This results in a ratio of $7/11\approx 0.636$.
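The two normalizations used in this appendix are simple arithmetic, sketched below for convenience. The helper names are hypothetical; the 60-second average trajectory length and the 7-to-11 height ratio are the approximations described above, not measured quantities.

```python
def dataset_hours(num_trajectories, seconds_per_trajectory=60):
    """Approximate dataset length in hours from a trajectory count."""
    return num_trajectories * seconds_per_trajectory / 3600

def success_rate_from_heights(final_height, start_height):
    """Estimate a full-task success rate from Sankey diagram bar heights."""
    return final_height / start_height

shirt_easy_hours = dataset_hours(5345)              # roughly 89 hours
gr3_success_rate = success_rate_from_heights(7, 11)  # roughly 0.636
```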