RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning

Haoran Geng1*, Feishi Wang1,2,3*, Songlin Wei2*, Yuyang Li2,9*, Bangjun Wang3*, Boshi An2*,
Charlie Tianyue Cheng1*, Haozhe Lou3, Peihao Li1,4, Yen-Jen Wang1, Yutong Liang2, Dylan Goetting1,
Chaoyi Xu2, Haozhe Chen5, Yuxi Qian6, Yiran Geng2, Jiageng Mao3, Weikang Wan2, Mingtong Zhang3,
Jiangran Lyu2, Siheng Zhao3, Jiazhao Zhang2, Jialiang Zhang1,2, Chengyang Zhao7, Haoran Lu2,
Yufei Ding1,2, Ran Gong8, Yuran Wang2, Yuxuan Kuang2,3, Ruihai Wu2, Baoxiong Jia9, Carlo Sferrazza1,
Hao Dong2, Siyuan Huang9†, Yue Wang3†, Jitendra Malik1†, Pieter Abbeel1†

1UC Berkeley 2PKU 3USC 4UMich 5UIUC 6Stanford 7CMU 8UCLA 9BIGAI
* equal contribution † equal advising Correspondence to: Haoran Geng <ghr@berkeley.edu>

Abstract

Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces unique challenges in scaling data and establishing reliable evaluation protocols. Collecting real-world robotic data is resource-intensive and inefficient, while benchmarking in real-world scenarios remains highly complex. Synthetic data and simulation offer promising alternatives, yet existing efforts often fall short in data quality, diversity, and benchmark standardization. To address these challenges, we introduce RoboVerse, a comprehensive framework comprising a simulation platform, a synthetic dataset, and unified benchmarks. Our simulation platform supports multiple simulators and robotic embodiments, enabling seamless transitions between different environments. The synthetic dataset, featuring high-fidelity physics and photorealistic rendering, is constructed through multiple approaches, including migration from public datasets, policy rollout, and motion planning, and is enhanced by data augmentation. Additionally, we propose unified benchmarks for imitation learning and reinforcement learning, enabling consistent evaluation across different levels of generalization. At the core of the simulation platform is MetaSim, an infrastructure that abstracts diverse simulation environments into a universal interface. It restructures existing simulation environments into a simulator-agnostic configuration system, together with an API that aligns different simulator functionalities, such as launching simulation environments, loading assets with initial states, and stepping the physics engine. This abstraction ensures interoperability and extensibility. Comprehensive experiments demonstrate that RoboVerse enhances the performance of imitation learning, reinforcement learning, and world model learning, improving sim-to-real transfer.
These results validate the reliability of our dataset and benchmarks, establishing RoboVerse as a robust solution for advancing simulation-assisted robot learning. Code and dataset can be found at: https://roboverseorg.github.io/.


Figure 1: RoboVerse comprises a scalable simulation platform, a large-scale synthetic dataset, and unified benchmarks. The simulation platform supports seamless integration of new tasks and demonstrations through unified protocols, ensuring flexibility and extensibility. The dataset includes over 1,000 diverse tasks and more than 10 million transitions, constructed through large-scale data migration, cross-embodiment transfer, and robust augmentation and randomization.

I Introduction

Large-scale datasets, combined with well-established benchmarks, have fueled rapid advancements in natural language processing (NLP) [97, 5] and computer vision (CV) [23, 59, 57, 99, 70, 43]. Specifically, large-scale data provides ample training examples that bolster learning, while uniform benchmarks enable standardized evaluation and fair comparison across different methods. However, replicating these successes in robotics remains challenging due to the difficulty of collecting high-quality, diverse data and the lack of widely recognized evaluation protocols.

Real-world approaches [15, 54] to constructing datasets and benchmarks, though authentically reflecting the complexities of operational environments, face significant practical constraints. First, collecting demonstrations is time-consuming and resource-intensive, and the resulting data is often hardware-dependent or modality-specific, limiting its adaptability to new scenarios. Additionally, establishing standardized and widely applicable benchmarks is inherently challenging since reproducing identical conditions for fair comparisons is nearly impossible. For instance, object placements can vary across rollouts, ambient lighting fluctuates under natural sunlight, and background environments may change. Consequently, scaling real-world datasets, evaluating policies, and iterating development in real-world scenarios remain cost-prohibitive and difficult to standardize.

Simulators, on the other hand, present a promising alternative for large-scale dataset and benchmark construction. By providing efficient computation, synthetic assets, and omniscient information in reproducible settings, they enable cost-effective dataset construction and consistent performance evaluation. Recent works, exemplified by [138, 50, 10, 33, 102, 127, 73, 62, 124, 138, 119, 64, 65, 95], have demonstrated the potential of simulation-based methods in various robotic tasks. Despite these advantages, several challenges impede the broader adoption of synthetic datasets and benchmarks. First, utilizing simulators often demands considerable expertise due to both the complexity of simulator design and the relative immaturity of many platforms, which complicates the data construction process. Second, simulators vary widely in their internal architectures and external interfaces, making it laborious to transfer data and models or adapt workflows from one to another. Consequently, reusing existing synthetic datasets and benchmarks is difficult, resulting in a fragmented ecosystem that further hinders convenient construction and effective use of large-scale data in simulation environments.

To fully harness the potential of simulation in robotics, we introduce RoboVerse, which comprises a scalable simulation platform that unifies existing simulators under a standardized format and a single infrastructure, a large-scale synthetic dataset, and unified benchmarks. To achieve this, we first propose MetaSim, the core infrastructure of RoboVerse. Through careful design, MetaSim establishes a universal configuration system for agents, objects, sensors, tasks, and physics parameters while exposing a simulator-agnostic interface for simulation setup and control. This architecture enables seamless integration of tasks, assets, and robot trajectories from diverse simulation environments with minimal adaptation effort. MetaSim provides three key capabilities: (1) Cross-Simulator Integration: Enables seamless switching between different simulators, fostering unified benchmarking and facilitating the transfer of environments and demonstrations across platforms. (2) Hybrid Simulation: Combines the strengths of multiple simulators (such as pairing advanced physics engines with superior renderers) to generate scalable and high-quality synthetic data. (3) Cross-Embodiment Transfer: Allows the retargeting of trajectories across various robot arms with parallel grippers, maximizing dataset reuse from heterogeneous sources.

MetaSim enables RoboVerse to systematically enhance the workflow for building and scaling simulation environments and datasets. Our method features:

Leveraging these workflows in RoboVerse, we construct the largest and most diverse high-quality synthetic dataset and benchmark to date, all in a unified format. This dataset includes ∼500k unique, high-fidelity trajectories covering 276 task categories and ∼5.5k assets. Additionally, we generate over 50 million high-quality state transitions to support policy learning.

Beyond dataset and benchmark construction, we explore the potential of RoboVerse through extensive experiments on imitation learning (Sec. VI-B), reinforcement learning (Sec. VI-C), and world model learning (Sec. VI-E). Our results demonstrate that RoboVerse enables reliable policy learning and evaluation, supports strong sim-to-sim (Sec. VI-G) and sim-to-real (Sec. VI-F) transfer via high-fidelity physics and rendering, and facilitates efficient data expansion through teleoperation (Sec. IV-C), trajectory augmentation (Sec. IV-D1), domain randomization (Sec. IV-D2), and generative models (Sec. IV-C). These findings highlight the framework’s robustness, scalability, and real-world applicability.

II-A Robotics Simulators

Advancements in computer graphics have contributed to the development of high-fidelity simulators, which are widely used in robotics research and development. CoppeliaSim [101], Bullet [16], and MuJoCo [114] provide accurate physics simulations and are extensively utilized in applications such as reinforcement learning and robotic benchmarking [3, 129, 90, 14]. More simulators have been developed to fully exploit parallelism for better efficiency. Isaac Gym [75], Isaac Sim [88], SAPIEN [37, 112], MuJoCo MJX [114, 135], and Genesis [2] utilize GPU power for enhanced performance, enabling large-scale reinforcement learning and efficient data collection, significantly improving training speed and scalability. Some simulators focus on bridging the simulation-reality gap (Sim-to-Real Gap), incorporating technologies including ray-tracing and customized renderers for photo-realistic rendering [88, 112]. Furthermore, Isaac Sim [88] and Genesis [2] offer high-fidelity soft-body and liquid simulation, expanding the scope of realistic robotic interactions. RoboVerse proposes a unified platform that supports multiple simulators, facilitating seamless transitions between them and enabling hybrid integration to utilize the strengths of each simulator.

II-B Large-Scale Robotics Dataset

The scarcity of large-scale, high-quality, and diverse datasets in the robotics community has long been recognized. Several works have shown the possibility of collecting demonstration data directly on real robots. RoboNet [20] is a large-scale manipulation dataset containing roughly 162k trajectories from multiple robot platforms. DROID [54] has collected over 76k contact-rich robotic manipulation demonstrations across 86 tasks. RH20T [28] proposed a dataset with over 100k demonstrations and 147 tasks. At the same time, RT-1 [4] set the record further to 130k demonstrations on over 700 tasks. Recently, Open X-Embodiment [15] has demonstrated a promising approach to unite the community’s efforts, collecting over 1M trajectories on 160,266 tasks with 22 different embodiments. At this stage, real-world datasets became difficult to scale up due to the proportional effort and cost required to collect more demonstrative trajectories.

Simulation-based data collection provides a promising solution to the high cost and inefficiencies of real-world datasets. Hussing et al. [46] proposed a dataset containing 256M transitions on 256 tasks for offline compositional reinforcement learning. RoboCasa [85] introduced a dataset of 100 tasks and over 100k trajectories for generalist robots. DexGraspNet-2.0 [137] has collected over 400M demonstrations for dexterous grasping. Despite these efforts, synthetic datasets often exist in disparate simulators, leading to a fragmented ecosystem with limited diversity and quality. Moreover, simulation-based data often fails to capture complex physics and diverse task variations found in the real world [66, 26], potentially causing overfitting to specific simulators and hampering generalization to real-world scenarios.

RoboVerse provides a unified solution for large-scale, high-quality, and diverse synthetic data. It enables agents to train on a large set of environments and simulators to reduce overfitting, thereby improving the robustness of the learned policies.

II-C Benchmarking in Robotics

Benchmarking remains a critical yet highly challenging problem in the robotics community. Compared to supervised learning tasks, it is relatively difficult to evaluate the performance of a robotics model. Meta-World [134] is an early attempt in multi-task benchmarking. This is followed by RLBench [48], BEHAVIOR-1K [63], Habitat [111], and ManiSkill [84, 37, 112, 107], covering a large variety of robotic tasks. Grutopia [120] and InfiniteWorld [100] make a leap toward general-purpose robot benchmarking.

Despite significant efforts dedicated to these benchmarks, it is not guaranteed that the results are reproducible across different benchmarks. The uncertainty comes from multiple aspects including simulation accuracy, rendering style and asset properties [66, 26]. To address these challenges, RoboVerse enables researchers to evaluate their policies across multiple benchmarks and simulators seamlessly, without familiarizing themselves with each one individually.

III Infrastructure: MetaSim

III-A MetaSim Overview


Figure 2: RoboVerse consists of a simulation platform, a large-scale, high-quality dataset, and unified benchmarks. At the core of the simulation platform is MetaSim, the infrastructure of RoboVerse. Powered by MetaSim, the simulation platform facilitates dataset creation and benchmark construction.


Figure 3: MetaSim provides a universal configuration system, aligned simulator backends, and a Gym [115] environment wrapper. This three-layer architecture abstracts simulation environments into simulator-agnostic specifications and aligns simulator backends, enabling three key capabilities: cross-simulator integration, hybrid simulation and cross-embodiment transfer. Based on MetaSim, we build a pipeline to collect tasks, assets and trajectories from diverse public sources in a unified format, employ data augmentation methods, and ultimately generate a large-scale high-quality dataset along with unified benchmarks. This data pipeline forms the foundation of RoboVerse, facilitating the generation of large-scale datasets and construction of unified benchmarks.

We present MetaSim, a high-level interface above specific simulation environment implementations. It is also the core infrastructure of RoboVerse. As illustrated in Fig. 2, MetaSim empowers the RoboVerse simulation platform, allowing for the generation of a large-scale high-quality dataset, as well as the construction of a unified benchmark.

III-B MetaSim Implementation

As illustrated in Fig. 3, MetaSim employs a three-layer architecture comprising a universal configuration system, a simulator-agnostic interface, and a user-friendly environment wrapper. The universal configuration system unifies the specification of a simulation scenario and ensures a consistent format across simulators. The simulator-agnostic interface interprets these specifications and translates them into simulator-specific commands, thereby aligning different simulator backends. In addition, the environment wrapper encapsulates the simulator-agnostic interface into a standardized learning environment, such as a Gym [115] environment. We describe each layer in more detail in the following sections.

III-B1 Universal Configuration System


Figure 4: The MetaConfig is a nested dataclass that abstracts the core components in any simulation environment in a simulator-agnostic way.

A typical simulation environment comprises agents, objects, tasks, sensors, and physics parameters. They collectively define who performs the actions (agents), what the environment looks like (objects), what the agents should do (tasks, including instructions, success metrics, and rewards), how the environment is perceived and measured (sensors), and the governing physical laws (physics parameters). Ideally, these components should be simulator-agnostic, requiring a unified standard of simulation scenarios. Such a standard would enable researchers to work across different simulators seamlessly and integrate existing efforts from the community through cross-simulation.

Based on this principle, we design a configuration system, MetaConfig, to abstract simulation scenarios in a simulator-agnostic way. As illustrated in Fig. 4, MetaConfig is a nested dataclass that contains the above-mentioned core components. It can be interpreted by different simulator backends to build the corresponding simulation. Additionally, MetaConfig supports optional simulator-specific hyperparameters (e.g., solver type), allowing users to fully leverage the unique features of each simulator through customization.
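As a rough illustration of what such a nested dataclass could look like, the following sketch groups the five core components and an optional per-simulator override table. Every class name, field name, and file path here is hypothetical and chosen only for illustration; the actual MetaConfig layout is the one shown in Fig. 4 and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# All names below are hypothetical; they only mirror the component list
# (agents, objects, tasks, sensors, physics) given in the text.

@dataclass
class AgentCfg:
    name: str
    urdf_path: str  # illustrative path, not a real asset

@dataclass
class ObjectCfg:
    name: str
    mesh_path: str
    init_pos: Tuple[float, float, float] = (0.0, 0.0, 0.0)

@dataclass
class SensorCfg:
    name: str
    width: int = 256
    height: int = 256

@dataclass
class TaskCfg:
    instruction: str
    episode_length: int = 250

@dataclass
class PhysicsCfg:
    dt: float = 1.0 / 60.0
    gravity: Tuple[float, float, float] = (0.0, 0.0, -9.81)

@dataclass
class MetaConfigSketch:
    agents: List[AgentCfg]
    objects: List[ObjectCfg]
    sensors: List[SensorCfg]
    task: TaskCfg
    physics: PhysicsCfg = field(default_factory=PhysicsCfg)
    # Optional simulator-specific hyperparameters, keyed by backend name.
    sim_overrides: Dict[str, Dict[str, object]] = field(default_factory=dict)

    def overrides_for(self, backend: str) -> Dict[str, object]:
        """Return the overrides for one backend, or an empty dict if none."""
        return self.sim_overrides.get(backend, {})

cfg = MetaConfigSketch(
    agents=[AgentCfg("franka", "robots/franka.urdf")],
    objects=[ObjectCfg("cube", "assets/cube.obj", (0.4, 0.0, 0.02))],
    sensors=[SensorCfg("front_camera")],
    task=TaskCfg("pick up the cube"),
    sim_overrides={"mujoco": {"solver": "Newton"}},
)
```

Keeping the overrides in a separate table means the shared fields stay identical across backends, while a backend that does not recognize a key can simply ignore it.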

III-B2 Aligned Simulator Backends

Different simulators have their own implementations and specializations. However, routine operations (such as initializing a scene, loading objects, stepping the physics engine, retrieving observations, managing time, and determining success states) tend to follow similar patterns. To standardize these shared operations, we create a unified interface through a Handler class. Each simulator has its own handler implementing this interface. The Handler class provides common methods such as launch(), get_states(), and set_states(), spanning the whole lifecycle of simulating a task. The usage of the APIs is illustrated in Code III-B2. More information is provided in the supplementary materials.


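    class Env:
        """Gym-style environment that delegates every call to a Handler."""

        def __init__(self, handler):
            self.handler = handler
            self.handler.launch()

        def reset(self):
            # Restore the initial states, then read them back as the first observation.
            self.handler.set_states()
            states = self.handler.get_states()
            return get_observation(states), self.handler.get_extra()

        def step(self, action):
            # Apply the action, advance the physics, and query the new states.
            self.handler.set_states(action=action)
            self.handler.step()
            states = self.handler.get_states()
            # get_observation, get_reward, etc. are task-specific helpers.
            return (
                get_observation(states),
                get_reward(states),
                get_success(states),
                get_termination(states),
                get_time_out(states),
                self.handler.get_extra(),
            )

        def render(self):
            return self.handler.render()

        def close(self):
            self.handler.close()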

Pseudocode for gym.Env implementation. Each method of gym.Env is implemented by calling the corresponding methods of the Handler class.

III-B3 User-Friendly Environment Wrapper

Gym [115] is a widely adopted paradigm in reinforcement learning and robotics, in which the gym.Env class is fundamental to building learning environments. We define a wrapper to easily transform a Handler into an environment equipped with Gym APIs (step(), reset(), render(), and close()). As shown in Code III-B2, these methods are implemented by leveraging the underlying Handler methods.
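To make the handler-plus-wrapper pattern concrete, here is a minimal runnable sketch with a toy backend. The abstract method set is a subset of the methods named in the text; `ToyHandler`, `ToyEnv`, the 1-D point dynamics, and the `goal` parameter are assumptions introduced only for this example and are not part of MetaSim.

```python
from abc import ABC, abstractmethod

class Handler(ABC):
    """Simulator-agnostic interface (subset of the methods named in the text)."""

    @abstractmethod
    def launch(self): ...

    @abstractmethod
    def set_states(self, states=None, action=None): ...

    @abstractmethod
    def step(self): ...

    @abstractmethod
    def get_states(self): ...

    @abstractmethod
    def close(self): ...

class ToyHandler(Handler):
    """Hypothetical backend: a 1-D point that moves by `action` on each step."""

    def __init__(self):
        self.x = 0.0
        self._pending = 0.0
        self.launched = False

    def launch(self):
        self.launched = True

    def set_states(self, states=None, action=None):
        if states is not None:
            self.x = float(states["x"])
        elif action is None:
            self.x = 0.0  # no arguments: restore the initial state
        self._pending = float(action) if action is not None else 0.0

    def step(self):
        self.x += self._pending

    def get_states(self):
        return {"x": self.x}

    def close(self):
        self.launched = False

class ToyEnv:
    """Gym-like wrapper that only talks to the Handler interface."""

    def __init__(self, handler, goal=1.0):
        self.handler = handler
        self.goal = goal
        self.handler.launch()

    def reset(self):
        self.handler.set_states()
        return self.handler.get_states()

    def step(self, action):
        self.handler.set_states(action=action)
        self.handler.step()
        states = self.handler.get_states()
        reward = -abs(self.goal - states["x"])
        success = states["x"] >= self.goal
        return states, reward, success

    def close(self):
        self.handler.close()

env = ToyEnv(ToyHandler(), goal=1.0)
first_obs = env.reset()
obs_1, reward_1, success_1 = env.step(0.5)
obs_2, reward_2, success_2 = env.step(0.5)
env.close()
```

Because `ToyEnv` depends only on the abstract interface, replacing `ToyHandler` with another `Handler` subclass would leave the wrapper untouched, which is the property the three-layer design relies on.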

III-C MetaSim Capabilities

MetaSim offers the following three key capabilities.

III-C1 Cross-Simulator Integration

This capability allows seamless switching between different simulators, so that tasks and trajectories from one simulator can be used in others. It enables efficient task and trajectory integration, unified benchmark construction, and sim-to-sim transfer for reinforcement learning training. For example, tasks from Meta-World [134] can be used by Isaac Gym [75] for fast parallel training, after which the generated trajectories can be deployed in Isaac Sim [88] for rendering.

III-C2 Hybrid Simulation

MetaSim supports combining the physics engine of one simulator with the renderer of another at the same time, allowing users to benefit from the strengths of different simulators. Specifically, with a single command, one can pair a simulator that has a powerful renderer (e.g., Isaac Sim [88]) with a simulator that has an accurate physics engine (e.g., MuJoCo [114]) to form a more capable simulation, enabling high-quality data generation.
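One way to picture hybrid simulation is a loop in which one backend owns the dynamics and a second backend receives the resulting states purely for rendering. The sketch below assumes this state-copy design; `PhysicsBackend`, `RenderBackend`, and `HybridSim` are hypothetical toy classes (a falling ball and a text "renderer"), not MetaSim code, and the paper does not specify how the two simulators are synchronized.

```python
class PhysicsBackend:
    """Toy physics: a ball falling under gravity, integrated with Euler steps."""

    def __init__(self, dt=0.01):
        self.dt = dt
        self.state = {"z": 1.0, "vz": 0.0}

    def step(self):
        self.state["vz"] -= 9.81 * self.dt
        self.state["z"] += self.state["vz"] * self.dt

    def get_states(self):
        return dict(self.state)

class RenderBackend:
    """Toy renderer: never simulates, only draws the states it is handed."""

    def __init__(self):
        self.state = {}

    def set_states(self, states):
        self.state = dict(states)

    def render(self):
        return f"ball at z={self.state['z']:.3f}"

class HybridSim:
    """Step one backend for physics, then mirror its states into the renderer."""

    def __init__(self, physics, renderer):
        self.physics = physics
        self.renderer = renderer

    def step(self):
        self.physics.step()
        self.renderer.set_states(self.physics.get_states())
        return self.renderer.render()

sim = HybridSim(PhysicsBackend(dt=0.1), RenderBackend())
frame = sim.step()
```

The only coupling between the two backends is the state dictionary, which is why a shared state format across simulators is a prerequisite for this kind of pairing.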

III-C3 Cross-Embodiment Transfer

This capability allows trajectories to be reused across different gripper-based robot morphologies by retargeting the end-effector pose, which enables data collected from diverse robots to be integrated into a unified format.
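The retargeting idea can be sketched on a deliberately simplified case: two planar 2-link arms with different link lengths, where only the end-effector position (not the full pose) is matched. The functions `fk`, `ik`, and `retarget`, the link lengths, and the joint trajectory are all assumptions for illustration; real arms would require a full 6-DoF pose and a numerical or analytical IK solver for that robot.

```python
import math

def fk(q, links):
    """Forward kinematics of a planar 2-link arm: joint angles -> (x, y)."""
    l1, l2 = links
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return x, y

def ik(p, links):
    """Analytic inverse kinematics (one of the two elbow solutions)."""
    l1, l2 = links
    x, y = p
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0 + 1e-9:
        raise ValueError("end-effector target is out of reach for this arm")
    c2 = max(-1.0, min(1.0, c2))  # guard against rounding just outside [-1, 1]
    q2 = math.acos(c2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

def retarget(joint_traj, src_links, tgt_links):
    """Map a source joint trajectory to the target arm via end-effector positions."""
    return [ik(fk(q, src_links), tgt_links) for q in joint_traj]

src_links = (0.5, 0.4)   # hypothetical source arm
tgt_links = (0.6, 0.3)   # hypothetical target arm
src_traj = [(0.3, 0.8), (0.5, 1.0), (0.7, 1.2)]
tgt_traj = retarget(src_traj, src_links, tgt_links)
```

The joint angles differ between the two arms, but the end-effector path is the same, which is what makes the demonstration reusable across embodiments.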

IV RoboVerse Dataset

IV-A Dataset Overview

On top of MetaSim, we generate a large-scale, high-quality dataset by incorporating multiple data collection methods. Overall, there are three key data types to collect: tasks, assets, and robot trajectories. The main source of these data is migration from existing simulation environments. Beyond migration, we explore various methods to collect these data, such as using large language models to generate new tasks, leveraging the real-to-sim toolset [71] to reconstruct assets from the real world, and using teleoperation to collect new trajectories. Additionally, we leverage data augmentation methods for both trajectories and visual observations. Finally, we report statistics on the current progress of data migration in RoboVerse.

IV-B Tasks, Assets and Trajectories Collection: Migration

Leveraging the RoboVerse format and infrastructure, we seamlessly integrate a wide range of benchmarks and datasets into our system with a unified format and clean codebase. We apply the following approaches to collect tasks and demonstrations.

With the techniques mentioned above, we migrated multiple existing manipulation datasets into RoboVerse. Currently, we support ManiSkill [84, 37, 112], RLBench [48], CALVIN [82], Meta-World [134], robosuite [145], MimicGen [79], GAPartNet [34], Open6DOR [24], ARNOLD [36], LIBERO [68], SIMPLER [66], GraspNet [27], GarmentLab [72], and UniDoorManip [67].

We also integrated datasets from a wider range of embodiments, including dexterous hands, quadrupeds, and humanoids, covering tasks such as dexterous manipulation, locomotion, navigation, and whole-body control. Currently, we have migrated VLN-CE R2R [58] and RxR [60] for navigation, as well as HumanoidBench [106] and Humanoid-X [80] for locomotion and whole-body control.

RoboVerse simplifies and standardizes the migration process, and we will continue to maintain and expand it.

IV-C Tasks, Assets and Trajectories Collection: Teleoperation and Generation

IV-D Data Augmentation

IV-D1 Trajectory Augmentation

With the unified simulation interface and data format, RoboVerse enables significantly more efficient data augmentation and supports advanced augmentation techniques. Beyond the visual randomization detailed in Benchmark Protocol [8], we also provide robust trajectory space augmentation. We offer an API to generate large-scale robot trajectory datasets from a limited number of source demonstrations. Following the MimicGen [79] framework, for most tasks, we can decompose them into a sequence of object-centric subtasks $(S_1(o_{S_1}), S_2(o_{S_2}), \dots, S_M(o_{S_M}))$, where the robot’s trajectory within each subtask $S_i(o_{S_i})$ is relative to a single object’s coordinate frame ($o_{S_i} \in \mathcal{O}$, where $\mathcal{O}$ is the set of objects in the task $\mathcal{M}$). Additionally, we assume that the sequence of subtasks in each task is predefined. By leveraging this minimal human annotation regarding the order of subtasks, we can efficiently divide each source demo into contiguous object-centric manipulation segments $\{\tau_i\}_{i=1}^{M}$ (each of which corresponds to a subtask $S_i(o_i)$) using a simulator, and then generate extensive trajectory datasets for various task variants (in our case: variations in the initial and goal state distributions of objects ($D$) and robots ($R$)) using MimicGen [79]. This approach has been shown to significantly benefit generalization in imitation learning [79, 50, 121, 31, 85], particularly in scenarios where the number of source demonstrations is limited. For further details, please refer to the supplementary materials.
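The core geometric step behind object-centric augmentation can be sketched as follows: because a segment is expressed relative to one object's frame, moving the object to a new pose moves every end-effector waypoint by the same rigid transform, $T^{ee}_{new} = T^{obj}_{new} \, (T^{obj}_{src})^{-1} \, T^{ee}_{src}$. The sketch below works in the plane with poses written as $(x, y, \theta)$ for brevity; the helper names and the example poses are hypothetical, and a real pipeline would use full 3-D poses and additionally interpolate between segments.

```python
import math

def compose(a, b):
    """Apply pose a to pose b: b is given in a's frame, result is in the world frame."""
    c, s = math.cos(a[2]), math.sin(a[2])
    return (a[0] + c * b[0] - s * b[1], a[1] + s * b[0] + c * b[1], a[2] + b[2])

def inverse(a):
    """Inverse of a planar rigid transform (x, y, theta)."""
    c, s = math.cos(a[2]), math.sin(a[2])
    return (-(c * a[0] + s * a[1]), s * a[0] - c * a[1], -a[2])

def transform_segment(segment, obj_src, obj_new):
    """Re-express an object-centric segment for a new object pose."""
    delta = compose(obj_new, inverse(obj_src))
    return [compose(delta, waypoint) for waypoint in segment]

# Hypothetical source demo: two end-effector waypoints approaching an object.
obj_src = (1.0, 0.0, 0.0)
segment = [(1.0, 0.2, 0.0), (1.0, 0.1, 0.0)]
# New object pose sampled for an augmented episode.
obj_new = (2.0, 1.0, math.pi / 2)
new_segment = transform_segment(segment, obj_src, obj_new)
```

Each transformed waypoint keeps the same pose relative to the object as in the source demonstration, so one source segment can be reused for many sampled object placements.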

IV-D2 Domain Randomization

We implement domain randomization in the Isaac Sim [88] handler of MetaSim. This involves four types of randomization:

These randomization options can be freely combined. For example, a scene can include a customized table, walls with a ceiling, and a set of cylinder lights to simulate an indoor environment. For details, please refer to the supplementary materials.
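The idea that randomization options can be switched on independently and combined can be sketched with a small seeded sampler. The option groups and values below (`table`, `enclosure`, `light`) are hypothetical and loosely modeled on the indoor-scene example in the text; they are not the randomization types or option names used by the Isaac Sim handler.

```python
import random

# Hypothetical option groups; the first entry of each list is the default.
OPTIONS = {
    "table": ["default", "wood", "metal"],
    "enclosure": ["none", "walls", "walls_with_ceiling"],
    "light": ["distant", "cylinder", "sphere"],
}

def sample_scene(rng, enabled):
    """Randomize only the enabled groups; all other groups keep their default."""
    scene = {}
    for key, choices in OPTIONS.items():
        scene[key] = rng.choice(choices) if key in enabled else choices[0]
    return scene

scene_a = sample_scene(random.Random(0), enabled={"table", "light"})
scene_b = sample_scene(random.Random(0), enabled={"table", "light"})
```

Seeding the generator makes a randomized scene reproducible, which matters when the same set of randomized scenes must be reused for evaluation.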

IV-E RoboVerse Dataset

TABLE I: Migration progress statistics for manipulation tasks in RoboVerse

Source Benchmark | Source Simulator | # Task Categories | # Trajectories | # Assets
ManiSkill [84, 37, 112] | SAPIEN | 6 | 19k | 1.7k
RLBench [48] | CoppeliaSim | 80 | 150k | 100
CALVIN [82] | PyBullet | 7 | 20k | 7
MetaWorld [134] | MuJoCo | 5 | 5k | 6
RoboSuite [145] & MimicGen [79] | MuJoCo | 6 | 6k | 12
GAPartNet [34] | IsaacGym | 4 | 4k | 151
Open6DOR [24] | IsaacGym | 69 | 10k | 207
ARNOLD [36] | IsaacSim | 6 | 3k | 30
LIBERO [68] | MuJoCo | 10 | 15k | 15
Simpler [66] | SAPIEN | 6 | 30k | 52
RLAfford [35] | IsaacGym | 4 | 40k | 40
GraspNet [27] | - | 58 | 200k | 42
GarmentLab [72] | IsaacSim | 6 | 6k | 3k
UniDoorManip [67] | IsaacGym | 7 | 1k | 140
GAPartManip [18] | IsaacSim | 2 | 1.5k | 42
Total | - | 276 | 510.5k | 5.5k


Figure 8: Dataset Comparison and Gallery. Left: other representative synthetic robotics datasets. Right: the RoboVerse dataset.

IV-E1 Dataset Statistics

Manipulation Dataset

We migrate diverse manipulation datasets from existing source benchmarks [84, 37, 112, 48, 82, 134, 145, 79, 34, 24, 36, 68, 66, 35, 27, 72, 67, 18] into RoboVerse. The number of task categories, trajectories, and assets contributed by each source benchmark is summarized in Tab. I. In total, this migration results in 276 task categories, 510.5k trajectories, and 5.5k assets. Representative tasks with rich domain randomization are shown in Fig. 8.

Navigation Dataset

We migrate vision-and-language navigation (VLN) tasks into RoboVerse. Note that there exist various VLN tasks with different settings; here, we focus on VLN in continuous environments (VLN-CE) [58], as it more closely resembles real-world scenarios [11, 139, 140]. Specifically, we construct our dataset based on RoboVerse by integrating MatterPort 3D scenes [9] (90 scenes) and off-the-shelf instructions from R2R [58] (10k episodes) and RxR [60] (20k episodes). We provide two types of mobile embodiments, the Unitree Dog (a legged robot) and the JetBot (a wheeled robot), which support different control policies. A detailed elaboration on the navigation dataset is provided in the supplementary materials.

Humanoid Dataset

We migrate HumanoidBench [106] tasks for reinforcement learning benchmarks and integrate tasks, policies, and data samples from Humanoid-X [80] and SkillBlender [61]. Additionally, we re-implement the UH-1 inference pipeline within our framework. The pretrained policy successfully enables humanoid robots to follow demonstrated poses while maintaining stable locomotion across multiple simulators based on RoboVerse.

V RoboVerse Benchmark


Figure 9: Benchmark Protocol: We define a four-level generalization benchmarking protocol, allocating 90% of the data for training and 10% for generalization evaluation. From left to right, Levels 0 to 3 correspond to task space generalization, environment randomization, camera randomization, and lighting and reflection randomization, respectively.

V-A Benchmark Overview

With the collected tasks, assets, and trajectories, RoboVerse establishes standardized benchmarks for robot learning, including both imitation learning and reinforcement learning. We define a unified training and evaluation protocol within the RoboVerse platform and implement standardized baselines and learning frameworks for benchmarking. Specifically, for imitation learning, we introduce different levels of generalization benchmarks to assess the generalization capability of models.

V-B Imitation Learning Benchmark

For each imitation learning benchmark, we establish a standardized evaluation framework with a fixed set of demonstrations and a controlled evaluation environment. Policies must be trained exclusively on the provided training data and assessed within this environment to ensure fair comparison. To rigorously test generalization capability, we curate training data from specific domains and evaluate policies on unseen samples, challenging their adaptability to novel scenarios. We systematically categorize visual generalization factors into multiple levels, including task space generalization, environment setup generalization, camera setting generalization, and lighting and reflection generalization. Each level introduces controlled variations to assess a policy’s adaptability and robustness in increasingly diverse and challenging conditions.

Level 0: Task Space Generalization

We establish a controlled evaluation by standardizing the environment with consistent camera, materials, lighting, and other parameters. The task space, including object initialization and instructions, is split into 90% training and 10% validation to assess generalization within a fixed setting, as shown in Fig. 9(a).
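A 90/10 split of the task space can be made reproducible by shuffling with a fixed seed, as in the sketch below. The function name, the seed, and the representation of a task-space sample as an (object initialization, instruction) pair are assumptions for illustration; the paper does not specify how its split is drawn.

```python
import random

def split_task_space(samples, train_frac=0.9, seed=0):
    """Shuffle deterministically, then cut into training and held-out sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_frac * len(samples)))
    train = [samples[i] for i in indices[:n_train]]
    held_out = [samples[i] for i in indices[n_train:]]
    return train, held_out

# Hypothetical task-space samples: (object initialization id, instruction).
samples = [(i, f"instruction {i}") for i in range(100)]
train_set, val_set = split_task_space(samples)
```

Fixing the seed means every method is trained and evaluated on the same partition, which is what makes results comparable across baselines.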

Level 1: Environment Randomization

Building on the standardized setup, we introduce scene randomization while keeping the camera, materials, and lighting fixed [81]. By varying house, table, and ground configurations, we create diverse visual inputs to test robustness against environmental changes [51]. A fixed set of predefined randomized scenes ensures structured evaluation, as shown in Fig. 9 (b).

Level 2: Camera Randomization

To assess generalization across camera variations, we introduce different viewing heights and angles using carefully annotated, realistic camera poses. Following the 90/10 training/testing split, we ensure consistent and rigorous evaluation, as illustrated in Fig. 9 (c).

Level 3: Lighting and Reflection Randomization

Real-world environments involve diverse materials and lighting conditions [116]. To simulate these challenges, we randomize lighting and reflections, curating realistic object materials and illumination setups [19]. This enhances robustness testing under varying conditions, as shown in Fig. 9(d).

V-C Reinforcement Learning Benchmark

In addition to imitation learning, RoboVerse offers a comprehensive reinforcement learning (RL) benchmark designed to accommodate a diverse range of tasks, robot embodiments, and simulation backends. Specifically, we integrate the PPO [105] algorithm from both Stable-Baselines3 [98] and rsl_rl [102] into our MetaSim interface, enabling straightforward task definition, seamless environment switching, and standardized performance logging.

Building upon this infrastructure, we have successfully ported multiple humanoid control tasks from the HumanoidBench [106] benchmark into RoboVerse. Through our adapted interface for rsl_rl [102], we have efficiently extended framework compatibility to support the TD-MPC2 [41, 42] algorithm from the original benchmark while preserving implementation fidelity.

VI Experimental Results

VI-A Overview

We conduct extensive experiments to validate the effectiveness and practicality of RoboVerse. First, we evaluate baselines on representative tasks from various benchmark sources to ensure the reliability of the collected datasets and established benchmarks. This includes assessments of both imitation learning baselines (Sec. VI-B) and reinforcement learning baselines (Sec. VI-C).

Then we further demonstrate the strength of the high-quality synthetic dataset. We find that synthetic data could significantly boost world model learning.

VI-B Results on the Imitation Learning Benchmark

TABLE II: Baseline Results on RoboVerse Imitation Learning Benchmark. We report baseline results on representative tasks from various benchmark sources to validate the effectiveness and reliability of the RoboVerse benchmark.

Representative Task | | PickCube | StackCube | CloseBox | MoveSliderLeft | PickChocolatePudding | NutAssembly | Average
Benchmark Source | | ManiSkill | ManiSkill | RLBench | CALVIN | LIBERO | RoboSuite | -
Diffusion Policy [13] | 78M | 52.7 | 53.8 | 51.5 | 76.5 | 50.0 | 7.1 | 48.6
ACT [141] | 84M | 31.7 | 36.7 | 68.3 | 85.0 | 78.3 | 0.0 | 50.0

TABLE III: Generalization Performance on Imitation Learning Benchmark. This table presents the experimental results for each generalization level in our benchmark across different tasks and methodologies. The tasks are divided into distinct levels (Level 0, Level 1, Level 2, and Level 3) to evaluate performance under progressively challenging scenarios.

Task and Generalization Level | MoveSliderLeft Level 0 | MoveSliderLeft Level 1 | MoveSliderLeft Level 2 | MoveSliderLeft Level 3 | CloseBox Level 0 | CloseBox Level 1 | CloseBox Level 2 | CloseBox Level 3 | PickCube Level 0 | PickCube Level 1 | PickCube Level 2 | PickCube Level 3
Diffusion Policy [13] | 76.5 | 81.3 | 72.0 | 60.0 | 51.5 | 42.8 | 20.0 | 10.4 | 52.7 | 11.1 | 0.0 | 0.0
ACT [141] | 85.0 | 83.3 | 43.3 | 16.6 | 68.3 | 73.3 | 0.0 | 20.0 | 31.7 | 30.0 | 6.7 | 3.3
OpenVLA¹ [56] | 45.0 | 40.0 | 35.0 | 30.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40.0 | 15.0 | 0.0 | 0.0

¹ Due to resource and time constraints, we uniformly sample 20 testing scenarios for the OpenVLA baseline.

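Method | Simple: PickCube | Simple: MoveSliderLeft | Language-conditioned Grasping: Object Set 1 | Language-conditioned Grasping: Object Set 2 | Language-conditioned Grasping: Object Set 3
OpenVLA [56] | 40.0 | 45.0 | 46.0 | 33.3 | 14.4
Octo [89] | 50.0 | 30.0 | 42.0 | 14.4 | 2.2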

TABLE IV: Vision-Language-Action (VLA) Model Results on RoboVerse Imitation Learning Benchmark. Constrained by time and resources, we report VLA models’ results on two simple tasks from RoboVerse and on grasping tasks with diverse and challenging language instructions. We split 58 objects in GraspNet into three sets, each containing progressively more challenging objects based on their geometry.

VI-B1 Baseline and Task Selection

To faithfully reflect the data quality of the RoboVerse dataset and provide a standard benchmark for all kinds of imitation learning policy models, we select both prevailing specialist and generalist models as baselines for our RoboVerse benchmark. Specifically, for specialist models, we integrate ACT [141] and Diffusion Policy [13]. For generalist models, we benchmark OpenVLA [56] and Octo [89], both of which we fine-tune on our synthetic dataset. ACT is one of the most widely used methods in bi-manual manipulation. Diffusion Policy [13] is the first work that applies the conditional denoising diffusion process as a robot visuomotor policy and achieves great generalization capabilities.

Leveraging the RoboVerse format and infrastructure design, we are able to evaluate models on different tasks within a unified platform. To fully test policy models’ performance under versatile settings, we select one representative task from each of the source benchmarks integrated by the RoboVerse dataset as shown in Tab. II. The experiment subset includes PickCube and StackCube from ManiSkill [84], CloseBox from RLBench [48], MoveSliderLeft from CALVIN [82], PickChocolatePudding from LIBERO [68], and NutAssembly from robosuite [145]. These tasks not only demand precise pick-and-place skills but also require contact-rich physical interactions with articulated objects. Through these tasks, the benchmark results can provide a comprehensive reflection of each model’s performance under different scenarios.

VI-B2 Implementation Details

Due to time and resource constraints, we implement specialist and generalist models using different strategies, and all the results are obtained under the single-task setting. The training and evaluation settings follow the 90/10 RoboVerse benchmark protocol as specified in Sec. V-B. During evaluations, we randomly select ten task settings from the training sets and another ten from the validation sets. The reported success rates are computed as the averages over three random seeds.

For each step, the inputs are 256 × 256 × 3 RGB images and a short language description depending on the task settings. For specialist models, we train from scratch with actions in the 9-dim robot joint state space. For generalist models, the action is pre-processed from absolute end-effector position space into delta end-effector position space, and the gripper action is discretized into binary values {0, +1}. Owing to limited time and resources, we are only able to fine-tune the generalist models in the single-task setting. During evaluations, we employ cuRobo [110] as the inverse-kinematics solver to transform the action into robot joint state space. Specific model implementation details and hyperparameters are provided in the supplementary materials.
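The action pre-processing for generalist models can be illustrated with a small sketch. It is not the authors' code: the function name `to_delta_actions`, the `open_threshold` value, and the convention that 1 means "open" are assumptions made for illustration, and only the positional part of the end-effector action is handled.

```python
def to_delta_actions(ee_positions, gripper_widths, open_threshold=0.04):
    """Convert absolute end-effector positions into per-step deltas.

    ee_positions: list of (x, y, z) absolute positions, one per timestep.
    gripper_widths: list of continuous gripper openings, one per timestep.
    open_threshold: hypothetical cutoff used to binarize the gripper signal.
    Returns one action per transition: (dx, dy, dz, gripper), gripper in {0, 1}.
    """
    actions = []
    for t in range(len(ee_positions) - 1):
        cur, nxt = ee_positions[t], ee_positions[t + 1]
        delta = tuple(n - c for c, n in zip(cur, nxt))
        # Binarize using the gripper opening reached at the next timestep.
        gripper = 1 if gripper_widths[t + 1] > open_threshold else 0
        actions.append(delta + (gripper,))
    return actions


# Three timesteps give two transitions.
actions = to_delta_actions(
    [(0.0, 0.0, 0.0), (0.5, 0.0, 0.25), (0.5, 0.5, 0.25)],
    [0.08, 0.08, 0.0],
)
```

At evaluation time the predicted deltas would be accumulated back into target poses before an inverse-kinematics solver maps them to joint space; that step is omitted here.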

VI-B3 Experiment Results

We present the imitation learning benchmark results in Tab. II and the generalization evaluation in Tab. III. We further fine-tune large vision-language-action models on both simple and complex language-conditioned tasks, as shown in Tab. IV.

VI-C Results on the Reinforcement Learning Benchmark

Using Stable-Baselines3 [98] and rsl_rl [102] implementations of PPO, we train policies on tasks from IsaacLab [83] under consistent hyperparameters.

For additional tasks (humanoid, dexterous hand), the same PPO-based workflow applies. We successfully migrate HumanoidBench [106] from MuJoCo to RoboVerse, enabling training across multiple simulators (Isaac Sim and MuJoCo) with consistent interfaces. Experimental results demonstrate stable policy convergence across simulators, achieving performance comparable to native MuJoCo baselines. Leveraging the generalizability of rsl_rl [102], we further extend the benchmark to support the TD-MPC2 [41, 42] algorithm, which exhibits robust training dynamics in all environments. For implementation details, reward curves, and extended experimental results, please refer to the supplementary materials.

VI-D Augmentation Experiments

To verify the effectiveness of our trajectory augmentation API, we compare, on four representative tasks and under the imitation learning setting, the success rates of Diffusion Policy trained on 50 source demonstrations with those of Diffusion Policy trained on 200, 1000, and 3000 generated augmentation demonstrations. The results presented in Fig. 10 show a consistent improvement in model performance as the amount of generated data increases, highlighting both the effectiveness and scalability of the trajectory augmentation API.

Figure 10: Effectiveness of Trajectory Augmentation. Success rates of policy trained with augmented dataset and source dataset.

VI-E World Model Learning

Figure 11: Ablation Study of Action-conditioned World Model Learning. We compare the qualitative results of an action-conditioned world model trained on pure DROID and DROID-RoboVerse datasets, with evaluations sampled from the DROID dataset.

Recent advances in general-purpose video generation and interactive world models [113, 6] have shown promising progress. Yet the scarcity of large-scale robotic datasets still impedes the development of robust world models for a wide range of robotic applications. In this section, we demonstrate how synthetic data from the RoboVerse simulation can augment real-world datasets to train more capable robotics world models.

When a model is trained exclusively on 50,000 episodes from the DROID dataset [54], it generally respects action conditions but struggles to accurately capture physical interactions between the gripper and target objects. Notably, the objects appear “warped” during contact with the gripper, as shown in Fig. 11. By incorporating an additional 50,000 synthetic episodes from RoboVerse to create a combined dataset of 100,000 episodes, the model predictions improve with regard to preserving object geometry. However, merely “watching videos” remains insufficient for learning the intricate physical interactions in DROID.

In contrast, when training solely on the RoboVerse-50K dataset or on the DROID-RoboVerse-100K dataset and then validating on RoboVerse samples, we observe that the generated frames are physically more realistic in most scenes; details are given in the supplementary materials. This improvement can be attributed to the extensive randomization and augmentation available in RoboVerse. Conversely, a model trained solely on DROID data fails to transfer effectively to RoboVerse scenes. We hypothesize that this shortcoming stems from the limited number of samples per scene in DROID and from incomplete gripper visibility in the camera view.

VI-F Imitating the RoboVerse Dataset Enables Direct Sim-to-Real Transfer

Figure 12: Sim-to-Real and Sim-to-Sim-to-Real Experiment Results. We demonstrate that learning within the RoboVerse framework enables seamless direct Sim-to-Real transfer for manipulating unseen objects in new environments (imitation learning) and Sim-to-Sim-to-Real transfer for whole-body humanoid control (reinforcement learning).

Figure 13: Generalization of Sim-to-Sim-to-Real. This figure shows the in-the-wild generalization ability of our lower-body RL policy with upper-body PD control by the sim-to-sim-to-real approach.

The RoboVerse system seamlessly integrates a powerful physics engine with a high-quality renderer, ensuring the generation of realistic, high-fidelity data. To demonstrate its potential, we conduct experiments validating its effectiveness in direct sim-to-real transfer. As shown in Fig. 12, we fine-tune OpenVLA [56] on the RoboVerse dataset and transfer the learned policy to real-world scenarios without additional fine-tuning. The model successfully manipulates unseen objects in previously unseen real-world environments, showcasing the robustness and generalization capabilities of our system. The quantitative results on more challenging language-guided tasks, as shown in Tab. V, further demonstrate the high success rate of models trained on the RoboVerse dataset. Additional details are provided in the supplementary materials.

TABLE V: Direct Sim-to-Real. We fine-tune two baseline models using demonstrations adapted from GraspNet [27] to validate the effectiveness of the RoboVerse dataset. The final performance score for each task is reported, where a baseline receives 1 point for successfully grasping the target. Additionally, we adopt the partial reward scheme from OpenVLA [56], awarding 0.5 points when the gripper makes contact with the target.

GraspNet Objects | Pick up Wash Soap | Lift Mouth Rinse | Grasp Green Dish
Octo [89] | 5.0/10.0 | 3.0/10.0 | 6.0/10.0
OpenVLA [56] | 7.0/10.0 | 8.0/10.0 | 5.0/10.0
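The scoring rule described in the caption of Tab. V (1 point for a successful grasp, 0.5 points for contact only) can be written down as a short sketch. The function names and the trial log below are hypothetical, and the assumption that each task is evaluated over ten trials is inferred from the "/10.0" denominators.

```python
def trial_score(grasped, touched):
    """Score one real-world trial: 1.0 for a successful grasp,
    0.5 if the gripper only made contact with the target, else 0.0."""
    if grasped:
        return 1.0
    return 0.5 if touched else 0.0


def task_score(trials):
    """Sum trial scores; `trials` is a list of (grasped, touched) booleans."""
    return sum(trial_score(g, t) for g, t in trials)


# Hypothetical outcome log for 10 trials: 6 grasps, 2 contacts only, 2 misses.
trials = [(True, True)] * 6 + [(False, True)] * 2 + [(False, False)] * 2
print(f"{task_score(trials)}/{float(len(trials))}")  # 7.0/10.0
```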

VI-G Reinforcement Learning in RoboVerse Enables Sim-to-Sim-to-Real Transfer

Large-scale parallel environments offer significant potential for large-scale exploration and are highly effective for reinforcement learning (RL) tasks. However, while they provide excellent efficiency, their accuracy may be limited in certain scenarios [25]. To address this problem, sim-to-sim evaluation and fine-tuning present promising solutions [66]. As shown in Fig. 13, the RoboVerse platform seamlessly supports such functionalities, enabling robust sim-to-sim and sim-to-real transitions. We further demonstrate the effectiveness of sim-to-sim-to-real generalization through comprehensive experiments, highlighting the platform’s ability to bridge simulation and real-world performance.

VII Limitations

While RoboVerse provides a comprehensive and scalable platform, several limitations remain. First, the integration of a unified format for non-rigid objects is not yet fully supported, which we leave for future work. Additionally, while our large-scale dataset presents significant potential for pretraining a foundation model, this exploration falls beyond the scope of this paper due to resource constraints. Furthermore, despite our extensive efforts to fully reimplement and optimize all baseline methods within RoboVerse, some implementations may still be suboptimal. Our primary goal is not to directly compare policy performance but to demonstrate that the system is comprehensive, supports diverse policies, and ensures strong alignment between simulation and real-world performance. While we have made every effort to build a robust platform, it is inevitable that some oversights or errors may remain. We encourage the broader research community to contribute to maintaining and refining the baselines, fostering collaboration to further enhance the platform’s capabilities.

Acknowledgement

We thank Hanyang Zhou and Sicheng He for providing valuable suggestions for setting up robotics hardware. We thank Yufeng Chi and Sophia Shao for providing humanoid robots for testing. We thank Jie Yang and Muzhi Han for valuable discussion. We thank Koushil Sreenath for insightful feedback. We thank Jiawei Yang, Sumeet Batra, and Gaurav Sukhatme for their generous help. Pieter Abbeel holds concurrent appointments as a professor at UC Berkeley and as an Amazon Scholar. This paper describes work performed at UC Berkeley and is not associated with Amazon.

References

Contents
  1. I Introduction
  2. II Related Work
    1. II-A Robotics Simulators
    2. II-B Large-Scale Robotics Dataset
    3. II-C Benchmarking in Robotics
  3. III Infrastructure: MetaSim
    1. III-A MetaSim Overview
    2. III-B MetaSim Implementation
      1. III-B1 Universal Configuration System
      2. III-B2 Aligned Simulator Backends
      3. III-B3 User-Friendly Environment Wrapper
    3. III-C MetaSim Capabilities
      1. III-C1 Cross-Simulator Integration
      2. III-C2 Hybrid Simulation
      3. III-C3 Cross-Embodiment Transfer
  4. IV RoboVerse Dataset
    1. IV-A Dataset Overview
    2. IV-B Tasks, Assets and Trajectories Collection: Migration
    3. IV-C Tasks, Assets and Trajectories Collection: Teleoperation and Generation
    4. IV-D Data Augmentation
      1. IV-D1 Trajectory Augmentation
      2. IV-D2 Domain Randomization
    5. IV-E RoboVerse Dataset
      1. IV-E1 Dataset Statistics
  5. V RoboVerse Benchmark
    1. V-A Benchmark Overview
    2. V-B Imitation Learning Benchmark
    3. V-C Reinforcement Learning Benchmark
  6. VI Experimental Results
    1. VI-A Overview
    2. VI-B Results on the Imitation Learning Benchmark
      1. VI-B1 Baseline and Task Selection
      2. VI-B2 Implementation Details
      3. VI-B3 Experiment Results
    3. VI-C Results on the Reinforcement Learning Benchmark
    4. VI-D Augmentation Experiments
    5. VI-E World Model Learning
    6. VI-F Imitating the RoboVerse Dataset Enables Direct Sim-to-Real Transfer
    7. VI-G Reinforcement Learning in RoboVerse Enables Sim-to-Sim-to-Real Transfer
  7. VII Limitations
  8. VIII Simulators Overview
  9. IX The MetaSim Framework
    1. IX-A Architecture Overview
    2. IX-B MetaConfig Configuration System
    3. IX-C Aligned Simulation APIs
    4. IX-D Gym API Wrappers
    5. IX-E Backend Support
      1. IX-E1 IsaacSim
      2. IX-E2 IsaacGym
      3. IX-E3 MuJoCo
      4. IX-E4 Genesis
      5. IX-E5 SAPIEN
      6. IX-E6 PyBullet
    6. IX-F Hybrid Simulation Implementation
  10. X Asset Conversion
  11. X-A Asset types
  12. X-B Conversion Pipeline
    1. X-B1 MJCF to URDF conversion
    2. X-B2 URDF to USD conversion
  13. XI Task and Data Migration
  14. XI-A ManiSkill
  15. XI-B RLBench
  16. XI-C CALVIN
  17. XI-D Meta-World
  18. XI-E Open6DOR
  19. XI-F ARNOLD
  20. XI-G robosuite & MimicGen
  21. XI-H SimplerEnv
  22. XI-I GAPartNet
  23. XI-J GAPartManip
  24. XI-K GraspNet-1B
  25. XI-L GarmentLab
  26. XI-M UniDoorManip
  27. XI-N RLAfford
  28. XI-O LIBERO
  29. XII Task Generation
  30. XII-A Robot & Object Generation Protocol
  31. XIII Teleoperation
  32. XIII-A Keyboard
  33. XIII-B Smartphone
  34. XIII-C Others
  35. XIV Real2Sim Toolset for Asset and Task Generation
  36. XIV-A Overview
  37. XIV-B Components
    1. XIV-B1 Gaussian Splatting Reconstruction
    2. XIV-B2 Mesh Reconstruction
    3. XIV-B3 Loading the URDF into the Simulation Environment
    4. XIV-B4 Real-to-Sim boost Sim-to-Real Performance
  38. XIV-C Limitations and Challenges.
  39. XV Domain Randomization
  40. XV-A Scene Randomization
  41. XV-B Visual Material Randomization
  42. XV-C Light Randomization
  43. XV-D Camera Randomization
  44. XVI Navigation and Locomotion Tasks
  45. XVI-A Navigation Tasks
  46. XVI-B Humanoid Tasks
  47. XVI-C HumanoidBench
  48. XVII RoboVerse Benchmark Set up Details
  49. XVII-A Generalization Levels
  50. XVII-B RoboVerse Benchmark Protocol
  51. XVIII Policy Training Details
  52. XVIII-A Implementation Details
  53. XVIII-B Diffusion Policy
  54. XIX World Model Details
  55. XIX-A Methodology
  56. XIX-B Data Preparation
  57. XIX-C Experiments

VIII Simulators Overview

Simulators play an important role in robotics. A simulator is the womb of a robot: it is where a robot’s behaviors are trained and tested before the robot is "born" into the real world. The functionalities a simulator offers are therefore crucial to a successful robotic application. Users require different functions depending on their scenarios, whether a photorealistic task that requires accurate rendering of a close-to-real virtual world or a massively parallel scene designed for efficient reinforcement learning. All of these requirements influence the choice of simulator. To reduce the effort users must spend familiarizing themselves with each new simulator, we incorporate multiple simulators into the RoboVerse platform and list the specifications of the currently supported simulators in Tab. VI.

Simulator | Physics Engine | Rendering | Sensor Support | Dynamics | GPU | Open
SAPIEN [125] | PhysX-5, Warp | Rasterization; RayTracing | RGBD; Force; Contact | Rigid; Soft; Fluid | |
PyBullet [16] | Bullet | Rasterization | RGBD; Force; IMU; Tactile | Rigid; Soft; Cloth | |
MuJoCo [114] | MuJoCo | Rasterization | RGBD; Force; IMU; Tactile | Rigid; Soft; Cloth | |
CoppeliaSim [101] | MuJoCo; Bullet; ODE; Newton; Vortex | Rasterization | RGBD; Force; Contact | Rigid; Soft; Cloth | |
Isaac Sim [88] | PhysX-5 | RayTracing | RGBD; Lidar; Force; Effort; IMU; Contact; Proximity | Rigid; Soft; Cloth; Fluid | |
Isaac Gym [75] | PhysX-5, Flex | Rasterization | RGBD; Force; Contact | Rigid; Soft; Cloth | |
Genesis [2] | Genesis | Rasterization; RayTracing | RGBD; Force; Tactile | Rigid; Soft | |

TABLE VI: Comparison of Physics Simulators [108]. The column GPU denotes whether the simulator can use GPU-accelerated computation. The column Open denotes whether the simulator is open-source.

Due to the complexity of physics simulation and rendering, current simulators cannot yet depict the real world well enough. Our experiments reveal common issues in how current simulators handle basic physical laws. These results on fundamental conservation laws may be a pessimistic sign for direct sim-to-real transfer of more complicated robotic behaviors.

We conducted experiments on three basic conservation laws of physics in three simulators.

In the experiments for Conservation of Momentum, two rigid bodies are placed in a gravity-free environment, and their initial states are set so that they undergo an elastic collision.

In the experiments for Conservation of Angular Momentum, one or two rigid bodies are placed in the gravity-free environment, and their initial states are set to rotate. We calculate and record the overall angular momentum as the system evolves.

In the experiments for Conservation of Kinetic Energy, two rigid bodies are placed in the gravity-free environment, and their initial states are set to have a rotation-free elastic collision. This setup allows us to directly observe the conservation of kinetic energy regardless of the results of experiments on angular momentum.
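As a reference for what these experiments measure, the sketch below computes the analytic outcome of a rotation-free, head-on elastic collision and checks that momentum and kinetic energy are conserved. It is not the measurement code used for Fig. 14; the function names and the numerical values are assumptions for illustration. A simulator's logged velocities could be passed through the same `momentum`, `kinetic_energy`, and `relative_drift` helpers to quantify its deviation.

```python
def elastic_collision_1d(m1, v1, m2, v2):
    """Post-collision velocities for a head-on, perfectly elastic collision."""
    u1 = ((m1 - m2) * v1 + 2.0 * m2 * v2) / (m1 + m2)
    u2 = ((m2 - m1) * v2 + 2.0 * m1 * v1) / (m1 + m2)
    return u1, u2


def momentum(masses, velocities):
    """Total linear momentum of the system."""
    return sum(m * v for m, v in zip(masses, velocities))


def kinetic_energy(masses, velocities):
    """Total translational kinetic energy of the system."""
    return sum(0.5 * m * v * v for m, v in zip(masses, velocities))


def relative_drift(before, after):
    """Relative change of a conserved quantity; 0.0 means perfectly conserved."""
    return abs(after - before) / max(abs(before), 1e-12)


masses = (1.0, 3.0)
v_before = (2.0, -1.0)
v_after = elastic_collision_1d(masses[0], v_before[0], masses[1], v_before[1])

p_drift = relative_drift(momentum(masses, v_before), momentum(masses, v_after))
e_drift = relative_drift(kinetic_energy(masses, v_before), kinetic_energy(masses, v_after))
```

In the analytic case both drifts are zero; a nonzero drift computed from simulator output is the kind of violation the experiments report.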

From the results shown in Fig. 14, basic conservation laws are clearly not preserved in any of the three simulators. Moreover, the simulators behave differently across experimental setups, which indicates that different tasks may call for different simulators to obtain more accurate results. This highlights the need for a tool that helps users easily transfer tasks among simulators.

(a) Momentum    (b) Angular Momentum    (c) Kinetic Energy

Figure 14: Three series of experiments on conservation laws in simulators. Blue, orange and green lines are data collected from SAPIEN, Isaac Gym and PyBullet respectively.

IX The MetaSim Framework

IX-A Architecture Overview

The MetaSim framework is a unified simulation framework, as shown in Fig. 15. On the front-end side, it provides user-friendly Gym APIs as well as easy-to-use parallel environment support. On the back-end side, it supports multiple simulators to allow seamless transfer of tasks across simulators. Users only need to write a simulator-agnostic MetaConfig configuration class; the environment is then automatically instantiated with the designated back-end simulator.

Figure 15: Comparison between MetaSim and other simulation environments. Left: Other simulators and benchmarks, which use self-defined data formats, simulator-associated assets, and simulator-dependent task definitions and scripts. Right: MetaSim, which decouples all components so that they are agnostic to specific simulators or benchmark environments.

IX-B MetaConfig Configuration System

The MetaSim framework uses MetaConfig, a unified configuration class to describe a scenario in simulation environments.

We designed a configuration system that sets up the simulator, defines the tasks, and sets up domain randomization. To run the same environment settings across different simulators, the configuration system is kept as simulator-agnostic as possible. Simulator-specific settings (e.g., rendering mode, physics engine solver type) are defined in a separate simulator-specific part.

To make changing settings and debugging easier, we design the configuration system in a Hydra [128]-like way, so that each item in the configuration can be modified from the command line, just as in Hydra [128]. The configuration system is implemented with Python dataclasses and can therefore use Python type annotations to assist users.
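A dataclass-based configuration with command-line style overrides might look like the sketch below. It only illustrates the idea: the class names `ScenarioCfg` and `SimSpecificCfg`, every field name, and the helper `apply_overrides` are hypothetical and are not the MetaConfig API.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class SimSpecificCfg:
    """Simulator-specific settings (field names are hypothetical)."""
    render_mode: str = "rasterization"
    solver_type: str = "pgs"


@dataclass
class ScenarioCfg:
    """Simulator-agnostic part plus one simulator-specific sub-config."""
    task: str = "pick_cube"
    robot: str = "franka"
    num_envs: int = 1
    sim: SimSpecificCfg = field(default_factory=SimSpecificCfg)


def apply_overrides(cfg, overrides):
    """Apply `dotted.key=value` overrides in place, in the spirit of Hydra."""
    for item in overrides:
        key, raw = item.split("=", 1)
        *parents, leaf = key.split(".")
        target = cfg
        for name in parents:
            target = getattr(target, name)
        if not is_dataclass(target) or leaf not in {f.name for f in fields(target)}:
            raise KeyError(f"unknown config key: {key}")
        current = getattr(target, leaf)
        # Parse the string with the type of the current value (int, str, float).
        # Booleans would need dedicated parsing and are not handled here.
        setattr(target, leaf, type(current)(raw))
    return cfg


# Equivalent to passing `num_envs=16 sim.solver_type=tgs` on a command line.
cfg = apply_overrides(ScenarioCfg(), ["num_envs=16", "sim.solver_type=tgs"])
```

Because the fields carry type annotations, editors and type checkers can flag misspelled or mistyped settings before a simulator is launched.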

In order to run the tasks seamlessly across all simulators, it is necessary to define them in a simulator-agnostic way. We configure the task and define its object list, the robot in use, the success checker, and the reward. The success checker determines when the task has been successfully executed and is the most difficult part of task definition. For standardization, we offer structured success checker templates that cover most cases, and we leave users the option of defining a callback function for the cases the structured success checkers cannot cover.
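The distinction between a structured success checker template and a user-defined callback can be sketched as follows. The class names, the state layout, and the tolerance value are assumptions made for illustration and do not come from the MetaSim code base.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical state layout: {object_name: {"pos": (x, y, z)}}
State = Dict[str, Dict[str, Tuple[float, float, float]]]


@dataclass
class PositionReachedChecker:
    """Structured template: succeed when an object is within `tol` of a goal."""
    obj_name: str
    goal: Tuple[float, float, float]
    tol: float = 0.02

    def __call__(self, state: State) -> bool:
        pos = state[self.obj_name]["pos"]
        dist = sum((p - g) ** 2 for p, g in zip(pos, self.goal)) ** 0.5
        return dist <= self.tol


@dataclass
class CallbackChecker:
    """Escape hatch: wrap an arbitrary user-defined predicate on the state."""
    fn: Callable[[State], bool]

    def __call__(self, state: State) -> bool:
        return bool(self.fn(state))


state = {"cube": {"pos": (0.5, 0.0, 0.25)}}
reached = PositionReachedChecker("cube", goal=(0.5, 0.0, 0.25))(state)
lifted = CallbackChecker(lambda s: s["cube"]["pos"][2] > 0.2)(state)
```

Both checkers expose the same call signature, so a task definition can hold either one without the environment needing to know which kind it is.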

IX-C Aligned Simulation APIs

MetaSim supports different simulator backends, including Isaac Sim [88], Isaac Gym [75], MuJoCo [114], PyBullet [16], SAPIEN [125], and CoppeliaSim [101, 47]. The framework is implemented in Python, as these simulators either natively support Python or provide Python APIs.

Common simulator operations are unified in a Handler class. Each handler supports only three basic APIs: get_state(), set_state() and step(). The get_state() method takes a descriptive Python dict (e.g., {object_name: {’pos’: ..., ’rot’: ..., ’...’: ...}}) as input, and returns the current simulation states requested by the dict in another Python dict structured in the same manner. The set_state() method also takes a descriptive Python dict as input, and modifies the current simulation states to those included in the dict. The step() method advances the simulation by one timestep.
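The three-method interface can be sketched with an abstract base class and a toy in-memory backend. This is an illustration only: `ToyHandler`, its constant-velocity dynamics, and the `"vel"` key are hypothetical stand-ins for a real simulator backend, and the exact signatures in MetaSim may differ.

```python
from abc import ABC, abstractmethod
import copy


class Handler(ABC):
    """The three basic operations every backend is expected to provide."""

    @abstractmethod
    def get_state(self, query: dict) -> dict: ...

    @abstractmethod
    def set_state(self, state: dict) -> None: ...

    @abstractmethod
    def step(self) -> None: ...


class ToyHandler(Handler):
    """In-memory stand-in for a simulator: objects move at constant velocity."""

    def __init__(self, dt=0.5):
        self.dt = dt
        self._objects = {}

    def get_state(self, query):
        # Return only the requested fields, mirroring the structure of `query`.
        return {
            name: {key: copy.deepcopy(self._objects[name][key]) for key in keys}
            for name, keys in query.items()
        }

    def set_state(self, state):
        # Overwrite the listed fields; unlisted fields keep their values.
        for name, values in state.items():
            self._objects.setdefault(name, {}).update(copy.deepcopy(values))

    def step(self):
        # Advance every object by one timestep of constant-velocity motion.
        for obj in self._objects.values():
            vel = obj.get("vel", [0.0, 0.0, 0.0])
            obj["pos"] = [p + v * self.dt for p, v in zip(obj["pos"], vel)]


handler = ToyHandler(dt=0.5)
handler.set_state({"cube": {"pos": [0.0, 0.0, 0.0], "vel": [1.0, 0.0, 0.0]}})
handler.step()
result = handler.get_state({"cube": {"pos": None}})
```

Keeping the interface this small is what lets the same task code run on any backend that implements the three methods.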

IX-D Gym API Wrappers

To support building learning environments, we define an Env class built on top of Handler. It offers Gymnasium-like APIs (step, reset, render, and close), implementing these methods by leveraging the underlying Handler methods.
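A minimal sketch of this layering is shown below, with a Gymnasium-like `Env` that touches the simulator only through handler methods. The `PointHandler` class, the one-dimensional task, and the reward rule are hypothetical; for brevity `get_state()` here takes no query dict, unlike the interface described in Sec. IX-C.

```python
class PointHandler:
    """Minimal stand-in handler: a single point moving along one axis."""

    def __init__(self):
        self.pos = 0.0
        self.vel = 0.0

    def get_state(self):
        return {"pos": self.pos}

    def set_state(self, state):
        self.pos = state.get("pos", self.pos)
        self.vel = state.get("vel", self.vel)

    def step(self, dt=0.1):
        self.pos += self.vel * dt


class Env:
    """Gymnasium-like wrapper that only talks to the handler's three methods."""

    def __init__(self, handler, goal=1.0, max_steps=50):
        self.handler, self.goal, self.max_steps = handler, goal, max_steps
        self._t = 0

    def reset(self):
        self._t = 0
        self.handler.set_state({"pos": 0.0, "vel": 0.0})
        return self.handler.get_state(), {}

    def step(self, action):
        self.handler.set_state({"vel": float(action)})  # action = commanded velocity
        self.handler.step()
        self._t += 1
        obs = self.handler.get_state()
        terminated = obs["pos"] >= self.goal   # task success
        truncated = self._t >= self.max_steps  # time limit reached
        reward = 1.0 if terminated else 0.0
        return obs, reward, terminated, truncated, {}

    def render(self):
        return f"pos={self.handler.get_state()['pos']:.2f}"

    def close(self):
        pass


env = Env(PointHandler())
obs, info = env.reset()
first = env.step(5.0)
second = env.step(5.0)
```

Swapping `PointHandler` for another backend with the same three methods would leave `Env` unchanged, which is the point of the two-level design.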

It is worth noting that most simulation environments provide the underlying APIs (corresponding to our Handler) and upper-level environments (corresponding to our Env) separately, such as SAPIEN [125] with ManiSkill [112], Isaac Sim [88] with IsaacLab [83], CoppeliaSim [101]/PyRep [47] with RLBench [48], and MuJoCo [114] with MuJoCo Playground [135]. This common practice supports the reasonableness of our two-level Handler and Env abstraction.

IX-E Backend Support

IX-E1 IsaacSim

Isaac Sim [88] is an advanced robotics simulation platform developed by NVIDIA. By leveraging high-fidelity physics, GPU acceleration, and photorealistic rendering, it enables rapid prototyping, testing, and deployment of AI-driven robotics solutions in virtual environments. Through seamless integration with NVIDIA’s Omniverse framework, Isaac Sim [88] offers robust features such as domain randomization, sensor simulation, and support for large-scale reinforcement learning, making it a powerful tool for both research and industrial applications.

IX-E2 IsaacGym

Isaac Gym [75] is a physics simulation environment designed for reinforcement learning research. Although it remains available for download, official support has ended. Nevertheless, multiple works published before 2024—such as hora [94], humanoid-gym [38], and IPC-graspsim [55]—were developed using Isaac Gym.

Key features of Isaac Gym include support for importing URDF and MJCF files with automatic convex decomposition, a GPU-accelerated tensor API for managing environment states and actions, and a range of sensors (e.g., position, velocity, force, torque). Additional capabilities include runtime domain randomization of physics parameters, Jacobian and inverse kinematics support, and customizable friction settings.

IX-E3 MuJoCo

MuJoCo [114] is a physics engine and simulation framework designed to accurately model the dynamics and control of complex robotic systems in real-time. Its name, MuJoCo, stands for Multi-Joint dynamics with Contact, highlighting its primary emphasis on efficient computation of contact forces and multi-joint dynamics. The engine supports advanced features such as frictional contact models, user-defined actuators, and customizable sensor modalities, allowing researchers and developers to prototype, test, and refine control algorithms across a wide range of robot morphologies and tasks.

A key strength of MuJoCo is its computational precision, which enables high simulation throughput and real-time interactive control. It supports rigid-body dynamics, articulated mechanisms, and a variety of constraints, making it suitable for tasks involving locomotion, manipulation, and reinforcement learning. Furthermore, MuJoCo’s flexible XML-based model description streamlines creating and modifying simulated environments, providing a straightforward way to experiment with novel designs. The compatibility between MuJoCo and Brax offers a high-speed, differentiable pipeline crucial for reinforcement learning. This powerful blend of accuracy, speed, and flexibility has solidified MuJoCo’s status as a leading choice in robotics research and machine learning, particularly for advanced control, motion planning, and reinforcement learning applications [29].

IX-E4 Genesis

Genesis [2] is a comprehensive physics platform developed for robotics and physics simulation research, unifying multiple core capabilities in a single environment. At its foundation is a universal physics engine, rebuilt from the ground up to simulate diverse materials and physical phenomena while seamlessly integrating various solvers. Alongside this engine, Genesis provides a swift, Python-friendly robotics simulation toolkit, an efficient photo-realistic rendering system, and a data-generation module that converts natural language prompts into multi-modal datasets. We leverage the Genesis backend to support loading, simulation, and rendering in the RoboVerse workflow.

IX-E5 SAPIEN

SAPIEN [125] is a robot simulation framework that allows highly efficient simulation and rendering of robotic tasks. It uses PhysX [86] as the underlying physics engine. We support the released version SAPIEN 2.2 in the MetaSim framework.

We use the multiprocessing library to support parallel environments in the Handler class for SAPIEN. When instantiating the environment from configurations, the desired number of processes is forked to simulate the different environments. To support the get_states and set_states APIs, data for different environments is distributed to the corresponding processes, and the return values are then gathered.
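The scatter-and-gather pattern described above can be sketched with Python's standard `multiprocessing` module. This is a toy illustration, not the MetaSim handler: the worker only stores a list of states instead of running a simulator, and the names `scatter`, `gather`, `worker_loop`, and `ParallelHandler` are hypothetical.

```python
import multiprocessing as mp


def scatter(states, num_workers):
    """Split a per-environment list into one contiguous chunk per worker."""
    base, extra = divmod(len(states), num_workers)
    chunks, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(states[start:start + size])
        start += size
    return chunks


def gather(chunks):
    """Concatenate per-worker results back into environment order."""
    return [item for chunk in chunks for item in chunk]


def worker_loop(conn):
    """Runs in a child process and owns the (toy) states of its environments."""
    local_states = []
    while True:
        cmd, payload = conn.recv()
        if cmd == "set_states":
            local_states = list(payload)
            conn.send(None)  # acknowledge so the parent can synchronize
        elif cmd == "get_states":
            conn.send(local_states)
        elif cmd == "close":
            conn.close()
            return


class ParallelHandler:
    """Parent-side handler that forwards requests to worker processes."""

    def __init__(self, num_workers):
        self.pipes, self.procs = [], []
        for _ in range(num_workers):
            parent, child = mp.Pipe()
            proc = mp.Process(target=worker_loop, args=(child,), daemon=True)
            proc.start()
            self.pipes.append(parent)
            self.procs.append(proc)

    def set_states(self, states):
        for pipe, chunk in zip(self.pipes, scatter(states, len(self.pipes))):
            pipe.send(("set_states", chunk))
        for pipe in self.pipes:
            pipe.recv()

    def get_states(self):
        for pipe in self.pipes:
            pipe.send(("get_states", None))
        return gather([pipe.recv() for pipe in self.pipes])

    def close(self):
        for pipe, proc in zip(self.pipes, self.procs):
            pipe.send(("close", None))
            proc.join()


if __name__ == "__main__":
    handler = ParallelHandler(num_workers=2)
    handler.set_states([{"pos": i} for i in range(5)])
    print(handler.get_states())
    handler.close()
```

Sending all requests before collecting any replies lets the workers run concurrently instead of one after another.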

IX-E6 PyBullet

PyBullet [17] is a fast and easy-to-use robotics simulator. It uses its own physics solvers for accurate and efficient simulations. We support the released version PyBullet 3.2 in the MetaSim framework.

We use the same techniques as for Sapien to achieve parallel-environment simulation.

IX-F Hybrid Simulation Implementation

MetaSim allows launching two simulators in a single process with one command. Taking our demo collection command as an example: python collect_demo.py --sim=mujoco --renderer=isaaclab --task=$task. The implementation is illustrated in Code IX-F.

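class HybridEnv:
    def __init__(self, env_physic: Env, env_render: Env):
        # One environment runs the physics; the other is used only for rendering.
        self.env_physic = env_physic
        self.env_render = env_render

    def step(self, action):
        # Apply the action in the physics simulator and read back the result.
        self.env_physic.handler.set_states(action=action)
        phys_states = self.env_physic.handler.get_states()
        # Mirror the physics states into the rendering simulator.
        self.env_render.handler.set_states(states=phys_states)
        self.env_render.handler.refresh_render()
        states = self.env_render.handler.get_states()
        return ...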

Pseudocode for implementing hybrid simulation using two different simulator environments simultaneously. The core of this implementation is using states as a unified representation across both simulation environments.

X Asset Conversion

X-A Asset types

The diverse landscape of robotic assets, stemming from prior research initiatives [145, 48, 84] and a multitude of software platforms [114, 75, 125], necessitates a robust strategy for managing a wide array of file formats. To facilitate dependable cross-simulator training and uphold data integrity throughout the development lifecycle, the establishment of an efficient and reliable asset conversion pipeline is of paramount importance [26]. Such a pipeline is crucial for ensuring seamless interoperability, minimizing potential data loss or inaccuracies, and promoting the uniform application of metadata and configurations across disparate simulation environments. A selection of frequently encountered asset formats includes, but is not limited to, MuJoCo XML control files [114], URDF files [8], and USD files [88].

The three predominant file formats in robotics simulation are MJCF, URDF, and USD, each of which serves distinct purposes and offers unique capabilities. MJCF (MuJoCo Configuration Format) stands out for its exceptional expressiveness in physics simulation, featuring sophisticated capabilities to model complex dynamical systems including tendons, actuators, and advanced joint configurations, along with an integrated compiler for handling complex compile-time computations [114]. URDF (Unified Robot Description Format), while more constrained in its feature set, has emerged as the de facto standard in robotics due to its remarkable cross-platform compatibility and universal adaptability across various simulation environments including Isaac Sim [88], Isaac Gym [75], MuJoCo [114], Gazebo, and PyBullet [16], making it ideal for robot model exchange despite its limitations in representing parallel mechanisms or complex sensor configurations [8]. USD (Universal Scene Description), originally developed by Pixar Animation Studios, excels in high-fidelity rendering and scene composition through its sophisticated layering system and variant sets [22], making it particularly valuable for applications requiring advanced visual properties and collaborative workflows [87], although its physics simulation capabilities are more limited compared to dedicated robotics formats like MJCF [26].

TABLE VII: Comparison of Robot Description Formats

Features MJCF URDF USD
Basic Geometries
Mesh Support
Texture Support Limited
Material Properties Basic
Physics Properties Limited
Joint Types Many Basic Basic
Collision Properties Advanced Basic Advanced
Deformable Objects
Animation Support Limited
Scene Composition Basic Advanced
File Format XML XML ASCII/Binary

X-B Conversion Pipeline

Given that our simulation pipeline primarily utilizes Isaac Sim for rendering while many of our assets are originally stored in MJCF format, a two-stage conversion pipeline (MJCF → URDF → USD) becomes necessary and advantageous. This approach leverages URDF as an intermediate format for several reasons. First, while direct conversion from MJCF to USD is theoretically possible, such conversion would be complex and error-prone due to MJCF’s rich feature set for physics properties (like tendons and actuators) that lack direct equivalents in USD [118]. Instead, converting to URDF first allows us to standardize the robot’s basic kinematic and dynamic properties in a format that has well-established conversion tools and widespread support. The subsequent URDF to USD conversion benefits from Isaac Sim’s robust URDF importing capabilities, which have been extensively tested and optimized for robotics applications. This two-stage pipeline thus ensures more reliable asset conversion while maintaining essential physical properties and compatibility across different simulation environments.

X-B1 MJCF to URDF conversion

We implemented our own MJCF to URDF converter by first parsing each model with MuJoCo’s MJCF importer, then exporting all texture, collision mesh, and joint information to the corresponding URDF elements. The design is inspired by Genesis [2], which builds a dedicated class for each asset object that encodes all joint, texture, and mesh information. We then recursively write the body information to URDF and align each link with its texture.

To parse link, joint, and body information from the MJCF file, we leverage MuJoCo’s parsing capabilities to load the MJCF XML into a MuJoCo model structure. From this parsed model, we employ a recursive approach, starting from the root body and descending into each child body to systematically process the hierarchical structure. For each body, we extract detailed link properties such as name, position, orientation, inertial characteristics, and associated geometry. Simultaneously, we parse joint information connected to each body, including joint type, limits, and axis of motion. All of this extracted link and joint data is systematically organized and stored in dictionary structures. These dictionaries serve as intermediate representations, holding all the necessary information from the MJCF model in a structured format that is readily accessible for subsequent stages of the URDF conversion process.
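The recursive traversal described above can be illustrated with a minimal sketch. Note the assumptions: it reads the XML with Python's standard `xml.etree.ElementTree` rather than MuJoCo's own loader (so it does not resolve defaults, includes, or compile-time computations), and the dictionary keys, the helper names, and the toy model `MJCF_EXAMPLE` are hypothetical, not the converter's actual data layout.

```python
import xml.etree.ElementTree as ET

# Toy MJCF model used only for illustration.
MJCF_EXAMPLE = """
<mujoco>
  <worldbody>
    <body name="base" pos="0 0 0">
      <geom type="box" size="0.1 0.1 0.05"/>
      <body name="arm" pos="0 0 0.1">
        <joint name="shoulder" type="hinge" axis="0 0 1" range="-1.57 1.57"/>
        <geom type="capsule" size="0.02 0.15"/>
      </body>
    </body>
  </worldbody>
</mujoco>
"""


def floats(text, default):
    """Parse a space-separated attribute into a list of floats."""
    return [float(v) for v in text.split()] if text else list(default)


def collect(body, parent, links, joints):
    """Store one body as a link, its joints, then recurse into child bodies."""
    name = body.get("name")
    links[name] = {
        "parent": parent,
        "pos": floats(body.get("pos"), (0.0, 0.0, 0.0)),
        "geoms": [dict(g.attrib) for g in body.findall("geom")],
    }
    for j in body.findall("joint"):
        joints[j.get("name")] = {
            "parent": parent,
            "child": name,
            "type": j.get("type", "hinge"),  # hinge is MJCF's default joint type
            "axis": floats(j.get("axis"), (0.0, 0.0, 1.0)),
            "range": floats(j.get("range"), ()),
        }
    for child in body.findall("body"):
        collect(child, name, links, joints)


def parse_mjcf(xml_text):
    """Return (links, joints) dictionaries for every body under worldbody."""
    root = ET.fromstring(xml_text)
    links, joints = {}, {}
    for body in root.find("worldbody").findall("body"):
        collect(body, None, links, joints)
    return links, joints


links, joints = parse_mjcf(MJCF_EXAMPLE)
```

The two dictionaries play the role of the intermediate representation: later stages only read `links` and `joints` and never touch the MJCF file again.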

Aligning Meshes and textures

The management of collision meshes across existing asset libraries presents a notable challenge, as these assets are typically stored in various formats including .msh, .obj, and .stl files. While URDF natively supports .obj and .stl formats, the conversion of .msh files into URDF-compatible formats requires careful consideration. Although MuJoCo’s repository provides a conversion utility for transforming .msh files to .obj format—accomplished by parsing the .msh files through the MuJoCo interface and subsequently exporting vertex and face information—this approach introduces potential complications with texture mapping alignment.

The complexity arises from the specific requirements of texture files, which are predominantly stored as albedo PNG files. These textures depend on precise UV mapping coordinates within the .obj file to ensure proper alignment. The current .msh to .obj conversion utility provided in the MuJoCo repository does not adequately address texture support, leading to potential misalignment issues in the converted models. This limitation is particularly evident in comprehensive robotics frameworks such as LIBERO [68], where both static and articulated objects frequently exhibit texture alignment discrepancies following the .msh to .obj conversion process.

Fortunately, we discovered that many asset collections maintain redundant mesh representations, often including a properly UV-mapped .obj file alongside the .msh file, typically sharing the same filename or designated as "textured.obj". Leveraging this observation, we implemented a robust mesh alignment pipeline that follows a hierarchical decision process:

Following the mesh format resolution, the pipeline systematically maps these processed mesh files back to their corresponding links within the URDF structure, maintaining the integrity of the robot’s geometric representation while preserving texture information where possible.
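One plausible reading of the mesh resolution step is sketched below. The ordering of the checks (same-name .obj first, then "textured.obj", then falling back to the .msh file) is an assumption drawn from the observation above, and the function name `resolve_mesh` and the returned labels are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_mesh(msh_path):
    """Pick the mesh file to reference in the URDF for one .msh asset.

    Returns (path, source), where source records which rule fired.
    """
    msh_path = Path(msh_path)
    same_name = msh_path.with_suffix(".obj")
    if same_name.is_file():
        return same_name, "same_name_obj"  # UV-mapped twin of the .msh file
    textured = msh_path.parent / "textured.obj"
    if textured.is_file():
        return textured, "textured_obj"  # conventional name in some libraries
    return msh_path, "needs_conversion"  # fall back to .msh -> .obj export


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {
        "mug": ["mug.msh", "mug.obj"],
        "bowl": ["bowl.msh", "textured.obj"],
        "plate": ["plate.msh"],
    }
    for folder, files in layout.items():
        (root / folder).mkdir()
        for name in files:
            (root / folder / name).touch()
    decisions = {
        folder: resolve_mesh(root / folder / f"{folder}.msh")[1]
        for folder in layout
    }
```

Only assets that end in `needs_conversion` go through the texture-unaware .msh export, which keeps the misalignment risk confined to meshes with no UV-mapped counterpart.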

Building URDF

Once all the conversions are complete, assembling the URDF is straightforward: we process the robot links and joints, incorporating their properties and relationships into the URDF format. This automated approach ensures a robust and flexible method for generating URDF files, accommodating a wide range of robot configurations and properties derived from the preceding conversion steps.
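As a rough illustration of the assembly step, the sketch below writes link and joint dictionaries into URDF XML with the standard library. The dictionary layout, the `build_urdf` name, and the placeholder `effort` and `velocity` values are assumptions; only the hinge-to-revolute and slide-to-prismatic mapping reflects how the two formats name these joint types.

```python
import xml.etree.ElementTree as ET

# MJCF joint type -> URDF joint type (only the two common cases).
MJCF_TO_URDF_JOINT = {"hinge": "revolute", "slide": "prismatic"}


def vec(values):
    """Format a numeric sequence as a space-separated URDF attribute."""
    return " ".join(str(v) for v in values)


def build_urdf(robot_name, links, joints):
    """Assemble a URDF string from intermediate link/joint dictionaries."""
    robot = ET.Element("robot", name=robot_name)
    for name, link in links.items():
        link_el = ET.SubElement(robot, "link", name=name)
        if link.get("mesh"):
            visual = ET.SubElement(link_el, "visual")
            geometry = ET.SubElement(visual, "geometry")
            ET.SubElement(geometry, "mesh", filename=link["mesh"])
    for name, joint in joints.items():
        urdf_type = MJCF_TO_URDF_JOINT[joint["type"]]
        joint_el = ET.SubElement(robot, "joint", name=name, type=urdf_type)
        ET.SubElement(joint_el, "parent", link=joint["parent"])
        ET.SubElement(joint_el, "child", link=joint["child"])
        ET.SubElement(joint_el, "origin", xyz=vec(joint["xyz"]), rpy="0 0 0")
        ET.SubElement(joint_el, "axis", xyz=vec(joint["axis"]))
        lower, upper = joint["limits"]
        # effort and velocity are placeholder values, not taken from MJCF.
        ET.SubElement(joint_el, "limit", lower=str(lower), upper=str(upper),
                      effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")


example_links = {"base": {"mesh": "meshes/base.obj"}, "arm": {"mesh": "meshes/arm.obj"}}
example_joints = {
    "shoulder": {"type": "hinge", "parent": "base", "child": "arm",
                 "xyz": (0, 0, 0.1), "axis": (0, 0, 1), "limits": (-1.57, 1.57)},
}
urdf_text = build_urdf("toy_arm", example_links, example_joints)
```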

Although this pipeline works for most MJCF assets, certain MJCF files, particularly those within specific packages or directories, require case-by-case modifications to the conversion approach. These exceptions arise from the complexity and variability of MJCF file structures across different projects and asset libraries. The subsequent table details the special treatment applied to specific packages, along with the conversion success rate achieved for each package.

X-B2 URDF to USD conversion

Isaac Sim has implemented a robust solution for converting URDF files to USD format. The conversion process comprehensively preserves the robot’s structural and kinematic information, including joint hierarchies, geometric properties, and physical attributes. The implementation demonstrates exceptional fidelity in translating complex robotic descriptions, ensuring that all essential components—such as joint configurations, collision geometries, and visual representations—are accurately encoded in the resulting USD files.

Given the proprietary nature of Isaac Sim’s conversion implementation, we utilize their framework as an external tool in our pipeline. This approach leverages the proven reliability and performance of Isaac Sim’s converter while maintaining compatibility with our broader system architecture. The conversion process serves as a critical bridge between standard robotics formats and the high-performance USD representation required for our simulation environment.

XI Task and Data Migration

XI-A ManiSkill

ManiSkill [84, 37, 112] provides a series of robotic manipulation tasks under single-arm or dual-arm settings.

Tasks and assets

We migrate basic single-arm tasks and demonstrations to RoboVerse, including pick-and-place tasks like PickCube and PickSingleYCB, as well as insertion tasks like PegInsertionSide and PlugCharger. The corresponding assets are manually crafted with primitives or processed from the mesh files, with proper physics APIs set up.

Demonstrations

For each task, a large number of demonstration trajectories are available in the released data. Notably, the data does not come with the initial scene states, so we obtain them by replaying the demonstrations within the SAPIEN simulator. With the specified seed set, the states are recovered by the random samplers. The success checkers are implemented according to the task designs.

XI-B RLBench

RLBench [48] is a large-scale benchmark and learning environment for robotic manipulation, featuring 100 diverse, hand-designed tasks ranging in complexity, from simple actions like reaching to multi-stage tasks like opening an oven and placing a tray inside. Each task includes an infinite supply of demonstrations generated via waypoint-based motion planning.

Tasks and assets

We roll out ∼2K trajectories in RLBench [48] for each task, and migrate them to RoboVerse.

XI-C CALVIN

CALVIN [82] provides 6 hours of teleoperation trajectories in 4 environments, each involving an articulated table with three blocks in blue, pink, and red.

Tasks and assets

We migrate the demonstrations in all 4 environments and transform the original assets (URDF for the table, and primitives for the cubes) into USD files with proper physics APIs.

Demonstrations

We segment the trajectories according to the text annotations, which specify the task category (e.g., PlaceInSlider), the text instruction (e.g., place the red block in the slider), and the timestamps of the demonstration segment. The state of the first frame is adopted as the initial scene state.
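The segmentation can be pictured with the short sketch below. The annotation fields (`task`, `instruction`, `start`, `end`), the inclusive end index, and the function name are illustrative assumptions; the actual CALVIN annotation layout may differ.

```python
def segment_trajectory(states, actions, annotations):
    """Cut one long teleoperation recording into per-task demonstrations.

    Each annotation carries a task category, a text instruction, and the
    start/end frame indices (end is treated as inclusive here).
    """
    episodes = []
    for ann in annotations:
        start, end = ann["start"], ann["end"]
        episodes.append({
            "task": ann["task"],
            "instruction": ann["instruction"],
            "init_state": states[start],  # first frame becomes the initial scene state
            "actions": actions[start:end + 1],
        })
    return episodes


# Toy recording: frame i has state "s<i>" and action "a<i>".
states = [f"s{i}" for i in range(10)]
actions = [f"a{i}" for i in range(10)]
annotations = [
    {"task": "PlaceInSlider", "instruction": "place the red block in the slider",
     "start": 2, "end": 5},
    {"task": "OpenDrawer", "instruction": "open the drawer", "start": 6, "end": 9},
]
episodes = segment_trajectory(states, actions, annotations)
```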

Success checkers

We carefully implement the success checkers according to the original implementation so that failed executions can be filtered out. This is necessary because the timestamps in the dataset are coarsely annotated, which may cause part of the demonstrations to contain failed executions.

XI-D Meta-World

Meta-World [134] is a widely used benchmark for multi-task and meta-reinforcement learning, comprising 50 distinct tabletop robotic manipulation tasks involving a Sawyer robot.

Tasks and Assets

We integrate five representative tasks into RoboVerse: DrawerOpen, DrawerClose, DoorClose, WindowOpen, and WindowClose. The corresponding assets are manually converted from MJCF to USD files with appropriate physics APIs.

Demonstrations: As the benchmark does not provide demonstrations, we generate trajectories for each task by rolling out reinforcement learning policies from [126].

XI-E Open6DOR

Open6DOR is a benchmark for open-instruction 6-DoF object rearrangement tasks, which requires embodied agents to move the target objects according to open instructions that specify their 6-DoF poses.

Tasks and Assets

The synthetic object dataset comprises 200+ items spanning 70+ distinct categories. Originally derived from YCB [7] and Objaverse-XL [22], the objects are carefully filtered and scaled using a standardized format of mesh representation. Overall, the Open6DOR Benchmark consists of 5k+ tasks, divided into the position-track, rotation-track, and 6-DoF-track, each providing manually configured tasks along with comprehensive and quantitative 3D annotations.

Success checkers

We determine success by comparing the target object’s final pose with the annotated ground-truth pose range.

XI-F ARNOLD

ARNOLD [36] is a benchmark for language-conditioned manipulation. The benchmark uses motion planning and keypoints for robot manipulation tasks, focusing on fine-grained language understanding.

Tasks and Assets

We integrate six out of eight tasks from ARNOLD [36] into RoboVerse: picking up objects, reorienting objects, opening/closing drawers, and opening/closing cabinets.

Demonstrations: As the benchmark does not use trajectory-level demonstrations, we use motion planning for trajectory generation to interpolate between keypoints.

XI-G robosuite & MimicGen

robosuite [145] provides a suite of task environments for robotic manipulation, built on the MuJoCo physics engine. Each task is implemented as a separate class, with most configuration details embedded in the source code. Based on these environments, MimicGen [79] offers thousands of demonstrations, serving as a widely used benchmark for imitation learning.

Tasks and Assets

For tasks with separate object description files (MJCF), we directly migrate the corresponding assets through our Asset Conversion pipeline. However, some tasks contain hard-coded assets within the source code, such as a hammer composed of multiple cubes, cylinders and other primitives with carefully designed relative poses. To integrate these tasks, we will manually reconstruct the assets within our framework. We also argue that hard-coded asset and task definitions, as opposed to modular task descriptions, are not scalable for future robotic task benchmarking.

Demonstrations

We convert MimicGen demonstrations into our format. Specifically, we transform the robot actions from 6-DoF Cartesian space representations to joint space. Additionally, the state of the first frame is adopted as the initial scene state.

Success Checkers

We meticulously implement success checkers based on the original definitions to ensure failed executions are effectively filtered out.

XI-H SimplerEnv

SimplerEnv is a set of tasks and methods designed to do trustworthy benchmarking in simulation for manipulation policies that can reflect the real-world success rate.

There are 25 different tasks in SimplerEnv in total. We ignore all tasks that are just a subset of another task and migrate a total of 6 tasks and 52 object assets to RoboVerse. All tasks use the Google Robot.

SimplerEnv provides some controller models trained with the RT-1 [4] and RT-X [15] datasets. We do not use the trajectories from the datasets directly because some environmental settings differ from those of SimplerEnv. Instead, we use the trained models to collect trajectories: hooks are inserted into the original SimplerEnv codebase to extract and maintain the recordings at different stages of simulation, and we then roll out the model trained with the RT-1 dataset on each task.

XI-I GAPartNet

For tasks in GAPartNet [34], we generate both motion planning [34] and reinforcement learning [32] trajectories. GAPartNet is implemented in Isaac Gym [75] with various articulated objects. To integrate it into RoboVerse, we first align all articulated object initial states to the MetaSim format and convert the asset format to USD for compatibility across different simulators.

For trajectory generation:

(1) Motion Planning: GAPartNet [34] introduces a part-centric manipulation approach. We roll out heuristics to generate manipulation trajectories, providing three demonstrations per part with different object and part initial states. (2) Reinforcement Learning Rollout: The follow-up work, PartManip [32], proposes several reinforcement learning methods. We re-train all policies based on our robot setup and roll out trajectories for dataset collection. With aligned task configurations, trajectories, and assets, we successfully adapt GAPartNet into RoboVerse.

XI-J GAPartManip

Instead of providing direct demonstrations, GAPartManip [18] offers a large-scale, part-oriented, scene-level dataset with annotations for actionable interaction poses. We utilize the mesh-level grasping pose annotations in this dataset to generate diverse demonstrations for articulated object manipulation.

Tasks and Assets

We currently implement two tasks: OpenBox and OpenToilet. For the OpenBox task, we collect 12 object assets from the Box category in the original dataset. For the OpenToilet task, we gather 30 objects from the Toilet category. We convert these assets into USD files with appropriate physics APIs to ensure compatibility with our simulation environment.

Demonstrations

We generate demonstrations for our tasks in simulation using motion planning with cuRobo [110]. First, we filter potential grasping poses for the target object link by assessing their feasibility through motion planning. Specifically, we discard poses that the end-effector cannot reach or that would cause a collision between the robot and the object. Next, we generate an end-effector pose trajectory to complete the task using heuristics. Based on the object’s kinematic tree, we could define an ideal trajectory. We then apply motion planning to perform inverse kinematics, computing the corresponding joint poses of the robot along this trajectory. Finally, we execute the planned trajectory in simulation to verify task completion, saving successful trajectories as demonstrations. The entire demonstration generation process is conducted in Isaac Sim [88].

Success Checkers

To determine task success, we require the manipulated object to be opened by at least 60 degrees for all tasks.
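A checker of this kind reduces to a threshold on the articulated joint angle. The sketch below assumes the joint state is reported in radians and that the closed configuration corresponds to a known reference angle; both the function name and these conventions are illustrative.

```python
import math


def is_opened(joint_angle_rad, closed_angle_rad=0.0, threshold_deg=60.0):
    """Return True when the joint has rotated at least threshold_deg from closed."""
    opened_deg = math.degrees(abs(joint_angle_rad - closed_angle_rad))
    return opened_deg >= threshold_deg


lid_half_open = is_opened(math.radians(45.0))   # below the threshold
lid_wide_open = is_opened(math.radians(75.0))   # above the threshold
```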

XI-K GraspNet-1B

GraspNet-1B [27] is a general object grasping dataset for predicting 6-DoF grasping poses given partial point cloud input. It contains 256 real-world tabletop scenes consisting of 88 different objects in total. We select 58 objects as our target grasping objects based on whether the real items are available for purchase, because we need to evaluate our policies on grasping them in real-world experiments. To generate grasping demonstrations, we use cuRobo [110] as the motion planner to generate robot end-effector trajectories that start from a fixed initial pose and end at a target object grasping pose. The grasping pose is obtained from the grasping annotations used to train GraspNet [27]. We also randomize the object positions to generate more diverse layouts. Finally, we validate the trajectories in our framework and filter out invalid ones by controlling robots to follow the generated grasping trajectories. In the end, we generated about 100k valid grasping trajectories.

XI-L GarmentLab

GarmentLab [72] is the first robotic manipulation benchmark for deformable object and garment manipulation. It integrates 10 categories of versatile garment assets and the total number of USD assets reaches 6k. To generate manipulation demonstrations, we directly roll out the trajectories provided by the official codebase in Isaac Sim and collect the corresponding state information in a parallel process. Although the trajectory provided by the official codebase is limited and hard-coded, we further extend the number of demonstrations by applying different garments and textures, and all the demonstrations are validated by the original success checker. Finally, we have successfully collected 6k trajectories.

XI-M UniDoorManip

UniDoorManip [67] provides an articulated manipulation environment reflecting different realistic door manipulation mechanisms, and a large-scale door dataset containing 6 door categories with hundreds of door bodies and handles stored in URDF format. We convert those door assets into USD format with physics APIs from Isaac Sim and further manually verify the correctness of the joint-link relationships. Demonstrations are collected by directly rolling out the hard-coded trajectories in Isaac Gym. We eventually collect about 1k successful, valid demonstrations.

XI-N RLAfford

RLAfford [35] investigates the generalization ability of Deep Reinforcement Learning models on articulated object manipulation tasks with the presence of a computer vision model that is co-trained with it in an end-to-end manner. This work provided a dataset of articulated objects and 8 tasks for benchmarking.

In RoboVerse, we have adapted 4 tasks (open cabinet, open drawer, close cabinet, close drawer) and in total 40k trajectories from RLAfford.

In the task adaptation, we include 40 articulated objects from the RLAfford dataset and use the same robot description file as RLAfford. We then record 1000 trajectories for each object in its corresponding task.

The trajectory recording is achieved with several hooks we inserted into the original RLAfford codebase. The hooks are used to extract and maintain the recordings at different stages of simulation. We evaluate the released RLAfford model with the hook-inserted scripts. In the initialization stage, objects and robots are initialized with randomization, and their pose and DoF information are recorded. For each simulation step, the DoF position information of objects and robots is recorded in the trajectories. In the end, for each object, a separate trajectory file of 1000 different trajectories is saved in the RoboVerse-supported format.
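The hook mechanism can be sketched as a small recorder object whose callbacks are invoked by the host simulation loop. Everything here is hypothetical: the class and method names, the record layout, and the use of `pickle` for serialization are illustrative stand-ins, and the loop at the bottom merely imitates where the original codebase would call the hooks.

```python
import pickle


class TrajectoryRecorder:
    """Collects per-episode recordings through three hook callbacks."""

    def __init__(self):
        self.trajectories = []
        self._current = None

    def on_init(self, poses, dof_positions):
        # Called once after randomized initialization of objects and robots.
        self._current = {
            "init": {"poses": poses, "dof": list(dof_positions)},
            "dof_steps": [],
        }

    def on_step(self, dof_positions):
        # Called after every simulation step.
        self._current["dof_steps"].append(list(dof_positions))

    def on_episode_end(self):
        self.trajectories.append(self._current)
        self._current = None

    def serialize(self):
        # All trajectories of one object end up in a single payload.
        return pickle.dumps(self.trajectories)


# Stand-in for the host simulation loop calling the hooks.
recorder = TrajectoryRecorder()
for episode in range(3):
    recorder.on_init(poses={"cabinet": (0.5, 0.0, 0.0)}, dof_positions=[0.0, 0.0])
    for step in range(4):
        recorder.on_step([0.1 * step, 0.05 * step])
    recorder.on_episode_end()
payload = recorder.serialize()
```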

XI-O LIBERO

LIBERO [68] manages data loading and task execution through a combination of INIT (initialization) files, BDDL (BEHAVIOR Domain Definition Language) files, and HDF5 datasets. Specifically, the initialization files define scene layouts, object properties, and basic task goals; the BDDL format captures semantic details and object affordances; and the HDF5 files store structured data such as object positions and robot actions for dynamic retrieval at runtime.

To migrate a LIBERO task into MetaSim, we parse the relevant BDDL file to identify which objects are involved and what type of manipulation context is required. We then read the robot and object initial states from the INIT files, followed by the corresponding robot actions from the HDF5 dataset. These elements are combined into our PKL file format, and the participating objects are recorded in our MetaCfg. This process ensures that all necessary components of a LIBERO task, namely the participating objects, initial states, and action data, are fully translated and ready for execution in MetaSim.

We further augment the data by randomly sampling initial positions around each LIBERO demonstration, thus increasing the effective number of demos well beyond the original 50 per task. The spatial sampling range is dynamically chosen based on the task context and object dimensions, ensuring that the augmented configurations remain physically plausible.
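A simple way to picture this augmentation is uniform sampling in a box around the demonstrated position, with the box size tied to the object's footprint. The `scale` factor, the uniform distribution, and the function name below are assumptions made for illustration; the task-dependent rule used in practice is not specified here.

```python
import random


def sample_initial_positions(demo_xy, object_size_xy, num_samples, scale=0.5, seed=0):
    """Sample planar positions around a demonstration's initial position.

    The sampling half-range along each axis is `scale` times the object's
    extent along that axis, so larger objects are perturbed more.
    """
    rng = random.Random(seed)
    half_x = scale * object_size_xy[0]
    half_y = scale * object_size_xy[1]
    return [
        (demo_xy[0] + rng.uniform(-half_x, half_x),
         demo_xy[1] + rng.uniform(-half_y, half_y))
        for _ in range(num_samples)
    ]


augmented = sample_initial_positions(
    demo_xy=(0.40, -0.10), object_size_xy=(0.08, 0.04), num_samples=20
)
```

Each sampled position would still need to be validated by replaying the demonstration, since a shifted object can make the recorded actions fail.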

XII Task Generation

XII-A Robot & Object Generation Protocol

Our task generation pipeline (Fig. 16) begins with a user prompt describing the desired theme or constraints of a robotic task (e.g., "place the butter in the drawer and close it"). From here, the system proceeds in two main phases, mediated by large generative model calls:

    1. call_gpt_to_generate_task(): Conceptual Task Generation. This initial function queries the model for a high-level task overview. It requests:

    • A unique task name (e.g., “ButterDrawerTask”).

    • A short, human-readable instruction (e.g., “Place the butter in the drawer, then close the drawer.”).

    • A candidate list of robots and objects to appear in the scenario, referencing an internal asset library (see below).
      The large generative model draws on its generative abilities to propose creative or contextually relevant tasks, while remaining loosely guided by the user prompt [123, 122, 39, 143]. As shown in Fig. 16, the model might retrieve a “drawer” asset from a different benchmark and a “butter” asset from a separate dataset, combining them into a single scene idea.
    2. call_gpt_to_get_init_state(): Physical Layout Refinement. After receiving the conceptual description, we call the model again to specify x, y coordinates for each listed item. During this second phase, the user can provide prompts that include minimal bounding constraints (e.g., permissible table edges, object height) to help the model generate various initial states via few-shot learning.

Asset Library. To ground the large generative model’s outputs in realistic data, we maintain an asset library (via JSON files) that describes each robot or object’s core attributes (e.g., asset filepath, default rotation, size). The two core functions above selectively pull from this library.
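To make the role of the asset library concrete, the sketch below shows what such a JSON file and a lookup helper might look like. The schema, the entry names, the file paths, and the `lookup_assets` function are all hypothetical; only the listed attribute kinds (filepath, default rotation, size) come from the description above.

```python
import json

# Hypothetical asset library content; paths and values are placeholders.
ASSET_LIBRARY_JSON = """
{
  "franka": {"kind": "robot", "filepath": "robots/franka.usd",
             "default_rotation": [1, 0, 0, 0], "size": [0.3, 0.3, 1.0]},
  "drawer": {"kind": "object", "filepath": "objects/drawer.usd",
             "default_rotation": [1, 0, 0, 0], "size": [0.4, 0.5, 0.3]},
  "butter": {"kind": "object", "filepath": "objects/butter.usd",
             "default_rotation": [1, 0, 0, 0], "size": [0.1, 0.05, 0.03]}
}
"""


def lookup_assets(library, names):
    """Return library entries for the names proposed by the model.

    Unknown names raise ValueError so that items the model invented
    are rejected instead of silently entering the scene.
    """
    missing = [name for name in names if name not in library]
    if missing:
        raise ValueError(f"not in asset library: {missing}")
    return {name: library[name] for name in names}


library = json.loads(ASSET_LIBRARY_JSON)
scene_assets = lookup_assets(library, ["butter", "drawer"])
```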

Input and Output Format.

Figure 16: Illustration of the two-phase generation protocol. A user prompt guides the LLM to propose an overall task and item list. The system then refines object positions and merges them into a final initial state.

XIII Teleoperation

Figure 17: Sequential demonstration of smartphone-based control for stack cube and close box tasks.

Ensuring flexible and intuitive remote operation is critical in robotic teleoperation systems, particularly when collecting large volumes of high-quality data. In this work, we designed a suite of input methods to facilitate robot teleoperation within the MetaSim infrastructure. By supporting keyboard, DualSense Joystick, smartphone, and VR-based controls, our system accommodates varying user preferences and experimental needs. This section details our design rationale, implementation steps, and practical considerations for each control interface.

XIII-A Keyboard

Keyboard input is an accessible method for controlling robots in simulation. Our implementation supports multi-key combinations for diagonal movement and enables full six-degree-of-freedom manipulation of the end effector. Translational movement follows the world coordinate frame (UP: +X, DOWN: -X, LEFT: +Y, RIGHT: -Y, ‘e’: +Z, ‘d’: -Z), while rotations in the local EE frame are controlled via ‘q’/‘w’ (roll), ‘a’/‘s’ (pitch), and ‘z’/‘x’ (yaw). The spacebar toggles the gripper. To assist users and avoid hotkey conflicts with the simulation viewer, we provide an operation window displaying instructions using pygame. While efficient and hardware-independent, this method lacks 3D spatial representation, reducing user intuition. Additionally, Euler angle-based rotation control risks gimbal lock, potentially leading to loss of rotational degrees of freedom and failure in certain configurations.
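The key layout above can be expressed as a lookup table that turns the set of currently pressed keys into an end-effector delta command. This is a sketch under assumptions: the step sizes, the sign convention within each rotation key pair (for example, ‘q’ positive and ‘w’ negative roll), and the function name are not specified in the text and are chosen here for illustration. Key names are plain strings rather than pygame key codes to keep the example dependency-free.

```python
# World-frame translation directions per key (x, y, z).
TRANSLATION_KEYS = {
    "up": (1, 0, 0), "down": (-1, 0, 0),
    "left": (0, 1, 0), "right": (0, -1, 0),
    "e": (0, 0, 1), "d": (0, 0, -1),
}
# Local end-effector-frame rotation directions per key (roll, pitch, yaw).
# The sign assigned to each key of a pair is an assumption.
ROTATION_KEYS = {
    "q": (1, 0, 0), "w": (-1, 0, 0),
    "a": (0, 1, 0), "s": (0, -1, 0),
    "z": (0, 0, 1), "x": (0, 0, -1),
}


def keys_to_delta(pressed, lin_step=0.01, ang_step=0.05):
    """Combine all pressed keys into (delta_pos, delta_rot, toggle_gripper)."""
    delta_pos = [0.0, 0.0, 0.0]
    delta_rot = [0.0, 0.0, 0.0]
    for key in pressed:
        for axis, sign in enumerate(TRANSLATION_KEYS.get(key, (0, 0, 0))):
            delta_pos[axis] += sign * lin_step
        for axis, sign in enumerate(ROTATION_KEYS.get(key, (0, 0, 0))):
            delta_rot[axis] += sign * ang_step
    return delta_pos, delta_rot, "space" in pressed


# Holding UP and LEFT together yields a diagonal move in the XY plane.
diagonal = keys_to_delta({"up", "left"})
```

Because contributions are summed per axis, opposite keys held together cancel out, and any multi-key combination maps to a single command per control tick.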

XIII-B Smartphone

Figure 18: Visualization of the smartphone’s local coordinate system, world-frame orientation, and app functionality: six buttons control translation, and two switches toggle orientation control and gripper state.

Figure 19: The smartphone app enables 6-DoF control using orientation sensing and multi-touch buttons for translation commands, while the simulated robot’s movements are visualized in real-time on the workstation.

Modern smartphones, equipped with advanced sensors and wireless communication, offer an ideal low-cost solution for intuitive teleoperation from any location. However, existing smartphone-based 6-DoF methods, such as those relying on accelerometers or vision-based Visual Inertial Odometry (VIO) systems (e.g., ARKit), suffer from instability due to sensor noise, low update rates, or weak visual features [40, 76, 77, 78]. Additionally, no open-source Android app exists for such implementations. To overcome these limitations, we adopt a hybrid approach: using smartphone orientation for motion control and on-screen buttons for precise translation. Unlike the keyboard interface, where roll, pitch, and yaw are controlled incrementally via discrete keypresses (i.e., delta orientation adjustments), the smartphone directly provides absolute orientation data in the form of quaternions. Quaternions, due to their compactness and immunity to gimbal lock, allow for a more stable and accurate representation of the smartphone’s orientation in the world frame. As illustrated in Fig. 18, real-time data from the smartphone’s inclination, rotation, and magnetic field sensors is fused to compute spatial orientation with ±5° accuracy at a frequency of 50 Hz. This data is transmitted via WebSocket, ensuring low-latency communication. The app interface features six buttons for translation control in the local coordinate system and two switches for toggling orientation updates and gripper control. Multi-touch input is supported to enable users to send combined control signals, such as simultaneous movement along multiple axes, improving control flexibility and efficiency. As shown in Fig. 19 and Fig. 17, tilting the smartphone controls the gripper’s orientation, while combining multi-touch signals from on-screen buttons enables precise and complex manipulation in 3D space.
However, to mitigate magnetic interference, users should maintain a minimum distance of 10 cm from strong magnetic sources such as laptops and other electronic devices. This design optimizes resource utilization, providing a high-precision 6-DoF remote operation experience at minimal cost, rivaling professional-grade teleoperation systems.
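On the workstation side, an absolute orientation quaternion can be turned into a rotation matrix for the end-effector target with the standard unit-quaternion formula. The sketch below assumes a (w, x, y, z) component order, which is a convention chosen here and not stated for the app; the function name is likewise illustrative.

```python
import math


def quat_to_matrix(w, x, y, z):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The input is normalized first, since fused sensor readings are
    rarely exactly unit length.
    """
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


# A 90-degree rotation about the z axis maps the x axis onto the y axis.
half = math.radians(90.0) / 2
yaw_90 = quat_to_matrix(math.cos(half), 0.0, 0.0, math.sin(half))
```

Since each message carries the full orientation, a dropped packet does not accumulate error the way missed incremental keypresses would.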

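Method | PickCube (Simple) | MoveSliderLeft (Simple) | Object Set 1 (Language-conditioned Grasping) | Object Set 2 (Language-conditioned Grasping) | Object Set 3 (Language-conditioned Grasping)
OpenVLA [56] | 40.0 | 45.0 | 46.0 | 33.3 | 14.4
Octo [89] | 50.0 | 30.0 | 42.0 | 14.4 | 2.2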

TABLE VIII: Vision-Language-Action (VLA) Model Results on RoboVerse Imitation Learning Benchmark. Constrained by time and resources, we report VLA models’ results on two simple tasks from RoboVerse and on grasping tasks with diverse and challenging language instructions. We split the 58 objects in GraspNet into three sets, each containing progressively more challenging objects based on their geometry.

XIII-C Others

Beyond keyboard and smartphone controls, our system incorporates support for DualSense Joysticks and VR controllers. The DualSense joystick provides ergonomic advantages and high-fidelity analog inputs for nuanced velocity control, mapping triggers and joysticks seamlessly to robot motion. The VR interface enhances spatial awareness and precision by enabling natural gestures and directional cues for control.

Future work could extend VR capabilities by integrating haptic feedback to improve user immersion and task accuracy. Additionally, the modular design of our system facilitates the integration of emerging input devices with minimal development effort.

XIV Real2Sim Toolset for Asset and Task Generation

XIV-A Overview

The Real2Sim toolset, specifically Video2URDF, provides a systematic pipeline to reconstruct environment geometry and robotic assets from monocular video input. By leveraging advanced reconstruction techniques, this pipeline produces meshes and unified robot descriptions that can be used in simulation-based experiments. In doing so, it helps bridge the gap between real-world data and simulated environments, enabling more accurate and comprehensive benchmarking [71].

XIV-B Components

XIV-B1 Gaussian Splatting Reconstruction

The first step in the pipeline involves Gaussian splatting [53], which converts monocular video frames into a set of Gaussian kernels for rendering [133]. This representation captures key scene features such as depth, color, and collision boundaries in a compact and efficient way. As a result, it provides a visually faithful preview of the scene and serves as an intermediate step before detailed mesh reconstruction.

XIV-B2 Mesh Reconstruction

Once the high-level scene structure is represented by Gaussian splatting, we perform mesh reconstruction using TSDF extraction [136, 131, 132, 45] to obtain a more precise geometric model. This step recovers the meshes of:

We use a vision-language model (VLM) and available CAD design information to generate a unified URDF (or MJCF) description for these components. This division of the workspace follows the notion of worldconfig in cuRobo [110], ensuring that each element of the scene (robot, object, environment) is cleanly separated and can be easily adapted or replaced as needed.

XIV-B3 Loading the URDF into the Simulation Environment

After the URDF (or MJCF) files are generated, the final step is to import them into a simulator, such as MuJoCo [114] in RoboVerse. This allows researchers to configure tasks that accurately reflect real-world scenarios, forming a benchmark for training and evaluating robotic manipulation algorithms. The resulting simulated environment benefits from high-fidelity geometry and a consistent representation of the physical workspace.

XIV-B4 Real-to-Sim Boosts Sim-to-Real Performance

We train a model on assets from our real2sim module and compare it with DexGraspNet [137], achieving an 80% success rate compared to the 50% baseline from DexGraspNet. We use our real2sim assets in physics-based simulations that closely replicate real-world grasping conditions, enabling robust grasp execution. See Fig. 20 for visualization.

Figure 20: Visualization of our real2sim pipeline for robotic grasping.

XIV-C Limitations and Challenges

While the Real2Sim pipeline effectively reconstructs most of the relevant geometry, it struggles with completely unseen meshes and complex material properties [142]. Furthermore, parameters such as friction and mass are inherently difficult to estimate purely from visual data, introducing uncertainties that may affect simulation fidelity. Despite these challenges, Real2Sim offers a powerful approach to rapidly generating simulation-ready assets for benchmarking in robotic manipulation tasks.

XV Domain Randomization

XV-A Scene Randomization

For scene randomization, we curate 3D simulatable scene assets from existing 3D scene datasets [30, 36, 21, 49]. Specifically, we convert all assets to the USD format for integration. Additionally, we employ the articulated scene generation method PhyScene [130] to create realistic scenes with articulated objects and mix the generated room-level scenes with house-level 3D scenes like ProcTHOR for greater diversity. We replay demonstrations in these scenes by selecting surfaces (e.g., floors, tables) that provide sufficient workspace, guided by heuristic-based spatial constraints, following [36].

XV-B Visual Material Randomization

Attaching random visual materials to object surfaces is optional. Visual materials are randomly selected from a curated subset of ARNOLD [36] and vMaterials [87], providing more than 300 high-quality visual material candidates. Additionally, users can randomize the reflection properties of a given visual material by setting roughness, specular, and metallic to random numbers between 0 and 1.
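As a rough illustration of the reflection-property randomization described above, the following sketch draws roughness, specular, and metallic uniformly from [0, 1]. The function name `sample_material`, the dictionary keys, and the placeholder material names are hypothetical and are not part of the RoboVerse API.

```python
import random

def sample_material(candidates, rng, randomize_reflection=True):
    """Pick a visual material and optionally randomize its reflection properties."""
    material = {"name": rng.choice(candidates)}
    if randomize_reflection:
        # Each reflection property is drawn uniformly from [0, 1].
        for key in ("roughness", "specular", "metallic"):
            material[key] = rng.uniform(0.0, 1.0)
    return material

rng = random.Random(0)
# Placeholder names standing in for the curated ARNOLD / vMaterials subset.
candidates = [f"material_{i:03d}" for i in range(300)]
material = sample_material(candidates, rng)
```

Passing a seeded `random.Random` instance keeps the randomization reproducible across runs.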

XV-C Light Randomization

Two lighting configurations are supported: distant light and cylinder light arrays. For distant lighting, the polar angle of the light source is randomized. For cylinder lighting, a randomly generated $n \times m$ matrix of cylinder lights, each with a randomized size, is added at a fixed height above the agents. In both configurations, the intensity and color temperature of the lights are randomized within physically plausible ranges.
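The two lighting configurations could be sampled roughly as follows. This is only a sketch: the function names, dictionary keys, grid spacing, fixed height, and all numeric ranges are assumptions chosen for illustration, since the text does not state the actual values.

```python
import random

def sample_distant_light(rng):
    """Distant light: randomize polar angle, intensity, and color temperature."""
    return {
        "polar_angle_deg": rng.uniform(0.0, 60.0),            # assumed range
        "intensity": rng.uniform(500.0, 3000.0),              # assumed range
        "color_temperature_k": rng.uniform(2500.0, 7500.0),   # assumed range
    }

def sample_cylinder_lights(rng, max_rows=4, max_cols=4, height=2.5, spacing=1.0):
    """Cylinder lights: a random n x m grid placed at a fixed height."""
    n, m = rng.randint(1, max_rows), rng.randint(1, max_cols)
    lights = []
    for i in range(n):
        for j in range(m):
            lights.append({
                "position": (i * spacing, j * spacing, height),       # fixed height
                "radius": rng.uniform(0.02, 0.10),                    # randomized size
                "length": rng.uniform(0.5, 1.5),                      # randomized size
                "intensity": rng.uniform(5000.0, 30000.0),            # assumed range
                "color_temperature_k": rng.uniform(2500.0, 7500.0),   # assumed range
            })
    return n, m, lights
```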

XV-D Camera Randomization

A total of 59 candidate camera poses are carefully selected, with the majority oriented to face the robot directly and a smaller subset positioned at side-facing angles.

XVI Navigation and Locomotion Tasks

XVI-A Navigation Tasks

To integrate vision-and-language navigation into Isaac Sim, we first correct the error-containing instructions by refining incorrect punctuation and grammar using ChatGPT. Next, we validate the ground truth trajectory by sweeping the robot’s 3D model (based on the ground truth trajectory) through the scene. The trajectory is deemed invalid if collisions occur between the robot and the scene. Additionally, we adopt the same evaluation metrics as VLN-CE [58]. For controlling the robot, we provide two different types of mobile embodiments, including a Unitree Go2 robot dog and a JetBot wheeled robot, making our task suitable for a variety of policies (with different navigation capabilities).

Figure 21: Navigation gallery. We deploy the Unitree Go2 robot within Matterport 3D environments. The robot is tasked with navigating the environment based on provided instructions.

XVI-B Humanoid Tasks

We migrated the data samples from the Humanoid-X dataset [80], and re-implemented the inference pipeline of UH-1 [80] in our framework. We use the Unitree-H1-2 humanoid robot as the simulated embodiment and set up the locomotion and humanoid pose control task in our framework. The humanoid pose control task is to control the humanoid robot to follow given human poses while maintaining its stability on the ground. The demonstrated poses in our framework include arms crossing, boxing, dancing, left and right punch, playing violin, playing guitar, praying, waving to a friend, etc. Our pretrained policy can successfully follow the demonstrated pose to control a humanoid robot while maintaining stable locomotion in IsaacGym, and also obtains decent performance in IsaacLab. The humanoid environment and task configurations are highly flexible and scalable, and we are able to support more humanoid pose control tasks from Humanoid-X without modifying the infrastructure.

XVI-C HumanoidBench

HumanoidBench [106] is a high-dimensional simulated benchmark designed to accelerate research in humanoid robot learning, focusing on whole-body locomotion and manipulation tasks. The benchmark features a humanoid robot equipped with dexterous hands, enabling a wide range of complex interactions in human-like environments.

Tasks and Assets: We migrate three fundamental locomotion tasks: run, walk, and stand. These tasks are designed to test the robot’s ability to maintain balance, achieve forward motion, and stabilize in a standing position. The primary robot model used is the Unitree H1, augmented with two dexterous Shadow Hands, though the environment supports other humanoid models such as Unitree G1 and Agility Robotics Digit.

Demonstrations: While HumanoidBench does not provide pre-collected demonstrations, it supports the use of reinforcement learning algorithms to generate task-specific policies. The benchmark is designed to facilitate learning from scratch, with dense and sparse reward structures to guide the learning process.

Success Checkers: Each task in HumanoidBench is equipped with a success checker that evaluates task completion based on predefined criteria. For example, in the walk task, success is determined by the robot’s ability to maintain a forward velocity of 1 m/s without falling, while in the stand task, success is measured by the robot’s ability to maintain a stable upright posture for a specified duration.
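To make the two example criteria concrete, here is a minimal sketch of what such success checkers might look like. The function names, the use of head height as a fall detector, and every threshold other than the 1 m/s target speed are assumptions for illustration; they are not taken from HumanoidBench or RoboVerse.

```python
def walk_success(forward_velocities, head_heights, target_speed=1.0,
                 speed_tolerance=0.1, min_head_height=1.0):
    """True if the robot never falls and its mean forward speed reaches the target."""
    if not forward_velocities or len(forward_velocities) != len(head_heights):
        return False
    # A fall is approximated here by the head dropping below a height threshold.
    fell = any(h < min_head_height for h in head_heights)
    mean_speed = sum(forward_velocities) / len(forward_velocities)
    return (not fell) and mean_speed >= target_speed - speed_tolerance

def stand_success(head_heights, required_steps, min_head_height=1.0):
    """True if the robot stays upright for `required_steps` consecutive steps."""
    upright_steps = 0
    for h in head_heights:
        # Reset the counter whenever the robot drops below the threshold.
        upright_steps = upright_steps + 1 if h >= min_head_height else 0
        if upright_steps >= required_steps:
            return True
    return False
```

Counting consecutive steps rather than accumulating elapsed time avoids floating-point drift when the required duration is an exact multiple of the control step.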

Figure 22: Learning curves of RL algorithms on the migrated HumanoidBench tasks. We also run PPO in the Isaac Sim handler in RoboVerse, but it is not visible in the plot since it only achieves very low returns.

Figure 23: Demonstration of TD-MPC2 policies trained in the RoboVerse MuJoCo simulator on the Walk and Stand tasks migrated from the HumanoidBench benchmark.

Experiment and Results: We trained the walk, stand, and run tasks in both the RoboVerse MuJoCo and Isaac Sim handlers using the PPO and TD-MPC2 [41, 42] algorithms, and compared the results with the HumanoidBench baseline based on the original MuJoCo environment. As shown in Fig. 22 and Fig. 23, the training curves from the RoboVerse MuJoCo handler eventually converged and approached the performance of HumanoidBench, validating the feasibility of the RoboVerse reinforcement learning infrastructure. Additionally, we trained the same tasks in the RoboVerse Isaac Sim handler with identical configurations. While training efficiency in Isaac Sim was comparatively lower under non-parallelized settings (to maintain configuration consistency), it still demonstrated a clear upward trend in reward accumulation. This confirms the rapid migration capability of the MetaSim framework and highlights its potential to enable sim-to-sim learning while leveraging the strengths of different simulators, such as Isaac Sim’s support for GPU-accelerated large-scale parallel training.

XVII RoboVerse Benchmark Setup Details

XVII-A Generalization Levels

To systematically evaluate the generalization capability of a robot policy, we establish a benchmark based on a carefully curated asset set designed for domain randomization. This asset set encompasses a diverse range of environmental factors, including materials, textures, lighting conditions, scene configurations, and camera perspectives. By leveraging this set, we assess how well different policies generalize to unseen conditions. Specifically, we split the available assets into a 9:1 ratio for training and testing, ensuring that the testing environment contains novel variations not encountered during training. Below, we detail the key components of this domain randomization setup:

By integrating these domain randomization techniques into our benchmark, we create a controlled yet diverse testing environment that challenges the generalization ability of different robot policies. This setup ensures that trained policies are not merely overfitting to a limited set of conditions but are instead capable of adapting to a broader range of real-world variations.

XVII-B RoboVerse Benchmark Protocol

We rigorously design a training and evaluation protocol to ensure a structured and reliable assessment of the policy’s performance. Given the training data, the policy learns to imitate the demonstrated behavior. For evaluation, we provide a standardized API that enables systematic assessment. As mentioned earlier, the training and evaluation follow a 9:1 ratio, ensuring that the policy is tested on novel scenarios not encountered during training.
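The 9:1 split can be illustrated with a short sketch. The helper name `split_assets`, the fixed seed, and the placeholder asset names are hypothetical; the point is only that the held-out assets are disjoint from those seen in training.

```python
import random

def split_assets(assets, train_ratio=0.9, seed=0):
    """Shuffle assets deterministically and split them into disjoint train/test sets."""
    shuffled = list(assets)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_ratio))
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder asset names; any list of materials, scenes, or camera poses works.
materials = [f"material_{i:03d}" for i in range(300)]
train, test = split_assets(materials)
```

In this sketch the split would be applied separately to each asset category (materials, scenes, lights, cameras), so that every randomization axis has unseen test variations.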

XVIII Policy Training Details

XVIII-A Implementation Details

For specialist models, we train from scratch with actions in the 9-dim robot joint state space. Diffusion Policy [13] is implemented based on its original framework. We search several key hyperparameters, including observation and prediction length, to optimize performance for our tasks. ACT [141] is implemented with the original architecture and hyperparameters, except that the batch size has been increased to 512, with the learning rate correspondingly increased to $1e-4$ to accelerate convergence. We train ACT on one A100 GPU for 2000 epochs and evaluate with the best checkpoints on the validation set.

For generalist models, the action is pre-processed into delta end-effector position space from absolute end-effector position space, and the gripper action is binarized to $\{0, +1\}$. Owing to the lack of time and resources, we are only able to fine-tune the generalist models in the single-task setting. For each task, OpenVLA [56] is LoRA [44] fine-tuned (rank $=32$) with 8 A100 GPUs under official settings to convergence and reaches over 95% action token accuracy as proposed by [56] during the training stage. During evaluations, we employ cuRobo [110] as the inverse-kinematics solver to transform the action to robot joint state space.
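The action pre-processing for generalist models can be sketched as below. This is an illustrative assumption rather than the actual pipeline: the function name, the use of gripper width as the raw gripper signal, the 0.04 threshold, and the convention that 1 means open are all hypothetical, and orientation is omitted for brevity.

```python
def to_delta_actions(ee_positions, gripper_widths, open_threshold=0.04):
    """Convert absolute end-effector positions to per-step deltas and binarize the gripper."""
    actions = []
    for t in range(1, len(ee_positions)):
        prev, curr = ee_positions[t - 1], ee_positions[t]
        # Delta action: displacement of the end effector between consecutive steps.
        delta = tuple(c - p for c, p in zip(curr, prev))
        # Binarized gripper command in {0, 1} (assumed: 1 = open, 0 = closed).
        gripper = 1 if gripper_widths[t] > open_threshold else 0
        actions.append((delta, gripper))
    return actions

positions = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.25), (0.5, 0.5, 0.25)]
widths = [0.08, 0.08, 0.0]
actions = to_delta_actions(positions, widths)
```

A trajectory of T absolute poses yields T - 1 delta actions, which the inverse-kinematics solver then maps back to joint space at evaluation time.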

XVIII-B Diffusion Policy

We implemented the training and validation code for Diffusion Policy based on the requirements of our tasks and relevant research papers.

Modeling Diffusion Policy as Denoising Diffusion Probabilistic Models (DDPMs), we train a noise predictor network:

$$\widehat{\epsilon^{k}} = \epsilon_{\theta}\left(a^{k}, s, k\right) \tag{1}$$

that takes in noisy actions $a^{k}$, current observations $s$, and the denoising iteration $k$, and predicts the noise $\widehat{\epsilon^{k}}$.

For the observation $s$, we use ResNet18 to extract scene-image features $f_{img}$ and a 3-layer MLP to extract robot joint-state features $f_{robot}$. The concatenation of $f_{img}$ and $f_{robot}$ serves as the conditioning input for Diffusion Policy.

During training, we randomly choose a denoising step $k$ and sample noise $\epsilon^{k}$, which is added to the unmodified sample $a^{0}$. Our training loss is the mean squared error between $\epsilon^{k}$ and the predicted noise:

$$L_{DP} = \mathrm{MSELoss}\left(\epsilon^{k}, \widehat{\epsilon^{k}}\right) \tag{2}$$

At inference time, our policy starts from random actions $a^{K}$ and denoises for $K$ steps to obtain the final action predictions. At each step, the action is updated following:

$$a^{k-1} = \alpha\left(a^{k} - \gamma\,\epsilon_{\theta}\left(a^{k}, s, k\right) + \mathcal{N}\left(0, \sigma^{2} I\right)\right) \tag{3}$$

where $\alpha$, $\gamma$, and $\sigma$ are hyperparameters.
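The training loss of Eq. (2) and the update rule of Eq. (3) can be sketched in a few lines of plain Python. Several pieces are assumptions made only so the example runs: the number of steps `K`, the linear noise schedule, the standard DDPM forward process used to build the noisy action, the mapping from that schedule to the step-dependent values of α, γ, and σ, and the toy noise predictor standing in for the trained network.

```python
import math
import random

K = 50  # number of denoising iterations (assumed)
betas = [1e-4 + (0.02 - 1e-4) * k / (K - 1) for k in range(K)]  # linear schedule (assumed)
alphas = [1.0 - b for b in betas]
alpha_bars, prod = [], 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)  # cumulative product of alphas

def toy_noise_predictor(noisy_action, obs, k):
    """Stand-in for epsilon_theta(a^k, s, k); a real policy uses a trained network."""
    return [0.1 * a for a in noisy_action]

def training_loss(a0, obs, rng, eps_theta=toy_noise_predictor):
    """Eq. (2): MSE between sampled noise and predicted noise at a random step k."""
    k = rng.randrange(K)
    eps = [rng.gauss(0.0, 1.0) for _ in a0]
    # Standard DDPM forward process: blend the clean action a^0 with the noise.
    noisy = [math.sqrt(alpha_bars[k]) * a + math.sqrt(1.0 - alpha_bars[k]) * e
             for a, e in zip(a0, eps)]
    eps_hat = eps_theta(noisy, obs, k)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(a0)

def sample_action(obs, dim, rng, eps_theta=toy_noise_predictor):
    """Eq. (3): start from Gaussian noise a^K and denoise for K steps."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in reversed(range(K)):
        alpha = 1.0 / math.sqrt(alphas[k])
        gamma = betas[k] / math.sqrt(1.0 - alpha_bars[k])
        sigma = math.sqrt(betas[k]) if k > 0 else 0.0  # no noise on the final step
        eps_hat = eps_theta(a, obs, k)
        # Same form as Eq. (3): the Gaussian term sits inside the alpha scaling.
        a = [alpha * (x - gamma * eh + rng.gauss(0.0, sigma))
             for x, eh in zip(a, eps_hat)]
    return a
```

In an actual policy, `obs` would be the concatenated image and joint-state features, and the action would typically be a short sequence of 9-dim joint targets rather than a single vector.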

Figure 24: Visualization of Sim-to-Sim-to-Real Experiments.

XIX World Model Details

XIX-A Methodology

We adopt a video generation framework based on Latte [74]—a transformer-driven latent diffusion model equipped with an efficient spatial-temporal attention mechanism. For action conditioning, we use frame-level Adaptive Layer Normalization [92] (AdaLN), following insights from IRASim [144] that show more precise control of the gripper with frame-level conditioning compared to video-level conditioning.

In the forward pass, raw video frames are encoded using a frozen autoencoder from Stable Diffusion [93]. The first frame serves as the initial condition, while noise is introduced into the latent representation of subsequent frames during training. Both the noise schedule and action conditions (gripper states with either Cartesian position plus orientation or joint position) are encoded by separate MLPs into latent space and then added together.

These noisy latent frames are then fed into a transformer composed of alternating spatial and temporal attention blocks, where action conditions are applied at each frame via AdaLN. For inference, we employ DDIM [109] as a denoising scheduler, using 200 sampling steps.
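Frame-level AdaLN conditioning can be illustrated with a small pure-Python sketch. It assumes the common DiT-style modulation, in which normalized tokens are multiplied by (1 + scale) and offset by shift, with scale and shift regressed from each frame's own condition vector. The function names and the single linear layer used for that regression are hypothetical simplifications, not the model's actual layers.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, bias, x):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def frame_level_adaln(tokens_per_frame, cond_per_frame, w_scale, b_scale, w_shift, b_shift):
    """Modulate each frame's tokens with scale/shift computed from that frame's condition."""
    out = []
    for tokens, cond in zip(tokens_per_frame, cond_per_frame):
        scale = linear(w_scale, b_scale, cond)  # one scale per channel, per frame
        shift = linear(w_shift, b_shift, cond)  # one shift per channel, per frame
        out.append([
            [n * (1.0 + s) + t for n, s, t in zip(layer_norm(tok), scale, shift)]
            for tok in tokens
        ])
    return out
```

Video-level conditioning would instead compute a single scale and shift for the whole clip; computing them per frame lets every frame follow its own action.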

XIX-B Data Preparation

The DROID [54] dataset’s episodes typically last from 120 to 360 frames. To amplify motion, we skip every 6 frames, effectively reducing the frame rate to 4 fps with sequence lengths from 20 to 60. In the RoboVerse simulation, we adjust the control frequency so that most episodes span 20 to 60 frames, mirroring the number of frames of DROID in one episode. We filter out any sequence shorter than 20 or longer than 60 frames, resulting in about 50,000 unique episodes from DROID.
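The subsampling and length filtering can be sketched as follows. The helper names are hypothetical, and "skip every 6 frames" is interpreted here as keeping one frame out of every six, which maps 120 to 360 raw frames onto 20 to 60 subsampled frames.

```python
def subsample_episode(frames, stride=6):
    """Keep every `stride`-th frame, starting from the first one."""
    return frames[::stride]

def filter_episodes(episodes, min_len=20, max_len=60, stride=6):
    """Subsample each episode and keep those whose length lies in [min_len, max_len]."""
    kept = []
    for frames in episodes:
        sub = subsample_episode(frames, stride)
        if min_len <= len(sub) <= max_len:
            kept.append(sub)
    return kept

# Toy episodes represented by frame indices; real episodes would hold images.
episodes = [list(range(n)) for n in (100, 120, 360, 400)]
kept = filter_episodes(episodes)
```

Only the 120-frame and 360-frame toy episodes survive here; the 100-frame and 400-frame ones fall outside the 20 to 60 range after subsampling.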

Due to time and resource constraints, we generate only 50,000 unique RoboVerse episodes. We plan to use the full-scale RoboVerse dataset to train more capable world models in future work.

We exclude the gripper camera view because the model struggles with drastic camera pose changes, which leads to poor frame generation quality. Since we consider left and right camera views as separate samples, each dataset effectively doubles to 100,000 samples.

Figure 25: Visualization of ground truth and predicted frames by models conditioned on Cartesian position (plus orientation) and joint position.

XIX-C Experiments

Our experiments involve training on three datasets, DROID-50K, RoboVerse-50K, and DROID-RoboVerse-100K, using 8 NVIDIA H100 GPUs. We use a spatial resolution of 240×320 and sequences of 16 frames per episode. Starting with a model of 100M parameters and a batch size of 16, training converges at around 100K steps on RoboVerse and 200K steps on DROID.

We first compare Cartesian position plus orientation with joint positions as action conditions and find that joint positions yield more precise gripper movement control in frame generation, as shown in Fig. 25. We believe this is because joint positions are a less ambiguous representation of the robot state than Cartesian position plus orientation.

However, generation quality remains suboptimal when training on the DROID-50K or DROID-RoboVerse-100K datasets and validating on DROID samples, owing to the complexity of DROID scenes. Scaling the model to 500M parameters and reducing the batch size to 8 leads to better preservation of object geometry and better prediction of robot arm movement.

As discussed in the main paper, although the larger model trained on DROID-RoboVerse-100K shows an improved understanding of object shapes in DROID samples compared to the model trained on DROID-50K, it still struggles with intricate real-world physics. In contrast, training with RoboVerse-50K or DROID-RoboVerse-100K and validating on RoboVerse scenes produces more physically and geometrically consistent predictions.

We believe this is because RoboVerse offers cleaner backgrounds and more comprehensive views of the robotic arm, and applies domain randomization and augmentation. By comparison, many DROID frames contain cluttered backgrounds or incomplete arm visibility, creating challenges for learning robust temporal dynamics from raw pixels.