GitHub - WooooDyy/AgentGym-RL: Code and implementations for the paper "AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning" by Zhiheng Xi et al.

📃 Paper • 🌐 Project Page • 🤗 AgentGym-RL-Data-ID

AgentGym-RL is a new framework to train LLM agents for multi-turn interactive decision-making through RL. It encompasses a wide variety of real-world scenarios and supports mainstream RL algorithms. Extensive experiments show that our framework and method substantially enhance an open-source 7B-scale model to a level that matches or surpasses commercial models on 27 tasks across diverse environments.

🔔 News

🌟 Overview

Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Merely relying on human demonstrations for behaviour cloning can make agents competent at tasks, but rarely leads to genuine breakthroughs. As Richard Sutton emphasizes, it is the knowledge, skills and experience acquired through exploration and interaction with the environment that truly drive agents forward. Therefore, a promising approach is to train these agents using Reinforcement Learning.

Most existing studies remain limited to single-turn tasks like math and coding. Recent attempts to extend RL to train LLM agents with multi-turn capabilities face notable challenges:

To address these challenges, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. It encompasses a wide variety of real-world scenarios and supports mainstream RL algorithms, establishing a foundation for the research and practice in the era of experience.

Furthermore, to tackle the exploration–exploitation trade-off and improve optimization stability in agent RL training, we propose ScalingInter-RL, a method that progressively extends the agent–environment interaction horizon during training. Experiments across different environments show that leveraging our AgentGym-RL framework with the ScalingInter-RL algorithm yields stable, sustained and substantial behavioral improvement.

In addition, to facilitate probing of data and model behaviors, we provide a visualized interactive user interface that allows for the replay and examination of full interaction trajectories, thereby streamlining empirical analysis for iterative development.

Features

Modular System Design of AgentGym-RL

We adopt a modular and decoupled design to implement AgentGym-RL, organizing it into three main components:

Environments

Post-Training Strategies

AgentGym-RL supports a suite of mainstream online RL algorithms: PPO, GRPO, RLOO, REINFORCE++.
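To make the difference between two of these algorithms concrete, the sketch below shows how GRPO and RLOO turn the rewards of several trajectories sampled for the same task into per-trajectory advantages. It illustrates the commonly used formulas only and is not taken from the AgentGym-RL code; the function names, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """RLOO-style advantage: subtract the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per task")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


if __name__ == "__main__":
    # Hypothetical episode rewards for four rollouts of the same task.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(grpo_advantages(rewards))
    print(rloo_advantages(rewards))
```

Both variants avoid a learned value function by using the other samples in the group as a baseline, whereas PPO relies on a critic to estimate that baseline.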

Beyond online RL, AgentGym-RL also supports a broad range of complementary training paradigms: SFT, DPO, AgentEvol.

ScalingInter-RL: Progressive Scaling Interaction for Agent RL

ScalingInter-RL is a training approach designed to balance exploration and exploitation while ensuring stable optimization. At its core is a progressive horizon-scaling strategy that adaptively adjusts the number of interaction turns during RL.

We start training with a smaller horizon, allowing the agent to efficiently exploit its policy and gain early proficiency on simple tasks. This establishes the groundwork for deeper, long-horizon reasoning. As training progresses, we gradually extend the horizon, enabling the agent to explore longer decision paths and fostering the emergence of higher-order cognitive behaviors.
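The progressive schedule described above can be sketched as a small function that maps a training step to the maximum number of interaction turns. This is a minimal illustration under the assumption that the horizon moves to the next value after a fixed number of training steps; the function name `horizon_at` and its parameters are hypothetical and do not come from the AgentGym-RL code.

```python
def horizon_at(step, horizons, steps_per_phase):
    """Return the maximum number of interaction turns allowed at a training step.

    The horizon starts at horizons[0], moves to the next entry every
    `steps_per_phase` training steps, and stays at the last entry afterwards.
    """
    if step < 0 or steps_per_phase <= 0 or not horizons:
        raise ValueError("invalid schedule")
    phase = min(step // steps_per_phase, len(horizons) - 1)
    return horizons[phase]


if __name__ == "__main__":
    # Hypothetical schedule: 10 turns, then 20, then 30, switching every 100 steps.
    for step in (0, 99, 100, 250, 1000):
        print(step, horizon_at(step, [10, 20, 30], 100))
```

With this schedule, steps 0 to 99 allow 10 turns, steps 100 to 199 allow 20 turns, and every later step allows 30 turns.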

Extending Verl

We make the following modifications to verl in order to develop AgentGym-RL:

  1. Rollout using the vLLM engine: To support multi-turn rollouts and efficient interaction with the environment, we introduce:
    • RolloutHandler to handle trajectories. RolloutHandler correctly computes the attention masks, loss masks, position ids and sequence ids for the environment observations and the assistant's actions in each turn. It also handles historical messages, status and reward.
    • EnvClient to handle interactions. The EnvClient provides several methods to facilitate interaction with the environment during rollout, such as observation() to get the current observation from the environment, available_actions() to get the currently available actions, step() to perform an action, and reset() to reset the environment. To improve efficiency, our framework initializes environments and collects trajectories in parallel.
  2. Advantage computation: We revise verl's implementation of advantage computation for REINFORCE++ and GAE to ensure correctness in both single-turn and multi-turn scenarios.
  3. Scaling interaction during training: To develop ScalingInter-RL, we introduce RoundScheduler to scale interactions during training. The FixedRoundsScheduler enforces a fixed maximum number of interactions. The StepRoundsScheduler gradually increases the interaction horizon in a step-wise manner, enabling progressive scaling during training.
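The interplay between an environment client, a multi-turn rollout and a loss mask can be illustrated with a toy example. The sketch below is not the real RolloutHandler or EnvClient: `ToyEnvClient`, `rollout`, `counting_policy`, the `(reward, done)` return value of `step()`, and the whitespace "tokenizer" are all hypothetical stand-ins. It only shows the idea that observation tokens are masked out of the loss while action tokens are kept, and that the maximum number of rounds bounds the trajectory.

```python
class ToyEnvClient:
    """Hypothetical stand-in for an environment client (not the real EnvClient)."""

    def __init__(self, target=3):
        self.target = target
        self.count = 0

    def reset(self):
        self.count = 0

    def observation(self):
        return f"count={self.count}"

    def available_actions(self):
        return ["inc", "stop"]

    def step(self, action):
        """Apply an action and return (reward, done)."""
        if action == "inc":
            self.count += 1
            return 0.0, False
        return (1.0 if self.count == self.target else 0.0), True


def rollout(env, policy, max_rounds):
    """Collect one trajectory of at most `max_rounds` turns.

    Returns the token list, a parallel loss mask (1 = action token produced
    by the agent, 0 = observation token from the environment), and the reward.
    """
    env.reset()
    tokens, loss_mask, reward = [], [], 0.0
    for _ in range(max_rounds):
        obs = env.observation()
        obs_tokens = obs.split()  # whitespace split stands in for a tokenizer
        tokens += obs_tokens
        loss_mask += [0] * len(obs_tokens)  # never train on environment text

        action = policy(obs, env.available_actions())
        act_tokens = action.split()
        tokens += act_tokens
        loss_mask += [1] * len(act_tokens)  # train only on agent output

        reward, done = env.step(action)
        if done:
            break
    return tokens, loss_mask, reward


def counting_policy(obs, actions):
    """Toy policy: increment until the counter reads 3, then stop."""
    return "stop" if obs == "count=3" else "inc"


if __name__ == "__main__":
    tokens, mask, reward = rollout(ToyEnvClient(), counting_policy, max_rounds=5)
    print(tokens)
    print(mask)
    print(reward)
```

In this toy task the agent needs four turns to succeed, so a rollout capped at three rounds ends with zero reward while a cap of five rounds reaches the reward of 1.0. This is the kind of dependence on the interaction horizon that a round scheduler controls.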

Performance

We use Qwen2.5-3B and Qwen2.5-7B as our primary backbone models. We evaluate AgentGym-RL and ScalingInter-RL across five scenarios and include multiple closed-source and open-source models for comparison. The evaluation results on the WebArena benchmark are as follows, while results on other benchmarks can be found in our paper.

Moreover, ScalingInter-RL demonstrates more stable and efficient training dynamics during RL optimization, as shown in the figure below.

Running Tutorial

Environment Setup

We recommend using CUDA 12.4, PyTorch 2.4, and Python 3.10. First, install the requirements using the following command:

echo "Preparing environment for agentgym-rl..."
conda create -n agentgym-rl python==3.10 -y
conda activate agentgym-rl
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124

# install flash-attention
FLASH_ATTENTION_URL="https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
FLASH_ATTENTION_NAME="flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
wget -q "$FLASH_ATTENTION_URL" -O "$FLASH_ATTENTION_NAME"
pip3 install "$FLASH_ATTENTION_NAME"
rm -f "$FLASH_ATTENTION_NAME"

# for RL
cd AgentGym-RL
pip3 install -e .

# for agentgym
echo "Preparing environment for agentenv..."
cd AgentGym/agentenv
pip3 install -e .
pip3 install transformers==4.51.3

Training

For SFT, DPO and AgentEvol, please refer to the README.md of AgentGym.

For RL training:

1. Environment Setup

Make sure you have the required environments set up (see Environment Setup section above).

2. Data Preparation

Download the AgentGym-RL-Data-ID dataset from Hugging Face.

3. Launch the environment server

Please launch the environment server by referring to the README.md of AgentGym.

4. Training

Example training scripts for each task, for both AgentGym-RL and ScalingInter-RL, are in examples/train. You can also refer to the training parameters configured in those scripts.

Most explanations of the arguments can be found in the docs of verl. Other key arguments:

See AgentGym-RL/verl/agent_trainer/config/ppo_trainer.yaml for more details.

To launch the AgentGym-RL training, set:

algorithm.rounds_ctrl.type=fixed \
algorithm.rounds_ctrl.rounds=15 \

You can see examples/train/AgentGym-RL/webarena_train.sh as an example.

To launch the ScalingInter-RL training, set:

algorithm.rounds_ctrl.type=scaling_inter_stepwise \
algorithm.rounds_ctrl.steps_scaling_inter=100 \
algorithm.rounds_ctrl.rounds=[10,20,30] \

You can see examples/train/ScalingInter-RL/webarena_train.sh as an example.
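The settings above are dotted `key=value` overrides that are appended to the launch command in the example scripts. As a rough illustration of how such overrides map onto a nested configuration, the sketch below parses them into a dictionary. It is a simplified stand-in written for this README, not the parser that verl actually uses; `parse_overrides` is a hypothetical helper.

```python
import ast


def parse_overrides(overrides):
    """Turn ["a.b=1", "a.c=[1,2]"] into {"a": {"b": 1, "c": [1, 2]}}."""
    config = {}
    for item in overrides:
        key, _, raw = item.partition("=")
        try:
            value = ast.literal_eval(raw)  # numbers, lists, quoted strings
        except (ValueError, SyntaxError):
            value = raw  # bare words such as `fixed` stay strings
        node = config
        *parents, leaf = key.split(".")
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


if __name__ == "__main__":
    cfg = parse_overrides([
        "algorithm.rounds_ctrl.type=scaling_inter_stepwise",
        "algorithm.rounds_ctrl.steps_scaling_inter=100",
        "algorithm.rounds_ctrl.rounds=[10,20,30]",
    ])
    print(cfg["algorithm"]["rounds_ctrl"])
```

All three keys end up under the same `algorithm.rounds_ctrl` node, which matches how they are grouped in the settings listed above.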

Evaluation

1. Environment Setup

Make sure you have the required environments set up (see Environment Setup section above).

2. Data Preparation

Download the AgentGym-RL-Data-ID dataset from Hugging Face.

3. Launch the environment server

Please launch the environment server by referring to the README.md of AgentGym.

4. Evaluation

Example evaluation scripts for each task are in examples/eval. You can also refer to the evaluation parameters configured in those scripts.

To run the evaluation, see examples/eval/webarena_eval.sh for an example.

Most explanations of the arguments can be found in the docs of verl. See AgentGym-RL/verl/agent_trainer/config/generation.yaml for more details.

Visualized user interface

Check here for setup instructions.

Acknowledgement

The Training module of AgentGym-RL is built upon Verl, and the Environment module is built upon AgentGym. We are grateful for their infrastructure support. We also thank TextCraft, BabyAI, SciWorld, WebArena and Search-R1 for open-sourcing their work.

Contact

Citation

Please cite the following paper if you find AgentGym-RL helpful!

@misc{xi2025agentgymrltrainingllmagents,
      title={AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning}, 
      author={Zhiheng Xi and Jixuan Huang and Chenyang Liao and Baodai Huang and Honglin Guo and Jiaqi Liu and Rui Zheng and Junjie Ye and Jiazheng Zhang and Wenxiang Chen and Wei He and Yiwen Ding and Guanyu Li and Zehui Chen and Zhengyin Du and Xuesong Yao and Yufei Xu and Jiecao Chen and Tao Gui and Zuxuan Wu and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang},
      year={2025},
      eprint={2509.08755},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.08755}, 
}
@misc{xi2024agentgymevolvinglargelanguage,
      title={AgentGym: Evolving Large Language Model-based Agents across Diverse Environments}, 
      author={Zhiheng Xi and Yiwen Ding and Wenxiang Chen and Boyang Hong and Honglin Guo and Junzhe Wang and Dingwen Yang and Chenyang Liao and Xin Guo and Wei He and Songyang Gao and Lu Chen and Rui Zheng and Yicheng Zou and Tao Gui and Qi Zhang and Xipeng Qiu and Xuanjing Huang and Zuxuan Wu and Yu-Gang Jiang},
      year={2024},
      eprint={2406.04151},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2406.04151}, 
}