Tao Gui (桂韬)

I work at the intersection of complex reasoning and AI agents for large language models. My group builds systems that can reason, plan, self-evolve, and act in diverse environments — and we study how to align them with human values through reinforcement learning.

I co-lead the NLP-LI lab at Fudan with Qi Zhang, within the group of Xuanjing Huang. I received my Ph.D. from Fudan (2021) and my B.S. from NUDT.

LLM Reasoning · AI Agents · RLHF & Alignment · Multimodal LLM

Agents & Environments

Preprint 2026

A systematic benchmark evaluating how well LLMs learn from contextual demonstrations, covering diverse task types and context configurations.

ICLR 2026

An open-source RL framework for training agents on long-horizon tasks via multi-turn interactions, extending the AgentGym ecosystem with scalable reinforcement learning.

Preprint 2025

A comprehensive survey on memory for foundation model agents — proposing a unified taxonomy of short-term, long-term, and parametric memory, and charting the path toward persistent, capable agents.

Preprint 2025

An AI-powered academic search engine combining intelligent retrieval, paper understanding, and multi-document synthesis to help researchers navigate the literature.

Preprint 2025

A unified ecosystem for constructing large-scale training environments, enabling systematic agent capability building through scalable environment generation and curriculum design.

Preprint 2025

Automatically assesses the novelty of scientific submissions via retrieval-augmented verification, producing traceable, evidence-backed judgments for peer review.

ACL 2025

A framework for training generally capable agents across 14 environments and 89 tasks. Agents autonomously explore and evolve without step-by-step supervision, achieving an 82.4% average success rate.

Preprint 2023

One of the earliest and most comprehensive surveys on LLM-based agents — covering perception, reasoning, planning, and action — widely cited as a foundational reference in the agent research community.

Alignment & Reinforcement Learning

Preprint 2025

Reveals that models trained to distinguish reference from target policies can serve as general-purpose reward models — a new paradigm for scalable RLHF without explicit reward annotation.

Preprint 2025

Balances positive and negative token contributions with adaptive clipping bounds to stabilize off-policy RL training for language models, maintaining entropy while improving learning efficiency.

Preprint 2024

A mixture-of-experts framework using multiple LoRA adapters with a learned router to preserve world knowledge during instruction tuning — keeping broad capabilities while gaining task-specific skills.

Preprint 2024

Tackles two fundamental problems in reward modeling: handling noisy human preference data and improving training robustness. Proposes methods to detect and correct label noise in preference annotations.

Preprint 2023

A pioneering empirical study of PPO instability in RLHF. Introduces PPO-max, which uses a token-level KL penalty for consistent alignment training. Open-sourced as MOSS-RLHF.

Multimodal & Systems

Preprint 2026

Enables humanoid robots to execute complex whole-body movements from natural language commands, bridging language understanding and motor control.

AAAI 2026

Converts standard multi-head attention into DeepSeek's economical multi-head latent attention for any transformer, dramatically reducing KV-cache and inference cost without retraining.

Preprint 2025

A reproducible framework for generating full-length songs with fine-grained control over style, structure, and instrumentation — supporting long-form coherent music creation.

Preprint 2024

An out-of-the-box multi-language sandbox providing unified feedback from compilers and static analysis tools, enabling LLMs to write, execute, and debug code across languages.

ACL 2021

A comprehensive toolkit for evaluating model robustness through automated transformations, adversarial attacks, and subpopulation analysis across 13 NLP tasks and multiple languages.

Full publication list on Google Scholar →
