GitHub - Gen-Verse/dLLM-RL: [ICLR 2026] Official code for TraceRL: Revolutionizing post-training for Diffusion LLMs, powering the SOTA TraDo series.

Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

The most comprehensive framework for post-training dLLMs and multimodal dLLMs

🌱 Features

Model Support: TraDo, SDAR, Dream, LLaDA, MMaDA, LLaDA-V, and Diffu-Coder. Almost all open-source discrete diffusion language models are supported.
Diverse Settings: We support deployment, SFT, RL (with optional value model for variance reduction and process reward model for fine-grained supervision), and RLHF across diverse settings (math, coding, multimodal) and different architectures (both full/block attention dLLMs).
Inference Acceleration: improved KV-cache, jetengine (based on nano-vllm), multiple sampling strategies, and multi-node support. It is easy to build your own accelerated inference methods on top.
RL Training: TraceRL (with diffusion value model support), coupled RL, random masking RL, and accelerated sampling, covering math, coding, and general RL tasks, with multi-node support. It is easy to build your own reinforcement learning methods across diverse settings.
SFT: Block SFT, semi-AR SFT, and random masking SFT, with multi-node and long-CoT fine-tuning support.

🧠 RL Methods (TraceRL) & Models (TraDo)

We propose TraceRL, a trajectory-aware reinforcement learning method for diffusion language models, which demonstrates the best performance among RL approaches for DLMs. We also introduce a diffusion-based value model that reduces variance and improves stability during optimization.

Based on TraceRL, we derive a series of diffusion language models, TraDo, which achieve state-of-the-art performance on math and coding reasoning tasks. TraDo-4B-Instruct and TraDo-8B-Instruct are trained solely with TraceRL, while the first long-CoT diffusion language model, TraDo-8B-Thinking, is obtained through a combination of TraceRL and long-CoT data SFT. TraDo models challenge AR models with strong empirical results, as shown in the following table.

You can download and try our model:

from transformers import AutoModelForCausalLM, AutoTokenizer
from generate import block_diffusion_generate

model_name = "Gen-Verse/TraDo-8B-Instruct"

# Load the model and tokenizer (the model uses custom code, hence trust_remote_code)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="float16", device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# "\\boxed" keeps a literal backslash; a bare "\b" would be parsed as a backspace
prompt = "What's the solution of x^2 - 2x + 1 = 0\nPlease reason step by step, and put your final answer within \\boxed{}.\n"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Tokenize the chat-formatted prompt and move the tensors to the model's device
tokens = tokenizer.batch_encode_plus(
    [text], return_tensors='pt', padding=True, truncation=True, max_length=200
)
tokens = {k: v.to(model.device) for k, v in tokens.items()}

# Block-wise diffusion decoding with dynamic (confidence-threshold) unmasking
output_ids = block_diffusion_generate(
    model,
    prompt=tokens,
    mask_id=151669,
    gen_length=200,
    block_length=4,
    denoising_steps=4,
    temperature=1.0,
    top_k=0,
    top_p=1.0,
    remasking_strategy="low_confidence_dynamic",
    confidence_threshold=0.9
)

# Decode and strip leftover mask / end-of-text tokens
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
cleaned_text = output_text.replace('<|MASK|>', '').replace('<|endoftext|>', '')
print(cleaned_text)

📰 Latest Updates

[2026-01-26] Our TraceRL paper has been accepted by ICLR 2026!
[2025-12-07] 🔥 We support RLHF, fine-grained process reward, and multimodal RL/SFT now!
[2025-09-08] We release our models, TraDo-4B-Instruct and TraDo-8B-Instruct, and the long-CoT diffusion language model TraDo-8B-Thinking.
[2025-09-08] We release inference and training (SFT and RL) code compatible with a wide range of diffusion language models, including TraDo, SDAR, Dream, LLaDA, MMaDA, and Diffu-Coder.

🚀 Quick Start

conda create --name dllm-rl python=3.10
source activate dllm-rl
pip install torch==2.6.0
pip install --no-cache-dir \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -r requirements.txt

For multimodal settings, install requirements_v.txt instead; see the multimodal section in ./configs for more details.

⚙️ Data

Navigate to ./data to download datasets for evaluation and training, as in the example below. That directory also contains detailed instructions on how to adapt your own dataset.

cd data
python download_data.py --dataset MATH500
python download_data.py --dataset MATH_train
cd ..

After downloading the data, you are almost ready to evaluate or train diffusion language models. The only remaining step is to select (or create) a config file in ./configs that corresponds to your project, and then use the following commands. Details on how to select and modify (or create) a config file are provided in ./configs.

📊 Inference & Evaluations

After downloading the data, take TraDo models as an example. You can set the configurations in configs/trado_eval.yaml (see instructions and details in ./configs) and run the following commands to perform inference with different sampling strategies.

python eval.py config=configs/trado_eval.yaml

python eval.py config=configs/trado_longcot_eval.yaml

python eval.py config=configs/sdar_eval.yaml

python eval.py config=configs/dream_eval.yaml

python eval.py config=configs/llada_eval.yaml

python eval_v.py config=configs/lladav_eval.yaml

python eval_v.py config=configs/mmada_v_eval.yaml

See ./configs for details.

Use trado_eval.yaml for TraDo models' inference, sdar_eval.yaml for SDAR, dream_eval.yaml for Dream and Diffu-Coder, and llada_eval.yaml for LLaDA and MMaDA. Instructions on how to set the configurations are provided in the corresponding configuration files.
We support both general tasks and coding tasks (including automated execution of code) in evaluation.

There are two main sampling methods you can choose:

Static Sampling: unmask a fixed number of tokens at each step.

Dynamic Sampling: unmask tokens based on a chosen confidence threshold; this is faster than static sampling.
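The difference between the two strategies can be illustrated with a minimal pure-Python sketch. It is not the repository's implementation: the function names `static_unmask` and `dynamic_unmask` are hypothetical, ranking masked positions by confidence is an assumption, and so is the fallback that unmasks at least one token when nothing clears the threshold.

```python
from typing import List


def static_unmask(confidences: List[float], masked: List[int], k: int) -> List[int]:
    """Static sampling sketch: unmask exactly k masked positions per step.

    Assumption: positions are ranked by model confidence, highest first.
    """
    ranked = sorted(masked, key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:k])


def dynamic_unmask(confidences: List[float], masked: List[int], threshold: float) -> List[int]:
    """Dynamic sampling sketch: unmask every masked position whose
    confidence reaches the threshold.

    Assumption: if no position qualifies, the single most confident one is
    unmasked so that decoding always makes progress.
    """
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen and masked:
        chosen = [max(masked, key=lambda i: confidences[i])]
    return sorted(chosen)


# Hypothetical per-position confidences for a block of four masked tokens.
conf = [0.95, 0.40, 0.92, 0.97]
masked = [0, 1, 2, 3]

print(static_unmask(conf, masked, k=2))             # [0, 3]
print(dynamic_unmask(conf, masked, threshold=0.9))  # [0, 2, 3]
```

In this toy block, dynamic sampling reveals three tokens in one step where static sampling with k=2 reveals two, which is the intuition behind dynamic sampling needing fewer steps when the model is confident.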

To see how diffusion language models sample, open ./sample/trace.viewer.html in your browser, or generate a trajectory yourself with ./sample/get_trace_viewer.py.

You can also perform inference across multiple nodes using multinode_eval.py with the same configuration files, with only minor modifications as instructed in the configuration files. In a multi-node setup, the first node controls the others. To run evaluation, execute
python multinode_eval.py config=configs/dream_multinode_eval.yaml on the first node, or submit the following as the entry command for a job:

if [[ ${MLP_ROLE_INDEX:-0} -eq 0 ]]; then
    python multinode_eval.py config=configs/dream_multinode_eval.yaml
else
    exec tail -f /dev/null
fi

python multinode_eval.py config=configs/trado_longcot_multinode_eval.yaml

python multinode_eval.py config=configs/llada_multinode_eval.yaml

python multinode_eval_v.py config=configs/lladav_eval.yaml

python multinode_eval_v.py config=configs/mmada_v_eval.yaml

...

🔧 Reinforcement Learning

After downloading the data and model and setting the configuration, you can start reinforcement learning simply with:

python rl.py config=configs/rl_trado.yaml

python rl.py config=configs/rl_sdar.yaml

python rl.py config=configs/rl_dream.yaml

python rl.py config=configs/rl_llada.yaml

python rl.py config=configs/rl_mmada.yaml

python rl_v.py config=configs/rl_lladav.yaml

python rl_v.py config=configs/rl_mmada_v.yaml

See ./configs for details.

We support TraceRL (optionally with a diffusion-based value model), Coupled RL, and random masking RL across different diffusion language models. The sampling process has been accelerated in all cases by KV-cache.

TraceRL: We optimize the policy based on how it generates sequences. For block-attention models, training can be performed efficiently thanks to block attention. For full-attention models, we introduce a shrinkage parameter, s, that aggregates every s neighboring steps to accelerate training. We also provide an optional value model for TraceRL, which we find reduces variance and improves training stability, so that larger learning rates or fewer gradient accumulation steps can be used more reliably than without a value model.

Random Masking RL: The sampled data are randomly masked and used as training data in RL with a PPO-like objective.

Coupled RL: For each sampled random masking setting, Coupled RL additionally introduces its complement, serving as an extra data sample for training.
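Two of the ideas above can be illustrated with a small pure-Python sketch. It is not the repository's code: the helper names `aggregate_trace_steps` and `coupled_masks` are hypothetical, and chunking the trace into consecutive groups of s steps is one assumed reading of "aggregates every s neighboring steps".

```python
import random
from typing import List, Tuple


def aggregate_trace_steps(num_steps: int, s: int) -> List[List[int]]:
    """Group sampling steps 0..num_steps-1 into chunks of s neighboring steps.

    Assumption: each chunk is treated as one training step, so a trace with
    num_steps steps needs ceil(num_steps / s) training passes.
    """
    if s < 1:
        raise ValueError("shrinkage parameter s must be >= 1")
    return [list(range(start, min(start + s, num_steps)))
            for start in range(0, num_steps, s)]


def coupled_masks(length: int, mask_ratio: float,
                  rng: random.Random) -> Tuple[List[bool], List[bool]]:
    """Draw a random mask over `length` positions and return it together
    with its complement, as in Coupled RL (True means the token is masked).
    """
    mask = [rng.random() < mask_ratio for _ in range(length)]
    complement = [not m for m in mask]
    return mask, complement


# A 7-step trace with s = 3 collapses into 3 aggregated steps.
print(aggregate_trace_steps(7, 3))  # [[0, 1, 2], [3, 4, 5], [6]]

# Every position is masked in exactly one of the two coupled samples.
mask, complement = coupled_masks(8, mask_ratio=0.5, rng=random.Random(0))
print(all(m != c for m, c in zip(mask, complement)))  # True
```

With s = 1 the trace is left unchanged; larger s trades granularity of the trajectory for fewer training passes. In the coupled setting, each token contributes to the loss in exactly one of the two masked samples.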

We also support a multi-node RL framework; you can submit the following as the entry command:

if [[ ${MLP_ROLE_INDEX:-0} -eq 0 ]]; then
    python multinode_rl.py config=configs/multinode_rl_trado.yaml
else
    exec tail -f /dev/null
fi

python multinode_rl.py config=configs/multinode_rl_sdar.yaml

python multinode_rl.py config=configs/multinode_rl_dream.yaml

python multinode_rl.py config=configs/multinode_rl_llada.yaml

python multinode_rl.py config=configs/multinode_rl_mmada.yaml

python multinode_rl_v.py config=configs/multinode_rl_lladav.yaml

python multinode_rl_v.py config=configs/multinode_rl_mmada_v.yaml

🔧 Supervised Finetuning

After downloading the data and setting the configurations, you can start supervised fine-tuning with:

accelerate launch \
  --num_machines 1 \
  --machine_rank 0 \
  --main_process_ip 127.0.0.1 \
  --main_process_port 8888 \
  --config_file accelerate_configs/1_node_8_gpus_deepspeed_zero3.yaml \
  train/sft_trado.py \
  config=configs/sft_trado.yaml

sft_sdar.py, sft_sdar.yaml

sft_dream.py, sft_dream.yaml

sft_llada.py, sft_llada.yaml

sft_mmada.py, sft_mmada.yaml

sft_mmada_v.py, sft_mmada_v.yaml

sft_lladav.py, sft_lladav.yaml

See ./configs for details.

We support different SFT strategies for different models.

Block diffusion models (e.g., TraDo and SDAR): support semi-autoregressive fine-tuning or trace fine-tuning (requires setting a specific trace first).

Adapted full-attention models (e.g., Dream and DiffuCoder): support the semi-autoregressive method (using sliced data), random-masking SFT, and AR training (i.e., standard SFT for LLMs).

Pretrained full-attention models (e.g., LLaDA and MMaDA): support semi-autoregressive and random-masking SFT.

To use multi-node, simply run:

accelerate launch \
  --num_machines $MLP_WORKER_NUM \
  --machine_rank $MLP_ROLE_INDEX \
  --main_process_ip $MLP_WORKER_0_HOST \
  --main_process_port $MLP_WORKER_0_PORT \
  --config_file accelerate_configs/4_node_8_gpus_deepspeed_zero3.yaml \
  train/sft_dream.py \
  config=configs/sft_dream.yaml

sft_trado.py, sft_trado.yaml

...

🤝 Acknowledgement

This work is heavily built on the following open-source models:

SDAR, Dream, LLaDA, MMaDA, LLaDA-V, and Diffu-Coder,

these acceleration methods (engines):

Fast-dllm and jetengine,

and these theoretical foundations:

MDLM, DiffuLLaMA, and Block Diffusion.

📖 Citation

@article{wang2025revolutionizing,
  title={Revolutionizing reinforcement learning framework for diffusion large language models},
  author={Wang, Yinjie and Yang, Ling and Li, Bowen and Tian, Ye and Shen, Ke and Wang, Mengdi},
  journal={arXiv preprint arXiv:2509.06949},
  year={2025}
}