[BREAKING][vllm, fsdp] feat: add Rollout-Training Mismatch Fix -- Truncated importance sampling by yaof20 · Pull Request #2953 · verl-project/verl (original) (raw)

and others added 3 commits

August 6, 2025 09:30

@yaof20

@zdhNarsil

straight through trick for kl gradient estimation

@yaof20

[gemini-code-assist[bot]](/apps/gemini-code-assist)

@yaof20 @gemini-code-assist

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

wuxibin89

PeterSH6

eric-haibin-lin

@yaof20

eric-haibin-lin

@yaof20

yangalan123

yangalan123

garrett361 pushed a commit to garrett361/verl that referenced this pull request

Sep 23, 2025

…ncated importance sampling (verl-project#2953)

Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):

TIS

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=20 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.calculate_log_probs=True \   # add this config to return rollout prob
    +actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@   # add this config to set up C value in TIS

Demonstrate the high-level design if this PR is complex, and list the specific changes.

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com

VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request

Sep 24, 2025

@kAIto47802

… functions (#3274)

What does this PR do?

In the recent PR:

the file workers/actor/dp_actor.py was updated so that rollout_log_probs is passed to policy_loss_fn:

https://github.com/volcengine/verl/blob/38d23914ee512a125e00763fe3ddcc8df4319346/verl/workers/actor/dp_actor.py#L448-L456

In that PR, the "vanilla" policy loss function was modified to accept rollout_log_probs as an argument. However, other policy loss functions (e.g., "gspo") were not updated accordingly, which leads to an error such as:

TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'

when setting config.policy_loss.loss_mode to one of these alternatives.

Therefore, in this PR, rollout_log_probs is also added as an argument to the other policy loss functions.

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request

Sep 24, 2025

@sharonyu-115

What does this PR do?

Add the TIS support from verl-project/verl#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

fabianlim pushed a commit to fabianlim/verl that referenced this pull request

Oct 4, 2025

…ncated importance sampling (verl-project#2953)

Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):

TIS

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=20 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.calculate_log_probs=True \   # add this config to return rollout prob
    +actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@   # add this config to set up C value in TIS

Demonstrate the high-level design if this PR is complex, and list the specific changes.

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com

WncFht pushed a commit to WncFht/verl that referenced this pull request

Oct 10, 2025

…ncated importance sampling (verl-project#2953)

What does this PR do?

Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):

TIS

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=20 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.calculate_log_probs=True \   # add this config to return rollout prob
    +actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@   # add this config to set up C value in TIS

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com

masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request

Oct 19, 2025

@kAIto47802 @masoudhashemi

… functions (verl-project#3274)

What does this PR do?

In the recent PR:

the file workers/actor/dp_actor.py was updated so that rollout_log_probs is passed to policy_loss_fn:

https://github.com/volcengine/verl/blob/38d23914ee512a125e00763fe3ddcc8df4319346/verl/workers/actor/dp_actor.py#L448-L456

In that PR, the "vanilla" policy loss function was modified to accept rollout_log_probs as an argument. However, other policy loss functions (e.g., "gspo") were not updated accordingly, which leads to an error such as:

TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'

when setting config.policy_loss.loss_mode to one of these alternatives.

Therefore, in this PR, rollout_log_probs is also added as an argument to the other policy loss functions.

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request

Oct 19, 2025

@sharonyu-115 @masoudhashemi

What does this PR do?

Add the TIS support from verl-project#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

techkang pushed a commit to techkang/verl that referenced this pull request

Oct 31, 2025

…ncated importance sampling (verl-project#2953)

What does this PR do?

Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):

TIS

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=20 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.calculate_log_probs=True \   # add this config to return rollout prob
    +actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@   # add this config to set up C value in TIS

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com

techkang pushed a commit to techkang/verl that referenced this pull request

Oct 31, 2025

@kAIto47802 @techkang

… functions (verl-project#3274)

What does this PR do?

In the recent PR:

the file workers/actor/dp_actor.py was updated so that rollout_log_probs is passed to policy_loss_fn:

https://github.com/volcengine/verl/blob/38d23914ee512a125e00763fe3ddcc8df4319346/verl/workers/actor/dp_actor.py#L448-L456

In that PR, the "vanilla" policy loss function was modified to accept rollout_log_probs as an argument. However, other policy loss functions (e.g., "gspo") were not updated accordingly, which leads to an error such as:

TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'

when setting config.policy_loss.loss_mode to one of these alternatives.

Therefore, in this PR, rollout_log_probs is also added as an argument to the other policy loss functions.

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

techkang pushed a commit to techkang/verl that referenced this pull request

Oct 31, 2025

@sharonyu-115 @techkang

What does this PR do?

Add the TIS support from verl-project#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

mtian8 pushed a commit to mtian8/verl that referenced this pull request

Nov 1, 2025

@sharonyu-115

What does this PR do?

Add the TIS support from verl-project#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request

Nov 14, 2025

…ncated importance sampling (verl-project#2953)

What does this PR do?

Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):

TIS

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=20 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.calculate_log_probs=True \   # add this config to return rollout prob
    +actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@   # add this config to set up C value in TIS

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request

Nov 14, 2025

@chenjiaoAngel

… functions (verl-project#3274)

What does this PR do?

In the recent PR:

the file workers/actor/dp_actor.py was updated so that rollout_log_probs is passed to policy_loss_fn:

https://github.com/volcengine/verl/blob/38d23914ee512a125e00763fe3ddcc8df4319346/verl/workers/actor/dp_actor.py#L448-L456

In that PR, the "vanilla" policy loss function was modified to accept rollout_log_probs as an argument. However, other policy loss functions (e.g., "gspo") were not updated accordingly, which leads to an error such as:

TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'

when setting config.policy_loss.loss_mode to one of these alternatives.

Therefore, in this PR, rollout_log_probs is also added as an argument to the other policy loss functions.

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request

Nov 14, 2025

@chenjiaoAngel

What does this PR do?

Add the TIS support from verl-project#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request

Nov 26, 2025

@sharonyu-115

What does this PR do?

Add the TIS support from verl-project#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

paolo328 added a commit to paolo328/Verl that referenced this pull request

Nov 27, 2025

@paolo328

… functions (#3274)

What does this PR do?

In the recent PR:

the file workers/actor/dp_actor.py was updated so that rollout_log_probs is passed to policy_loss_fn:

https://github.com/volcengine/verl/blob/38d23914ee512a125e00763fe3ddcc8df4319346/verl/workers/actor/dp_actor.py#L448-L456

In that PR, the "vanilla" policy loss function was modified to accept rollout_log_probs as an argument. However, other policy loss functions (e.g., "gspo") were not updated accordingly, which leads to an error such as:

TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'

when setting config.policy_loss.loss_mode to one of these alternatives.

Therefore, in this PR, rollout_log_probs is also added as an argument to the other policy loss functions.

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

paolo328 added a commit to paolo328/Verl that referenced this pull request

Nov 27, 2025

@paolo328

What does this PR do?

Add the TIS support from verl-project/verl#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

garrett361 pushed a commit to garrett361/verl that referenced this pull request

Dec 2, 2025

…ncated importance sampling (verl-project#2953)

Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):

TIS

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=20 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.calculate_log_probs=True \   # add this config to return rollout prob
    +actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@   # add this config to set up C value in TIS

Demonstrate the high-level design if this PR is complex, and list the specific changes.

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request

Dec 20, 2025

…ncated importance sampling (verl-project#2953)

What does this PR do?

Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):

TIS

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=20 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.calculate_log_probs=True \   # add this config to return rollout prob
    +actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@   # add this config to set up C value in TIS

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request

Dec 20, 2025

@kAIto47802

… functions (verl-project#3274)

What does this PR do?

In the recent PR:

the file workers/actor/dp_actor.py was updated so that rollout_log_probs is passed to policy_loss_fn:

https://github.com/volcengine/verl/blob/38d23914ee512a125e00763fe3ddcc8df4319346/verl/workers/actor/dp_actor.py#L448-L456

In that PR, the "vanilla" policy loss function was modified to accept rollout_log_probs as an argument. However, other policy loss functions (e.g., "gspo") were not updated accordingly, which leads to an error such as:

TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'

when setting config.policy_loss.loss_mode to one of these alternatives.

Therefore, in this PR, rollout_log_probs is also added as an argument to the other policy loss functions.

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request

Dec 20, 2025

@sharonyu-115

What does this PR do?

Add the TIS support from verl-project#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request

Jan 22, 2026

…ncated importance sampling (verl-project#2953)

What does this PR do?

Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):

TIS

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=20 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 \
    actor_rollout_ref.rollout.calculate_log_probs=True \   # add this config to return rollout prob
    +actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@   # add this config to set up C value in TIS

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request

Jan 22, 2026

@vyomakesh0728

… functions (verl-project#3274)

What does this PR do?

In the recent PR:

the file workers/actor/dp_actor.py was updated so that rollout_log_probs is passed to policy_loss_fn:

https://github.com/volcengine/verl/blob/38d23914ee512a125e00763fe3ddcc8df4319346/verl/workers/actor/dp_actor.py#L448-L456

In that PR, the "vanilla" policy loss function was modified to accept rollout_log_probs as an argument. However, other policy loss functions (e.g., "gspo") were not updated accordingly, which leads to an error such as:

TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'

when setting config.policy_loss.loss_mode to one of these alternatives.

Therefore, in this PR, rollout_log_probs is also added as an argument to the other policy loss functions.

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request

Jan 22, 2026

@vyomakesh0728

What does this PR do?

Add the TIS support from verl-project#2953 to megatron actor

Checklist Before Starting

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com

tongyx361 pushed a commit that referenced this pull request

Apr 20, 2026

@MaxwellJryao

…6058)

What does this PR do?

Fix a long-standing bug in core_algos.kl_penalty that makes every +-suffixed estimator name (k1+, kl+, abs+, k3+, low_var_kl+) crash with NotImplementedError on the first training step that touches KL — either through algorithm.kl_penalty (KL-in-reward) or actor.kl_loss_type (actor KL loss).

The straight-through trick that the + suffix is supposed to enable (unbiased k3 value with unbiased k2 gradient, see Schulman, Approximating KL divergence) was added in #2953 but has never actually worked end-to-end because the suffix was forwarded to kl_penalty_forward without being stripped, causing the dispatch to fall through to raise NotImplementedError.

Checklist Before Starting

Test

Added two CPU regression tests in tests/trainer/ppo/test_core_algos_on_cpu.py that get picked up automatically by cpu_unit_tests.yml (tests/**/test_*_on_cpu.py):

  1. test_kl_penalty_straight_through_value_matches_base — parametrized over (k1+, kl+, abs+, k3+, low_var_kl+), asserts that the suffixed estimator returns the same forward value as its base.
  2. test_kl_penalty_k3_plus_uses_k2_gradient — asserts that the gradient w.r.t. logprob produced by k3+ is exactly the gradient produced by k2, i.e. the straight-through trick is wired correctly.

Local run on top of the change:

$ pytest -xvs tests/trainer/ppo/test_core_algos_on_cpu.py
============================= test session starts ==============================
collected 22 items
tests/trainer/ppo/test_core_algos_on_cpu.py ......................   [100%]
============================== 22 passed in 4.79s ==============================

Without the fix, every ..._straight_through_value_matches_base[*] case raises NotImplementedError in kl_penalty_forward.

API and Usage Example

No API change. The fix simply makes the previously-documented + suffix work as advertised. After this PR the following configurations are usable (they currently crash on step 1):

# KL-in-reward with the straight-through trick:
python -m verl.trainer.main_ppo \
    algorithm.use_kl_in_reward=True \
    algorithm.kl_penalty=k3+ \
    algorithm.kl_ctrl.kl_coef=1e-3

# Actor KL loss with the straight-through trick:
python -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl+ \
    actor_rollout_ref.actor.kl_loss_coef=1e-3

Design & Code Changes

Root cause is a single-line oversight in core_algos.kl_penalty:

# before
forward_score = kl_penalty_forward(logprob, ref_logprob, kl_penalty)
if not kl_penalty.endswith("+") or kl_penalty in ("mse", "k2"):
    return forward_score

kl_penalty_forward only recognizes the bare names (k1, kl, abs, mse, k2, k3, low_var_kl, full), so it falls through to raise NotImplementedError whenever the input ends with +.

Fix: strip the optional suffix before dispatch, leaving the outer endswith("+") switch untouched so the straight-through path is still selected correctly:

# after
base_kl_penalty = kl_penalty[:-1] if kl_penalty.endswith("+") else kl_penalty
forward_score = kl_penalty_forward(logprob, ref_logprob, base_kl_penalty)
if not kl_penalty.endswith("+") or kl_penalty in ("mse", "k2"):
    return forward_score

Files changed:

Checklist Before Submitting

xvlincaigou pushed a commit to xvlincaigou/verl that referenced this pull request

May 19, 2026

@MaxwellJryao

…erl-project#6058)

What does this PR do?

Fix a long-standing bug in core_algos.kl_penalty that makes every +-suffixed estimator name (k1+, kl+, abs+, k3+, low_var_kl+) crash with NotImplementedError on the first training step that touches KL — either through algorithm.kl_penalty (KL-in-reward) or actor.kl_loss_type (actor KL loss).

The straight-through trick that the + suffix is supposed to enable (unbiased k3 value with unbiased k2 gradient, see Schulman, Approximating KL divergence) was added in verl-project#2953 but has never actually worked end-to-end because the suffix was forwarded to kl_penalty_forward without being stripped, causing the dispatch to fall through to raise NotImplementedError.

Checklist Before Starting

Test

Added two CPU regression tests in tests/trainer/ppo/test_core_algos_on_cpu.py that get picked up automatically by cpu_unit_tests.yml (tests/**/test_*_on_cpu.py):

  1. test_kl_penalty_straight_through_value_matches_base — parametrized over (k1+, kl+, abs+, k3+, low_var_kl+), asserts that the suffixed estimator returns the same forward value as its base.
  2. test_kl_penalty_k3_plus_uses_k2_gradient — asserts that the gradient w.r.t. logprob produced by k3+ is exactly the gradient produced by k2, i.e. the straight-through trick is wired correctly.

Local run on top of the change:

$ pytest -xvs tests/trainer/ppo/test_core_algos_on_cpu.py
============================= test session starts ==============================
collected 22 items
tests/trainer/ppo/test_core_algos_on_cpu.py ......................   [100%]
============================== 22 passed in 4.79s ==============================

Without the fix, every ..._straight_through_value_matches_base[*] case raises NotImplementedError in kl_penalty_forward.

API and Usage Example

No API change. The fix simply makes the previously-documented + suffix work as advertised. After this PR the following configurations are usable (they currently crash on step 1):

# KL-in-reward with the straight-through trick:
python -m verl.trainer.main_ppo \
    algorithm.use_kl_in_reward=True \
    algorithm.kl_penalty=k3+ \
    algorithm.kl_ctrl.kl_coef=1e-3

# Actor KL loss with the straight-through trick:
python -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl+ \
    actor_rollout_ref.actor.kl_loss_coef=1e-3

Design & Code Changes

Root cause is a single-line oversight in core_algos.kl_penalty:

# before
forward_score = kl_penalty_forward(logprob, ref_logprob, kl_penalty)
if not kl_penalty.endswith("+") or kl_penalty in ("mse", "k2"):
    return forward_score

kl_penalty_forward only recognizes the bare names (k1, kl, abs, mse, k2, k3, low_var_kl, full), so it falls through to raise NotImplementedError whenever the input ends with +.

Fix: strip the optional suffix before dispatch, leaving the outer endswith("+") switch untouched so the straight-through path is still selected correctly:

# after
base_kl_penalty = kl_penalty[:-1] if kl_penalty.endswith("+") else kl_penalty
forward_score = kl_penalty_forward(logprob, ref_logprob, base_kl_penalty)
if not kl_penalty.endswith("+") or kl_penalty in ("mse", "k2"):
    return forward_score

Files changed:

Checklist Before Submitting

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})