[BREAKING][vllm, fsdp] feat: add Rollout-Training Mismatch Fix -- Truncated importance sampling by yaof20 · Pull Request #2953 · verl-project/verl (original) (raw)
and others added 3 commits
straight through trick for kl gradient estimation
[](/apps/gemini-code-assist)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
garrett361 pushed a commit to garrett361/verl that referenced this pull request
…ncated importance sampling (verl-project#2953)
Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
Demonstrate how the API changes if any, and provide usage example(s) if possible.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-32B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
actor_rollout_ref.rollout.calculate_log_probs=True \ # add this config to return rollout prob
+actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@ # add this config to set up C value in TISDemonstrate the high-level design if this PR is complex, and list the specific changes.
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com
VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request
… functions (#3274)
What does this PR do?
In the recent PR:
the file workers/actor/dp_actor.py was updated so that
rollout_log_probs is passed to policy_loss_fn:
In that PR, the "vanilla" policy loss function was modified to accept
rollout_log_probs as an argument. However, other policy loss functions
(e.g., "gspo") were not updated accordingly, which leads to an error
such as:
TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'when setting config.policy_loss.loss_mode to one of these
alternatives.
Therefore, in this PR, rollout_log_probs is also added as an argument
to the other policy loss functions.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
VocabVictor pushed a commit to VocabVictor/verl-plus that referenced this pull request
What does this PR do?
Add the TIS support from verl-project/verl#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
fabianlim pushed a commit to fabianlim/verl that referenced this pull request
…ncated importance sampling (verl-project#2953)
Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
Demonstrate how the API changes if any, and provide usage example(s) if possible.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-32B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
actor_rollout_ref.rollout.calculate_log_probs=True \ # add this config to return rollout prob
+actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@ # add this config to set up C value in TISDemonstrate the high-level design if this PR is complex, and list the specific changes.
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com
WncFht pushed a commit to WncFht/verl that referenced this pull request
…ncated importance sampling (verl-project#2953)
What does this PR do?
Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-32B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
actor_rollout_ref.rollout.calculate_log_probs=True \ # add this config to return rollout prob
+actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@ # add this config to set up C value in TISDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com
masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request
… functions (verl-project#3274)
What does this PR do?
In the recent PR:
the file workers/actor/dp_actor.py was updated so that
rollout_log_probs is passed to policy_loss_fn:
In that PR, the "vanilla" policy loss function was modified to accept
rollout_log_probs as an argument. However, other policy loss functions
(e.g., "gspo") were not updated accordingly, which leads to an error
such as:
TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'when setting config.policy_loss.loss_mode to one of these
alternatives.
Therefore, in this PR, rollout_log_probs is also added as an argument
to the other policy loss functions.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
masoudhashemi pushed a commit to masoudhashemi/verl that referenced this pull request
What does this PR do?
Add the TIS support from verl-project#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
techkang pushed a commit to techkang/verl that referenced this pull request
…ncated importance sampling (verl-project#2953)
What does this PR do?
Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-32B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
actor_rollout_ref.rollout.calculate_log_probs=True \ # add this config to return rollout prob
+actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@ # add this config to set up C value in TISDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com
techkang pushed a commit to techkang/verl that referenced this pull request
… functions (verl-project#3274)
What does this PR do?
In the recent PR:
the file workers/actor/dp_actor.py was updated so that
rollout_log_probs is passed to policy_loss_fn:
In that PR, the "vanilla" policy loss function was modified to accept
rollout_log_probs as an argument. However, other policy loss functions
(e.g., "gspo") were not updated accordingly, which leads to an error
such as:
TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'when setting config.policy_loss.loss_mode to one of these
alternatives.
Therefore, in this PR, rollout_log_probs is also added as an argument
to the other policy loss functions.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
techkang pushed a commit to techkang/verl that referenced this pull request
What does this PR do?
Add the TIS support from verl-project#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
mtian8 pushed a commit to mtian8/verl that referenced this pull request
What does this PR do?
Add the TIS support from verl-project#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request
…ncated importance sampling (verl-project#2953)
What does this PR do?
Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-32B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
actor_rollout_ref.rollout.calculate_log_probs=True \ # add this config to return rollout prob
+actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@ # add this config to set up C value in TISDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request
… functions (verl-project#3274)
What does this PR do?
In the recent PR:
the file workers/actor/dp_actor.py was updated so that
rollout_log_probs is passed to policy_loss_fn:
In that PR, the "vanilla" policy loss function was modified to accept
rollout_log_probs as an argument. However, other policy loss functions
(e.g., "gspo") were not updated accordingly, which leads to an error
such as:
TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'when setting config.policy_loss.loss_mode to one of these
alternatives.
Therefore, in this PR, rollout_log_probs is also added as an argument
to the other policy loss functions.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request
What does this PR do?
Add the TIS support from verl-project#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request
What does this PR do?
Add the TIS support from verl-project#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
paolo328 added a commit to paolo328/Verl that referenced this pull request
… functions (#3274)
What does this PR do?
In the recent PR:
the file workers/actor/dp_actor.py was updated so that
rollout_log_probs is passed to policy_loss_fn:
In that PR, the "vanilla" policy loss function was modified to accept
rollout_log_probs as an argument. However, other policy loss functions
(e.g., "gspo") were not updated accordingly, which leads to an error
such as:
TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'when setting config.policy_loss.loss_mode to one of these
alternatives.
Therefore, in this PR, rollout_log_probs is also added as an argument
to the other policy loss functions.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
paolo328 added a commit to paolo328/Verl that referenced this pull request
What does this PR do?
Add the TIS support from verl-project/verl#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
garrett361 pushed a commit to garrett361/verl that referenced this pull request
…ncated importance sampling (verl-project#2953)
Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
Demonstrate how the API changes if any, and provide usage example(s) if possible.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-32B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
actor_rollout_ref.rollout.calculate_log_probs=True \ # add this config to return rollout prob
+actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@ # add this config to set up C value in TISDemonstrate the high-level design if this PR is complex, and list the specific changes.
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request
…ncated importance sampling (verl-project#2953)
What does this PR do?
Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-32B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
actor_rollout_ref.rollout.calculate_log_probs=True \ # add this config to return rollout prob
+actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@ # add this config to set up C value in TISDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request
… functions (verl-project#3274)
What does this PR do?
In the recent PR:
the file workers/actor/dp_actor.py was updated so that
rollout_log_probs is passed to policy_loss_fn:
In that PR, the "vanilla" policy loss function was modified to accept
rollout_log_probs as an argument. However, other policy loss functions
(e.g., "gspo") were not updated accordingly, which leads to an error
such as:
TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'when setting config.policy_loss.loss_mode to one of these
alternatives.
Therefore, in this PR, rollout_log_probs is also added as an argument
to the other policy loss functions.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request
What does this PR do?
Add the TIS support from verl-project#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request
…ncated importance sampling (verl-project#2953)
What does this PR do?
Support vLLM-FSDP off-policy importance sampling correction using Truncated Importance Sampling (TIS):
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=gae \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
critic.optim.lr=1e-5 \
critic.model.use_remove_padding=True \
critic.model.path=Qwen/Qwen2.5-32B-Instruct \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.model.fsdp_config.param_offload=False \
critic.model.fsdp_config.optimizer_offload=False \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
trainer.project_name='verl_example' \
trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=4 \
trainer.save_freq=20 \
trainer.test_freq=10 \
trainer.total_epochs=15 \
actor_rollout_ref.rollout.calculate_log_probs=True \ # add this config to return rollout prob
+actor_rollout_ref.actor.behav_imp_weight_cap=10.0$@ # add this config to set up C value in TISDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Narsil-Dinghuai Zhang 张鼎怀 dinghuai233@gmail.com Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: LiyuanLucasLiu llychinalz@gmail.com
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request
… functions (verl-project#3274)
What does this PR do?
In the recent PR:
the file workers/actor/dp_actor.py was updated so that
rollout_log_probs is passed to policy_loss_fn:
In that PR, the "vanilla" policy loss function was modified to accept
rollout_log_probs as an argument. However, other policy loss functions
(e.g., "gspo") were not updated accordingly, which leads to an error
such as:
TypeError: compute_policy_loss_gspo() got an unexpected keyword argument 'rollout_log_probs'when setting config.policy_loss.loss_mode to one of these
alternatives.
Therefore, in this PR, rollout_log_probs is also added as an argument
to the other policy loss functions.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request
What does this PR do?
Add the TIS support from verl-project#2953 to megatron actor
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here: ...
- Format the PR title as
[{modules}] {type}: {description}(This will be checked by the CI) {modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data- If this PR involves multiple modules, separate them with
,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add
[BREAKING]to the beginning of the title.- Example:
[BREAKING][fsdp, megatron] feat: dynamic batching
- Example:
Test
For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
API and Usage Example
Demonstrate how the API changes if any, and provide usage example(s) if possible.
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Demonstrate the high-level design if this PR is complex, and list the specific changes.
Checklist Before Submitting
[!IMPORTANT] Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always - Add / Update the documentation.
- Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
- Once your PR is ready for CI, send a message in the
ci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)
Co-authored-by: Shuang Yu shuangy@shuangy-mlt.client.nvidia.com
tongyx361 pushed a commit that referenced this pull request
What does this PR do?
Fix a long-standing bug in core_algos.kl_penalty that makes every
+-suffixed estimator name (k1+, kl+, abs+, k3+, low_var_kl+)
crash with NotImplementedError on the first training step that touches
KL — either through algorithm.kl_penalty (KL-in-reward) or
actor.kl_loss_type (actor KL loss).
The straight-through trick that the + suffix is supposed to enable
(unbiased k3 value with unbiased k2 gradient, see
Schulman, Approximating KL
divergence)
was added in #2953 but has never actually worked end-to-end because the
suffix was forwarded to kl_penalty_forward without being stripped,
causing the dispatch to fall through to raise NotImplementedError.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here:
is:pr kl_penalty k3+— 0 open, 2 closed (docstring typo and a rename, neither addresses this bug)is:pr kl_penalty endswith— 0 resultsis:pr kl_penalty_forward— 0 resultsis:issue "k3+"— no related issue- Format the PR title as
[{modules}] {type}: {description}—[algo] fix: …. Single-line behavior fix inverl/trainer/ppo/core_algos.py, no API surface change.
Test
Added two CPU regression tests in
tests/trainer/ppo/test_core_algos_on_cpu.py that get picked up
automatically by cpu_unit_tests.yml (tests/**/test_*_on_cpu.py):
test_kl_penalty_straight_through_value_matches_base— parametrized over(k1+, kl+, abs+, k3+, low_var_kl+), asserts that the suffixed estimator returns the same forward value as its base.test_kl_penalty_k3_plus_uses_k2_gradient— asserts that the gradient w.r.t.logprobproduced byk3+is exactly the gradient produced byk2, i.e. the straight-through trick is wired correctly.
Local run on top of the change:
$ pytest -xvs tests/trainer/ppo/test_core_algos_on_cpu.py
============================= test session starts ==============================
collected 22 items
tests/trainer/ppo/test_core_algos_on_cpu.py ...................... [100%]
============================== 22 passed in 4.79s ==============================Without the fix, every ..._straight_through_value_matches_base[*]
case raises NotImplementedError in kl_penalty_forward.
API and Usage Example
No API change. The fix simply makes the previously-documented +
suffix work as advertised. After this PR the following configurations
are usable (they currently crash on step 1):
# KL-in-reward with the straight-through trick:
python -m verl.trainer.main_ppo \
algorithm.use_kl_in_reward=True \
algorithm.kl_penalty=k3+ \
algorithm.kl_ctrl.kl_coef=1e-3
# Actor KL loss with the straight-through trick:
python -m verl.trainer.main_ppo \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_type=low_var_kl+ \
actor_rollout_ref.actor.kl_loss_coef=1e-3Design & Code Changes
Root cause is a single-line oversight in core_algos.kl_penalty:
# before
forward_score = kl_penalty_forward(logprob, ref_logprob, kl_penalty)
if not kl_penalty.endswith("+") or kl_penalty in ("mse", "k2"):
return forward_scorekl_penalty_forward only recognizes the bare names (k1, kl, abs,
mse, k2, k3, low_var_kl, full), so it falls through to
raise NotImplementedError whenever the input ends with +.
Fix: strip the optional suffix before dispatch, leaving the outer
endswith("+") switch untouched so the straight-through path is still
selected correctly:
# after
base_kl_penalty = kl_penalty[:-1] if kl_penalty.endswith("+") else kl_penalty
forward_score = kl_penalty_forward(logprob, ref_logprob, base_kl_penalty)
if not kl_penalty.endswith("+") or kl_penalty in ("mse", "k2"):
return forward_scoreFiles changed:
verl/trainer/ppo/core_algos.py— 1-line fix + 1-line comment.tests/trainer/ppo/test_core_algos_on_cpu.py— 2 regression tests.
Checklist Before Submitting
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always— all 12 hooks pass on the touched files (ruff,ruff-format,mypy,autogen-trainer-cfg,check-docs-time-info,check-docstrings,check-license,check-device-api-usage,check-dataproto-usage,validate-structure,check-naming-conventions,compileall). - Add / Update the
documentation —
N/A (bug fix; the suffix was already documented in the
kl_penaltydocstring). - Add unit or end-to-end test(s) to the CI
workflow.
The new tests are added to
tests/trainer/ppo/test_core_algos_on_cpu.pyand are picked up automatically bycpu_unit_tests.yml(tests/**/test_*_on_cpu.py). - Once your PR is ready for CI, send a message in the
ci-requestchannel. - Not related to the
recipesubmodule.
xvlincaigou pushed a commit to xvlincaigou/verl that referenced this pull request
What does this PR do?
Fix a long-standing bug in core_algos.kl_penalty that makes every
+-suffixed estimator name (k1+, kl+, abs+, k3+, low_var_kl+)
crash with NotImplementedError on the first training step that touches
KL — either through algorithm.kl_penalty (KL-in-reward) or
actor.kl_loss_type (actor KL loss).
The straight-through trick that the + suffix is supposed to enable
(unbiased k3 value with unbiased k2 gradient, see
Schulman, Approximating KL
divergence)
was added in verl-project#2953 but has never actually worked end-to-end because the
suffix was forwarded to kl_penalty_forward without being stripped,
causing the dispatch to fall through to raise NotImplementedError.
Checklist Before Starting
- Search for similar PRs. Paste at least one query link here:
is:pr kl_penalty k3+— 0 open, 2 closed (docstring typo and a rename, neither addresses this bug)is:pr kl_penalty endswith— 0 resultsis:pr kl_penalty_forward— 0 resultsis:issue "k3+"— no related issue- Format the PR title as
[{modules}] {type}: {description}—[algo] fix: …. Single-line behavior fix inverl/trainer/ppo/core_algos.py, no API surface change.
Test
Added two CPU regression tests in
tests/trainer/ppo/test_core_algos_on_cpu.py that get picked up
automatically by cpu_unit_tests.yml (tests/**/test_*_on_cpu.py):
test_kl_penalty_straight_through_value_matches_base— parametrized over(k1+, kl+, abs+, k3+, low_var_kl+), asserts that the suffixed estimator returns the same forward value as its base.test_kl_penalty_k3_plus_uses_k2_gradient— asserts that the gradient w.r.t.logprobproduced byk3+is exactly the gradient produced byk2, i.e. the straight-through trick is wired correctly.
Local run on top of the change:
$ pytest -xvs tests/trainer/ppo/test_core_algos_on_cpu.py
============================= test session starts ==============================
collected 22 items
tests/trainer/ppo/test_core_algos_on_cpu.py ...................... [100%]
============================== 22 passed in 4.79s ==============================Without the fix, every ..._straight_through_value_matches_base[*]
case raises NotImplementedError in kl_penalty_forward.
API and Usage Example
No API change. The fix simply makes the previously-documented +
suffix work as advertised. After this PR the following configurations
are usable (they currently crash on step 1):
# KL-in-reward with the straight-through trick:
python -m verl.trainer.main_ppo \
algorithm.use_kl_in_reward=True \
algorithm.kl_penalty=k3+ \
algorithm.kl_ctrl.kl_coef=1e-3
# Actor KL loss with the straight-through trick:
python -m verl.trainer.main_ppo \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_type=low_var_kl+ \
actor_rollout_ref.actor.kl_loss_coef=1e-3Design & Code Changes
Root cause is a single-line oversight in core_algos.kl_penalty:
# before
forward_score = kl_penalty_forward(logprob, ref_logprob, kl_penalty)
if not kl_penalty.endswith("+") or kl_penalty in ("mse", "k2"):
return forward_scorekl_penalty_forward only recognizes the bare names (k1, kl, abs,
mse, k2, k3, low_var_kl, full), so it falls through to
raise NotImplementedError whenever the input ends with +.
Fix: strip the optional suffix before dispatch, leaving the outer
endswith("+") switch untouched so the straight-through path is still
selected correctly:
# after
base_kl_penalty = kl_penalty[:-1] if kl_penalty.endswith("+") else kl_penalty
forward_score = kl_penalty_forward(logprob, ref_logprob, base_kl_penalty)
if not kl_penalty.endswith("+") or kl_penalty in ("mse", "k2"):
return forward_scoreFiles changed:
verl/trainer/ppo/core_algos.py— 1-line fix + 1-line comment.tests/trainer/ppo/test_core_algos_on_cpu.py— 2 regression tests.
Checklist Before Submitting
- Read the Contribute Guide.
- Apply pre-commit
checks:
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always— all 12 hooks pass on the touched files (ruff,ruff-format,mypy,autogen-trainer-cfg,check-docs-time-info,check-docstrings,check-license,check-device-api-usage,check-dataproto-usage,validate-structure,check-naming-conventions,compileall). - Add / Update the
documentation —
N/A (bug fix; the suffix was already documented in the
kl_penaltydocstring). - Add unit or end-to-end test(s) to the CI
workflow.
The new tests are added to
tests/trainer/ppo/test_core_algos_on_cpu.pyand are picked up automatically bycpu_unit_tests.yml(tests/**/test_*_on_cpu.py). - Once your PR is ready for CI, send a message in the
ci-requestchannel. - Not related to the
recipesubmodule.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})