Config Explanation (verl documentation)

Last updated: 06/18/2025.

ppo_trainer.yaml for RL FSDP Backend

Data

data:
  tokenizer: null
  train_files: ~/data/rlhf/gsm8k/train.parquet
  val_files: ~/data/rlhf/gsm8k/test.parquet
  train_max_samples: -1  # set to -1 to use full dataset
  val_max_samples: -1  # set to -1 to use full dataset
  prompt_key: prompt
  max_prompt_length: 512
  max_response_length: 512
  train_batch_size: 1024
  return_raw_input_ids: False  # This should be set to true when the tokenizer between policy and rm differs
  return_raw_chat: False
  return_full_prompt: False
  shuffle: True
  seed: 42
  filter_overlong_prompts: False
  filter_overlong_prompts_workers: 1
  truncation: error
  image_key: images
  trust_remote_code: True
  custom_cls:
    path: null
    name: null

Customized Dataset

Customized dataset extension is implemented for the SFT trainer and can be extended to other trainers with similar changes.

custom_cls:
  path: null
  name: null
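As a rough illustration of how a path plus name pair such as custom_cls could be resolved, the sketch below imports a Python file and looks up a class by name. The helper load_class_from_file, the MyDataset class, and the demo function are hypothetical names invented for this example; verl's own loader may work differently, and a real dataset class would typically subclass torch.utils.data.Dataset rather than being a plain class.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_class_from_file(path: str, name: str):
    """Import the Python file at `path` and return the attribute called `name`."""
    spec = importlib.util.spec_from_file_location("custom_dataset_module", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot import module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    try:
        return getattr(module, name)
    except AttributeError:
        raise AttributeError(f"{path!r} does not define {name!r}") from None


# Hypothetical dataset source; kept dependency-free so the sketch runs anywhere.
EXAMPLE_SOURCE = '''
class MyDataset:
    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.rows[index]
'''


def demo() -> int:
    """Write the example class to a temp file, load it by path and name, and use it."""
    with tempfile.TemporaryDirectory() as tmp:
        source_path = Path(tmp) / "my_dataset.py"
        source_path.write_text(EXAMPLE_SOURCE)
        dataset_cls = load_class_from_file(str(source_path), "MyDataset")
        dataset = dataset_cls([{"prompt": "1 + 1 = ?"}, {"prompt": "2 + 2 = ?"}])
        return len(dataset)
```

In this sketch, custom_cls.path would correspond to the file path and custom_cls.name to the class name passed to the helper.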

Actor/Rollout/Reference Policy

actor_rollout_ref:
  hybrid_engine: True
  model:
    path: ~/models/deepseek-llm-7b-chat
    external_lib: null
    override_config:
      attn_implementation: flash_attention_2  # or eager, sdpa - attention implementation override
      model_config: {}
      moe_config:  # Megatron only, can adjust moe configuration
        freeze_moe_router: False  # Megatron only, can freeze moe router (no grad)
    enable_gradient_checkpointing: False
    enable_activation_offload: False
    trust_remote_code: False
    use_remove_padding: False
  actor:
    strategy: fsdp  # This is for backward-compatibility
    ppo_mini_batch_size: 256
    ppo_micro_batch_size: null  # will be deprecated, use ppo_micro_batch_size_per_gpu
    ppo_micro_batch_size_per_gpu: 8
    use_dynamic_bsz: False
    ppo_max_token_len_per_gpu: 16384  # n * ${data.max_prompt_length} + ${data.max_response_length}
    grad_clip: 1.0
    clip_ratio: 0.2
    entropy_coeff: 0.0
    use_kl_loss: False  # True for GRPO
    # Rollout Correction (corrects distribution mismatch between rollout and training)
    rollout_correction:
      rollout_is: token  # IS weights
      rollout_is_threshold: 2.0  # TIS upper bound, or "0.5_5.0" for IcePop
      rollout_rs: null  # Rejection sampling
      rollout_rs_threshold: null  # RS upper threshold
    use_torch_compile: True  # False to disable torch compile
    kl_loss_coef: 0.001  # for grpo
    kl_loss_type: low_var_kl  # for grpo
    ppo_epochs: 1
    data_loader_seed: null
    shuffle: False
    ulysses_sequence_parallel_size: 1  # sp size
    optim:
      lr: 1e-6
      lr_warmup_steps: -1  # Prioritized. Negative values mean delegating to lr_warmup_steps_ratio.
      lr_warmup_steps_ratio: 0.  # the total steps will be injected during runtime
      min_lr_ratio: 0.0  # only used with cosine lr scheduler, default to 0.0
      num_cycles: 0.5  # only used with cosine lr scheduler, default to 0.5
      lr_scheduler_type: constant  # select from constant/cosine
      total_training_steps: -1  # must be override by program
    fsdp_config:
      wrap_policy:
        # transformer_layer_cls_to_wrap: None
        min_num_params: 0
      param_offload: False
      optimizer_offload: False
      fsdp_size: -1
    checkpoint:
      # What to include in saved checkpoints
      # 'hf_model' saves the full model in HuggingFace format. For Megatron this requires
      # actor.megatron.use_mbridge=True (the default); 'model' and 'hf_model' then produce
      # the same HF checkpoint and are deduplicated (saved once). With mbridge disabled,
      # only the sharded 'model' is supported -- use verl.model_merger after training to
      # convert it to HF format.
      save_contents: ['model', 'optimizer', 'extra']
      # For more flexibility, you can specify the contents to load from the checkpoint.
      load_contents: ${actor_rollout_ref.actor.checkpoint.save_contents}
  ref:
    fsdp_config:
      param_offload: False
      wrap_policy:
        # transformer_layer_cls_to_wrap: None
        min_num_params: 0
    log_prob_micro_batch_size: null  # will be deprecated, use log_prob_micro_batch_size_per_gpu
    log_prob_micro_batch_size_per_gpu: 16
    log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
    log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
    ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size}  # sp size
  rollout:
    name: vllm
    temperature: 1.0
    top_k: -1  # 0 for hf rollout, -1 for vllm rollout
    top_p: 1
    prompt_length: ${data.max_prompt_length}  # not use for opensource
    response_length: ${data.max_response_length}
    # for vllm rollout
    dtype: bfloat16  # should align with FSDP
    gpu_memory_utilization: 0.5
    ignore_eos: False
    enforce_eager: True
    free_cache_engine: True
    load_format: dummy_dtensor
    tensor_model_parallel_size: 2
    max_num_batched_tokens: 8192
    max_num_seqs: 1024
    log_prob_micro_batch_size: null  # will be deprecated, use log_prob_micro_batch_size_per_gpu
    log_prob_micro_batch_size_per_gpu: 16
    log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
    log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
    # for hf rollout
    do_sample: True
    engine_kwargs:  # inference engine parameters, please refer vllm/sglang official doc for detail
      vllm: {}
      sglang: {}

    n: 1  # for each prompt, sample n responses (i.e. num sample times). set it to values > 1 for grpo, rloo
    calculate_log_probs: False  # set to True for computing log probs via rollouts
    val_kwargs:
      # sampling parameters for validation
      top_k: -1  # 0 for hf rollout, -1 for vllm rollout
      top_p: 1.0
      temperature: 0
      n: 1
      do_sample: False  # default eager for validation

    agent:
      custom_async_server:  # Use custom async server implementation for rollout
        path: null
        name: null
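To make the actor's rollout_correction settings more concrete, here is a minimal sketch of token-level truncated importance sampling (TIS). It assumes that rollout_is: token means one weight per token and that rollout_is_threshold is the upper clip on that weight, as the config comments suggest. The function name and list-based inputs are hypothetical; verl's actual implementation operates on tensors and may differ in detail.

```python
import math


def truncated_is_weights(train_log_probs, rollout_log_probs, upper=2.0):
    """Per-token weights w_t = min(exp(logp_train_t - logp_rollout_t), upper).

    `train_log_probs` are log-probabilities of the sampled tokens under the
    training policy; `rollout_log_probs` are the same tokens' log-probabilities
    under the rollout engine. `upper` plays the role of rollout_is_threshold.
    """
    if len(train_log_probs) != len(rollout_log_probs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        min(math.exp(train - rollout), upper)
        for train, rollout in zip(train_log_probs, rollout_log_probs)
    ]
```

When the two engines agree on a token the weight is 1.0; a token the training policy finds far more likely than the rollout engine did is capped at the threshold instead of dominating the loss.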

Common config for actor, rollout and reference model

Actor model

Reference Model

The reference model is enabled when actor.use_kl_loss, algorithm.use_kl_in_reward, or both are True.
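The condition can be written as a small predicate. This is only a sketch: needs_reference_model is a hypothetical helper, and the nested dict stands in for the real config object.

```python
def needs_reference_model(config: dict) -> bool:
    """Return True if either KL option that requires a reference policy is on."""
    use_kl_loss = config["actor_rollout_ref"]["actor"].get("use_kl_loss", False)
    use_kl_in_reward = config["algorithm"].get("use_kl_in_reward", False)
    return bool(use_kl_loss or use_kl_in_reward)


# Defaults from ppo_trainer.yaml: both flags are False.
default_config = {
    "actor_rollout_ref": {"actor": {"use_kl_loss": False}},
    "algorithm": {"use_kl_in_reward": False},
}

# GRPO-style setup: the KL term is part of the actor loss.
grpo_config = {
    "actor_rollout_ref": {"actor": {"use_kl_loss": True}},
    "algorithm": {"use_kl_in_reward": False},
}
```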

Rollout Model

Note

In the load_format field, users only need to select from dummy_megatron, dummy_dtensor, or dummy_hf for rollout initialization. The hybrid engine then selects the corresponding weight loader (megatron, dtensor, or hf) during actor/rollout weight synchronization.
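The correspondence described in this note can be sketched as a lookup table. The dictionary and function names are hypothetical and only mirror the three pairs listed above; they are not verl's internal API.

```python
# Mapping taken from the note: dummy_<x> initialization pairs with the <x> weight loader.
WEIGHT_LOADER_BY_LOAD_FORMAT = {
    "dummy_megatron": "megatron",
    "dummy_dtensor": "dtensor",
    "dummy_hf": "hf",
}


def weight_loader_for(load_format: str) -> str:
    """Return the weight loader name paired with a dummy_* load format."""
    try:
        return WEIGHT_LOADER_BY_LOAD_FORMAT[load_format]
    except KeyError:
        allowed = ", ".join(sorted(WEIGHT_LOADER_BY_LOAD_FORMAT))
        raise ValueError(
            f"unsupported load_format {load_format!r}; expected one of: {allowed}"
        ) from None
```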

Megatron Optimizer and Optimizer Parameter Scheduler

optim:
  optimizer: adam
  lr: 1e-6
  clip_grad: 1.0
  total_training_steps: -1  # must be override by program
  lr_warmup_init: 0.0  # initial learning rate for warmup, default to 0.0
  lr_warmup_steps: -1  # Prioritized. Negative values mean delegating to lr_warmup_steps_ratio.
  lr_warmup_steps_ratio: 0.  # the total steps will be injected during runtime
  lr_decay_steps: null
  lr_decay_style: constant  # select from constant/linear/cosine/inverse_square_root
  min_lr: 0.0  # minimum learning rate, default to 0.0
  weight_decay: 0.01
  weight_decay_incr_style: constant  # select from constant/linear/cosine
  lr_wsd_decay_style: exponential  # select from constant/exponential/cosine
  lr_wsd_decay_steps: null
  use_checkpoint_opt_param_scheduler: False  # use checkpoint optimizer parameter scheduler
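The comments on lr_warmup_steps and lr_warmup_steps_ratio describe a precedence rule. A minimal sketch of that rule follows; resolve_warmup_steps is a hypothetical helper, and the truncation to an integer is an assumption about how the ratio is applied.

```python
def resolve_warmup_steps(
    lr_warmup_steps: int,
    lr_warmup_steps_ratio: float,
    total_training_steps: int,
) -> int:
    """Pick the number of warmup steps.

    A non-negative lr_warmup_steps takes priority. A negative value delegates
    to lr_warmup_steps_ratio, applied to the total number of training steps
    (which is only known at runtime).
    """
    if lr_warmup_steps >= 0:
        return lr_warmup_steps
    return int(lr_warmup_steps_ratio * total_training_steps)
```

With the defaults shown above (lr_warmup_steps: -1 and lr_warmup_steps_ratio: 0.), this sketch yields zero warmup steps regardless of the total step count.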

Notice that there are some differences in APIs between Megatron optimizer and FSDP optimizer.

For learning rate decay, Megatron's original pretraining default for lr_decay_style is linear, meaning that the learning rate decays linearly from the initial learning rate to min_lr within lr_decay_steps. In verl, however, the default lr_decay_style is set to constant to align with FSDP's default behavior, meaning that the learning rate stays constant after the warmup stage.
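The difference between the two decay styles can be sketched as follows. This is a simplified illustration, not Megatron's scheduler: lr_at is a hypothetical function, only the constant and linear styles are covered, and it assumes that lr_decay_steps counts from step 0 so that decay spans the steps between the end of warmup and lr_decay_steps.

```python
def lr_at(step, lr, min_lr, warmup_steps, decay_steps, decay_style, warmup_init=0.0):
    """Learning rate at `step` under linear warmup followed by the chosen decay style."""
    # Linear warmup from warmup_init (cf. lr_warmup_init) up to lr.
    if step < warmup_steps:
        return warmup_init + (lr - warmup_init) * step / warmup_steps

    if decay_style == "constant":
        # verl default: hold the peak learning rate after warmup.
        return lr

    if decay_style == "linear":
        # Megatron pretraining default: decay linearly to min_lr by decay_steps.
        if decay_steps <= warmup_steps:
            raise ValueError("decay_steps must be larger than warmup_steps")
        if step >= decay_steps:
            return min_lr
        progress = (step - warmup_steps) / (decay_steps - warmup_steps)
        return lr - (lr - min_lr) * progress

    raise ValueError(f"decay style {decay_style!r} is not covered by this sketch")
```

For example, with lr=1e-6, min_lr=0.0, 10 warmup steps, and decay_steps=110, the linear style is halfway to min_lr at step 60, while the constant style still returns 1e-6.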

Critic Model

Most parameters for Critic are similar to Actor Model.

Reward Model

reward_model:
  enable: False
  model:
    input_tokenizer: ${actor_rollout_ref.model.path}  # set this to null if the chat template is identical
    path: ~/models/Anomy-RM-v0.1
    external_lib: ${actor_rollout_ref.model.external_lib}
    trust_remote_code: False
    fsdp_config:
      min_num_params: 0
      param_offload: False
  micro_batch_size_per_gpu: 16
  max_length: null
  reward_manager: naive

Customized Reward Function

custom_reward_function:
  path: null
  name: compute_score
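A file referenced by custom_reward_function.path would define a function with the configured name. The sketch below is one possible compute_score for numeric answers. The four-argument signature (data_source, solution_str, ground_truth, extra_info) is an assumption about what the trainer passes in and should be checked against the verl version in use; the scoring rule itself (compare the last number in the response) is purely illustrative.

```python
import re


def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 if the last number in the response equals the ground truth, else 0.0.

    `data_source` and `extra_info` are accepted but unused in this sketch.
    """
    # Drop thousands separators so "1,000" is read as "1000".
    cleaned = solution_str.replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == str(ground_truth).strip() else 0.0
```

Returning a plain float keeps the function easy to unit-test outside the trainer before pointing the config at it.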

Algorithm

algorithm:
  gamma: 1.0
  lam: 1.0
  adv_estimator: gae
  use_kl_in_reward: False
  kl_penalty: kl  # how to estimate kl divergence
  kl_ctrl:
    type: fixed
    kl_coef: 0.005
    horizon: 10000
    target_kl: 0.1
  # Rollout Correction
  rollout_correction:
    rollout_is: null  # IS weights
    rollout_is_threshold: 2.0  # Upper threshold for IS weights
    rollout_rs: null  # Rejection sampling
    rollout_rs_threshold: null  # RS upper threshold
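To show what gamma and lam control when adv_estimator is gae, here is a minimal, list-based sketch of generalized advantage estimation for a single response. The function name is hypothetical, the value after the final token is assumed to be zero, and verl's own estimator works on batched tensors with masking that is omitted here.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized advantage estimation over one trajectory.

    rewards[t] and values[t] are the per-step reward and critic value.
    The value after the last step is treated as 0.0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With gamma = lam = 1.0 (the defaults above), each advantage reduces to the sum of future rewards minus the critic value at that step; setting lam to 0.0 would leave only the one-step TD error.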

Trainer

trainer:
  total_epochs: 30
  project_name: verl_examples
  experiment_name: gsm8k
  logger: ['console', 'wandb']
  log_val_generations: 0
  nnodes: 1
  n_gpus_per_node: 8
  save_freq: -1
  val_before_train: True
  test_freq: 2
  critic_warmup: 0
  default_hdfs_dir: null  # hdfs checkpoint path
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}  # local checkpoint path
  resume_mode: auto  # or disable or resume_path if resume_from_path is set
  resume_from_path: null
  remove_previous_ckpt_in_save: False
  del_local_ckpt_after_load: False
  ray_wait_register_center_timeout: 300

This figure illustrates how the configurations affect the training.

https://excalidraw.com/#json=pfhkRmiLm1jnnRli9VFhb,Ut4E8peALlgAUpr7E5pPCA

https://github.com/user-attachments/assets/16aebad1-0da6-4eb3-806d-54a74e712c2d
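As a rough numeric companion to the batch-size settings in ppo_trainer.yaml, the sketch below derives a few quantities from the default values. It rests on assumptions that are not stated on this page: that ppo_mini_batch_size is a global size scaled by rollout.n and split evenly across data-parallel ranks, and that the per-GPU mini-batch is processed in micro-batches of ppo_micro_batch_size_per_gpu with gradient accumulation. The function name and returned keys are hypothetical.

```python
def batch_layout(
    train_batch_size,
    ppo_mini_batch_size,
    ppo_micro_batch_size_per_gpu,
    nnodes,
    n_gpus_per_node,
    rollout_n=1,
    sp_size=1,
):
    """Derive update counts and per-GPU batch sizes from the config values."""
    world_size = nnodes * n_gpus_per_node
    data_parallel_size = world_size // sp_size
    # Assumed normalization: global mini-batch (in sequences) split across ranks.
    mini_batch_per_gpu = ppo_mini_batch_size * rollout_n // data_parallel_size
    if mini_batch_per_gpu == 0:
        raise ValueError("ppo_mini_batch_size is too small for this many GPUs")
    if mini_batch_per_gpu % ppo_micro_batch_size_per_gpu != 0:
        raise ValueError("per-GPU mini-batch must be divisible by the micro-batch size")
    return {
        "world_size": world_size,
        "updates_per_ppo_epoch": train_batch_size // ppo_mini_batch_size,
        "mini_batch_per_gpu": mini_batch_per_gpu,
        "grad_accum_steps": mini_batch_per_gpu // ppo_micro_batch_size_per_gpu,
    }


# Defaults from the data, actor, and trainer sections of ppo_trainer.yaml.
defaults = batch_layout(
    train_batch_size=1024,
    ppo_mini_batch_size=256,
    ppo_micro_batch_size_per_gpu=8,
    nnodes=1,
    n_gpus_per_node=8,
)
```

Under these assumptions the defaults give 4 optimizer updates per PPO epoch, 32 sequences per GPU per update, and 4 gradient-accumulation steps.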

evaluation.yaml

Data

data:
  path: /tmp/math_Qwen2-7B-Instruct.parquet
  prompt_key: prompt
  response_key: responses
  data_source_key: data_source
  reward_model_key: reward_model

Customized Reward Function

custom_reward_function:
  path: null
  name: compute_score

sft_trainer.yaml for SFT FSDP Backend

Optim

optim:
  optimizer: AdamW
  optimizer_impl: torch.optim
  lr: 1e-5
  weight_decay: 0.01
  lr_warmup_steps_ratio: 0.1
  clip_grad: 1.0
  lr_scheduler: cosine
  override_optimizer_config: null

Model

Most parameters for Model are similar to Reward Model.

model:
  partial_pretrain: ~/models/gemma-1.1-7b-it
  fsdp_config:
    model_dtype: fp32
    wrap_policy:
      min_num_params: 0
    cpu_offload: False
    offload_params: False
  external_lib: null
  enable_gradient_checkpointing: False
  trust_remote_code: False
  lora_rank: 0
  lora_alpha: 16
  target_modules: all-linear
  use_liger: False