YAML Configuration Settings#
The library allows configuring a number of parameters in the YAML file to run large-scale training. The important categories and parameters are highlighted below. At the top level, we have the following keys:
```yaml
name: # Name of the experiment
model_source: # Model source code, could be megatron or hf
seed: # Random seed to be used for the entire experiment
trainer: # Settings to configure the PyTorch Lightning trainer
exp_manager: # Settings to configure logging/checkpointing
distributed_strategy: # Settings to configure how the model is to be distributed across devices
data: # Settings to configure the dataset/dataloader
model: # Settings to configure the model architecture and the optimizer
precision: # Settings to configure the model precision
compiler_flags: # Neuron compiler flags to be used
compiler_cache_url: # Cache to be used to save the compiled artifacts
aync_exec_max_inflight_requests: # Used to configure the runtime queue
bucket_size_collectives: # Collectives are batched into tensors of this size (in MB)
neuron_rt_exec_timeout: # Runtime execution timeout
neuron_experimental_compress_rg: # Whether to use compressed replica groups
```
Trainer#
The Neuronx Distributed Training framework is built on top of PyTorch Lightning, and this key allows users to configure the trainer.
```yaml
devices: 32
num_nodes: 1
max_epochs: -1
max_steps: 20000
log_every_n_steps: 1
val_check_interval: 20000
check_val_every_n_epoch: null
num_sanity_val_steps: 0
limit_val_batches: 1
limit_test_batches: 1
gradient_clip_val: 1.0
lnc: 2
sequential_move_factor: 11
```
Note
All the above trainer parameters follow the exact same definitions as the PyTorch Lightning Trainer. More information about each of them can be found here.
devices
Number of devices to be used for training. If using torchrun, this is equal to nproc_per_node * num_nodes (see the example below).
- Type: integer
- Required: True
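For illustration, a minimal sketch, assuming two nodes with 32 workers per node (the node size is an assumption):

```yaml
# Hypothetical 2-node run: nproc_per_node (32) * num_nodes (2) = 64
devices: 64
num_nodes: 2
```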
lnc
Neuron-specific setting that specifies the logical-to-physical Neuron Core mapping ratio. This parameter determines the number of physical Neuron cores used for each logical Neuron Core.
Values:
- lnc: 1 - Each node exposes 128 logical devices, with a 1:1 mapping between logical and physical Neuron Cores.
- lnc: 2 - Implements a 2:1 mapping between logical and physical Neuron Cores.
- Type: integer
- Required: False
- Default: None (must be explicitly set)
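As a hedged sketch (assuming a node with 128 physical Neuron Cores, as in the lnc: 1 case above), the lnc value determines how many logical devices the trainer sees per node:

```yaml
# lnc: 1 -> 128 logical devices per node (1:1 logical-to-physical)
# lnc: 2 -> 64 logical devices per node (2:1 logical-to-physical)
devices: 64
num_nodes: 1
lnc: 2
```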
num_nodes
Number of nodes to be used for training.
- Type: integer
- Required: True
max_epochs
Maximum number of epochs to run. A value of -1 means that the number of training steps is inferred from max_steps.
- Type: integer
- Required: True
log_every_n_steps
How often (in training steps) to log loss values.
- Default value: 1
- Type: integer
- Required: True
val_check_interval
How often to run the validation step. Using this parameter, one can run a validation step after every X training steps.
- Type: integer
- Required: True
check_val_every_n_epoch
Another parameter that controls the frequency of the validation step. Using this parameter, one can run a validation step after every X epochs.
- Type: integer
- Required: True
num_sanity_val_steps
How many sanity validation steps to run. Setting it to 0 skips the validation step at the start of training.
- Type: integer
- Required: True
limit_val_batches
Number of batches to run validation step on.
- Type: integer
- Required: True
gradient_clip_val
Float value to clip gradients at.
- Type: float
- Required: True
sequential_move_factor
Number of ranks/devices participating in initializing the model weights in parallel. Useful for reducing init time when using a TP-PP config. The value can be increased up to the number of trainer.devices being used.
- Default value: 11
- Type: integer
- Required: False
Experiment Manager#
This setting is mainly for configuring different aspects of experiment management like checkpointing, experiment logging directory, which parameters to log and how often to log, etc.
```yaml
log_local_rank_0_only: True
create_tensorboard_logger: True
explicit_log_dir: null
exp_dir: null
name: megatron_llama
resume_if_exists: True
resume_ignore_no_checkpoint: True
create_checkpoint_callback: True
checkpoint_callback_params:
  monitor: step
  save_top_k: 1
  mode: max
  save_last: False
  filename: 'megatron_llama--{step}-{consumed_samples}'
  every_n_train_steps: 200
  use_master_weights_in_ckpt: False
log_parameter_norm: True
log_gradient_norm: True
enable_recovery_time_instrumentation: False
save_xser: True
load_xser: True
async_checkpointing: False
resume_from_checkpoint: null
```
log_local_rank_0_only
Log only on rank 0. The recommended setting is True.
- Type: bool
- Default: False
- Required: False
create_tensorboard_logger
Setting this to True logs the loss and other parameters to TensorBoard.
- Type: bool
- Default: False
- Required: False
exp_log_dir
Explicitly specifies the logging directory. Otherwise, the framework saves to the current directory by default.
- Type: str
- Default: null
- Required: False
resume_if_exists
Set this to True to resume from an existing checkpoint. This config is useful for auto-resuming a failed training job.
- Type: bool
- Default: False
- Required: False
resume_ignore_no_checkpoint
The experiment manager errors out if resume_if_exists is True and no checkpoint is found. This behaviour can be disabled by setting resume_ignore_no_checkpoint to True, in which case exp_manager prints a message and continues without restoring.
- Type: bool
- Default: False
- Required: False
checkpoint_callback_params.save_top_k
How many checkpoints to keep around. Example: if set to 1, only one checkpoint is kept at any given time; the framework automatically deletes older checkpoints.
- Type: int
- Required: True
checkpoint_callback_params.every_n_train_steps
How often to checkpoint, in training steps (see the sketch below).
- Type: int
- Required: True
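Putting save_top_k and every_n_train_steps together, a minimal retention sketch (values are illustrative): write a checkpoint every 200 training steps and keep only the most recent one:

```yaml
checkpoint_callback_params:
  monitor: step
  mode: max
  save_top_k: 1            # keep only one checkpoint; older ones are deleted
  every_n_train_steps: 200 # write a checkpoint every 200 training steps
```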
checkpoint_callback_params.use_master_weights_in_ckpt
Whether or not to save master weights when checkpointing.
- Type: bool
- Default: False
- Required: False
log_parameter_norm
Set this to True to log the parameter norm across model-parallel ranks.
- Type: bool
- Default: False
- Required: False
log_gradient_norm
Set this to True to log the gradient norm across model-parallel ranks.
- Type: bool
- Default: False
- Required: False
enable_recovery_time_instrumentation
Set this to True to print detailed timing information during recovery; by default, it is not printed.
- Type: bool
- Default: False
- Required: False
save_xser
Set this to True to save checkpoints with torch-xla serialization (xser); it is recommended to enable xser for significantly faster save/load. Note that a checkpoint saved with xser can only be loaded with xser, and vice versa.
- Type: bool
- Default: False
- Required: False
load_xser
Set this to True to load checkpoints with torch-xla serialization (xser); it is recommended to enable xser for significantly faster save/load. Note that a checkpoint saved with xser can only be loaded with xser, and vice versa (see the sketch below).
- Type: bool
- Default: False
- Required: False
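Because the two serialization formats are incompatible, save_xser and load_xser are normally toggled together; a minimal sketch:

```yaml
save_xser: True # checkpoints written with xser...
load_xser: True # ...must also be read back with xser
```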
async_checkpointing
Set this to True to use async checkpointing. Under the hood, the library uses the async checkpointing feature provided by NeuronxDistributed's save API.
- Type: bool
- Default: False
- Required: False
resume_from_checkpoint
Set this to the checkpoint file to load from. Check the SFT/DPO/ORPO example configs under conf for how to use it.
- Type: str
- Default: null
- Required: False
ckpt_ptl_version
Set this only if your checkpoint does not contain the PyTorch Lightning version in it. This is the PyTorch Lightning version the checkpoint was saved with.
- Type: str
- Default: “2.5.0”
- Required: False
Distributed Strategy#
```yaml
tensor_model_parallel_size: 8
pipeline_model_parallel_size: 1
virtual_pipeline_model_parallel_size: 1
zero1: True
sequence_parallel: True
kv_replicator: 4
```
This setting allows users to configure the sharding strategy to be used for distributing the model across workers.
tensor_model_parallel_size
Tensor parallel degree to be used for sharding models.
- Type: int
- Required: True
pipeline_model_parallel_size
Pipeline parallel degree to be used for sharding models.
- Type: int
- Required: True
virtual_pipeline_model_parallel_size
Interleaved pipeline parallel degree. Use a value of 1 if interleaved pipeline parallelism is not used.
- Type: int
- Required: True
context_parallel_size
Context parallel degree to be used for sharding along the sequence dimension. When context_parallel_size is greater than 1, fusions.ring_attention must be set to True (see the sketch after this list).
- Type: int
- Required: False
- Default: 1
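A minimal sketch of the required pairing (the degree value is illustrative):

```yaml
distributed_strategy:
  context_parallel_size: 4 # shard the sequence across 4 ranks
model:
  fusions:
    ring_attention: True # required whenever context_parallel_size > 1
```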
zero1
Wraps the optimizer with zero1.
- Type: bool
- Required: True
sequence_parallel
Shards along the sequence dimension. Sequence parallelism is always used in conjunction with tensor parallelism; the sequence dimension is sharded with the same degree as tensor_model_parallel_size.
- Type: bool
- Required: True
kv_replicator
This parameter is used together with the qkv_linear parameter. It configures the GQAQKVLinear module (see the layout sketch below).
- Type: int
- Required: True
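The parallel degrees multiply to partition the available workers, and the data-parallel degree is derived from what remains. A hedged layout sketch, assuming 128 workers and a GQA model with 8 KV heads (both numbers are assumptions for illustration):

```yaml
tensor_model_parallel_size: 32   # TP
pipeline_model_parallel_size: 2  # PP
virtual_pipeline_model_parallel_size: 1
zero1: True
sequence_parallel: True
kv_replicator: 4 # e.g. 8 KV heads x 4 replicas = 32, matching the TP degree
# derived data-parallel degree: 128 workers / (TP 32 * PP 2) = 2
```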
Data#
This is where we configure the dataset/dataloader. This config depends on the dataloader/dataset being used. Users can add custom keys to this config and read them inside the CustomDataModule using cfg.data. Currently the library supports three kinds of data modules: MegatronDataModule, ModelAlignmentDataModule and HFDataModule. To learn about the config parameters of MegatronDataModule, please check megatron_llama_7B_config.yaml; for ModelAlignmentDataModule, check megatron_llama2_7B_SFT_config.yaml; and for HFDataModule, refer to hf_llama3_8B_config.yaml.
The parameters that are common across all the configs are documented below.
```yaml
micro_batch_size: 1
global_batch_size: 1024
```
micro_batch_size
The batch is distributed across multiple data-parallel ranks, and within each rank we accumulate gradients. The micro batch size is the batch size used for each of those gradient-computation steps.
- Type: int
- Required: True
global_batch_size
Together with micro_batch_size and the data-parallel degree, this config determines the gradient accumulation count automatically (see the worked sketch below).
- Type: int
- Required: True
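A worked sketch of that derivation (the numbers are illustrative):

```yaml
micro_batch_size: 1     # per-rank batch for each forward/backward pass
global_batch_size: 1024 # effective batch per optimizer step across all DP ranks
# with a data-parallel degree of 32:
#   gradient accumulation steps = 1024 / (1 * 32) = 32
```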
Model#
This is where we can configure the model architecture. When building custom models, this config can be used to parameterize the custom model. The below parameters are taken from an example of the Megatron model config. Depending on the model and required parameters, this config can change.
HF Model#
Let’s start with the config for the HF model:
```yaml
# model architecture
model_config: /home/ubuntu/config.json
encoder_seq_length: 4096
max_position_embeddings: ${.encoder_seq_length}
num_layers: 4
hidden_size: 4096
qkv_linear: False

# Miscellaneous
use_cpu_initialization: True

# Activation Checkpointing
activations_checkpoint_granularity: selective
activations_checkpoint_recompute: [CoreAttention]

fusions:
  softmax: True
  flash_attention: False

do_layer_norm_weight_decay: False

optim:
  name: adamw_fp32OptState
  lr: 3e-4
  weight_decay: 0.01
  capturable: False
  betas:
    - 0.9
    - 0.999
  sched:
    name: LinearAnnealingWithWarmUp
    warmup_steps: 100
    max_steps: ${trainer.max_steps}
```
model_config
Points to the config.json path required by the transformers model implementation. One such example of config.json is here.
- Type: str
- Required: True
encoder_seq_length
Sets the sequence length for the training job. This parameter is common to all models supported in the library.
- Type: int
- Required: True
num_layers
This config overrides the number of layers inside the config.json given by model_config. It is exposed so that one can quickly increase or decrease the size of the model. This parameter is common to all models supported in the library.
- Type: int
- Required: True
hidden_size
This config overrides the hidden_size inside the config.json given by model_config. This parameter is common to all models supported in the library.
- Type: int
- Required: True
qkv_linear
This needs to be set to True if users want to use the GQAQKVLinear module.
- Type: bool
- Required: True
fuse_qkv
Set this to True to use fused Q, K and V tensors in the GQAQKVLinear module. Using fuse_qkv can improve throughput. This parameter is True by default.
- Type: bool
- Required: False
transpose_nki_inputs
Set this to True to transpose the inputs to the NKI flash attention function. To be used only when fusions.flash_attention is True. Using transpose_nki_inputs with fusions.flash_attention can improve throughput. This parameter is True by default for all models, unless set otherwise.
- Type: bool
- Required: False
pipeline_cuts
Set this to a list of layer names to specify manual cut points for pipeline parallelism. One example is ['model.layers.10', 'model.layers.20'] in the case of PP=3.
- Type: List[str]
- Required: False
Note
When using this param, the number of pipeline cuts should always be pipeline_model_parallel_size - 1, as in the sketch below.
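A hedged sketch for PP=3 (the layer names assume an HF Llama-style module hierarchy):

```yaml
distributed_strategy:
  pipeline_model_parallel_size: 3
model:
  pipeline_cuts: ['model.layers.10', 'model.layers.20'] # PP - 1 = 2 cut points
```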
use_cpu_initialization
Setting this flag to True initializes the weights on CPU and then moves them to device. It is recommended to set this flag to True. This parameter is common to all models supported in the library.
- Type: bool
- Required: True
activations_checkpoint_granularity
This flag controls which modules are recomputed during the backward pass.
Values:
- selective - Enables selective recomputation of the modules specified in activations_checkpoint_recompute during the backward pass.
- full - Saves activations at layer boundaries and recomputes the entire layer during the backward pass.
- null - Disables activation checkpointing.
More information on activation recompute can be found in this link. This parameter is common to all models supported in the library.
- Type: str
- Possible Values: selective, full, null
- Required: True
activations_checkpoint_recompute
This config specifies which modules to recompute when using selective activation checkpointing. It accepts a list of module names as strings, or null (see the example after this list).
- Type: list[str] or null
- Required: False
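For instance, the two granularities might be configured as follows (the module name is taken from the reference example above):

```yaml
# selective: recompute only the listed modules during the backward pass
activations_checkpoint_granularity: selective
activations_checkpoint_recompute: [CoreAttention]

# full: save at layer boundaries and recompute whole layers
# activations_checkpoint_granularity: full
# activations_checkpoint_recompute: null
```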
fusions.softmax
Setting this flag to True replaces torch.nn.Softmax with a fused custom Softmax operator. This parameter is common to all models supported in the library.
- Type: bool
- Required: True
fusions.flash_attention
Setting this flag to True inserts the flash attention module for both the forward and backward passes. This parameter is common to all models supported in the library.
- Type: bool
- Required: True
fusions.ring_attention
Setting this flag to True uses the ring attention module for both the forward and backward passes. This parameter must be True when context_parallel_size is greater than 1.
- Type: bool
- Required: False
fusions.do_layer_norm_weight_decay
Setting this flag to True adds layer norm weight decay. This parameter is common to all models supported in the library.
- Type: bool
- Required: True
optim
This is where the optimizers can be set. We can configure the optimizers supported by NeMo. All the optimizers can be configured according to the parameters specified here.
- Type: config
- Possible Values: adamw, adamw_fp32OptState, sgd, adam, adadelta, adamax, adagrad, rmsprop, rprop, novograd, adafactor
- Required: True
optim.sched
This is where the LR schedulers can be set. We can configure the schedulers supported by NeMo. All the schedulers can be configured according to the parameters specified here.
- Type: config
- Possible Values: LinearAnnealingWithWarmUp, CosineAnnealing, WarmupPolicy, WarmupHoldPolicy, SquareAnnealing, NoamAnnealing, WarmupAnnealing, StepLR, rprop, ExponentialLR
- Required: True
Megatron Model#
The library enables a megatron transformer model which can be configured from the YAML file. The different available parameters are documented below, after the following reference example.
```yaml
# model architecture
encoder_seq_length: 4096
max_position_embeddings: ${.encoder_seq_length}
num_layers: 32
hidden_size: 4096
ffn_hidden_size: 11008
num_attention_heads: 32
num_kv_heads: 32
init_method_std: 0.021
hidden_dropout: 0
attention_dropout: 0
ffn_dropout: 0
apply_query_key_layer_scaling: True
normalization: 'rmsnorm'
layernorm_epsilon: 1e-5
do_layer_norm_weight_decay: False # True means weight decay on all params
make_vocab_size_divisible_by: 8 # Pad the vocab size to be divisible by this value for computation efficiency
persist_layer_norm: True # Use of persistent fused layer norm kernel
share_embeddings_and_output_weights: False # Untie embedding and output layer weights
position_embedding_type: 'rope' # Position embedding type. Options: ['learned_absolute', 'rope']
rotary_percentage: 1 # If using position_embedding_type=rope, the per-head dim is multiplied by this
activation: 'swiglu' # ['swiglu', 'gelu']
has_bias: False

# Miscellaneous
use_cpu_initialization: True

# Activation Checkpointing
activations_checkpoint_granularity: selective # 'selective' or 'full'

fusions:
  softmax: True
  flash_attention: False # Use NKI flash attention

optim:
  name: adamw
  lr: 3e-4
  weight_decay: 0.1
  capturable: True
  betas:
    - 0.9
    - 0.95
  sched:
    name: CosineAnnealing
    warmup_steps: 2000
    constant_steps: 0
    min_lr: 3.0e-5
```
Note
For common config, please refer to the HF Model section above.
ffn_hidden_size
Transformer FFN hidden size.
- Type: int
- Required: True
num_attention_heads
Number of Q (query) attention heads.
- Type: int
- Required: True
num_kv_heads
Number of KV heads. This is where we can configure Q and KV differently to create GQA modules.
- Type: int
- Required: True
init_method_std
Standard deviation to use when initializing the layers of the transformer model.
- Type: float
- Required: True
hidden_dropout
Dropout probability for the transformer hidden state.
- Type: float
- Required: True
attention_dropout
Dropout probability in the attention layer.
- Type: float
- Required: True
ffn_dropout
Dropout probability in the feed-forward layer.
- Type: float
- Required: True
apply_query_key_layer_scaling
Scale Q * K^T by (1 / layer-number).
- Type: bool
- Required: True
normalization
Normalization layer to use.
- Type: str
- Possible Values: rmsnorm, layernorm
- Required: True
layernorm_epsilon
Epsilon value for layernorm.
- Type: float
- Required: True
share_embeddings_and_output_weights
Setting this parameter to True ties the vocab embedding weight with the final MLP weight.
- Type: bool
- Required: True
make_vocab_size_divisible_by
For example, if your vocab size is 31999 and you set this value to 4, the framework pads the vocab size so that it becomes divisible by 4; here, the closest divisible value is 32000 (see the sketch below).
- Type: int
- Required: True
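The padding rule from the example above, as a sketch:

```yaml
make_vocab_size_divisible_by: 4
# padded vocab size = ceil(31999 / 4) * 4 = 32000
```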
position_embedding_type
Type of position embedding to be used.
- Type: str
- Possible Values: learned_absolute, rope
- Required: True
rotary_percentage
If using position_embedding_type=rope, the per-head dim is multiplied by this factor.
- Type: float
- Required: True
activation
Users can specify the activation function to be used in the model.
- Type: str
- Possible Values: swiglu, gelu
- Required: True
has_bias
Setting this parameter to True adds a bias to each of the linear layers in the model.
- Type: bool
- Required: True
Precision#
This config decides the dtype of the model and optimizer.
```yaml
precision:
  type: 'mixed_precision' # ['bf16SR', 'fp32', 'autocast', 'mixed_precision', 'mixed_precisionSR', 'manual']
  # Set the following only if precision type is manual; otherwise they are set automatically.
  master_weights: False
  fp32_grad_acc: False
  xla_use_bf16: '0'
  xla_downcast_bf16: '0'
  neuron_rt_stochastic_rounding_en: '0'
  parallel_layers_reduce_dtype: 'bf16'
```
Note
Only if the precision type is manual are master_weights, fp32_grad_acc, xla_use_bf16, xla_downcast_bf16, and neuron_rt_stochastic_rounding_en picked up from the config. These parameters allow finer control of precision. It is recommended to use the mixed_precision config for better accuracy.
type
mixed_precision
The mixed_precision config uses the zero1 optimizer. It performs gradient accumulation and gradient collective communication in fp32, and keeps the master copy of the weights in fp32. It also sets the xla_downcast_bf16 environment variable to 1 and disables stochastic rounding.
mixed_precisionSR
mixed_precisionSR is a superset of the mixed_precision config, with stochastic rounding enabled.
bf16SR
The bf16SR config performs all operations in bf16 and relies on the stochastic rounding feature for accuracy gains.
autocast
The autocast config follows the exact same precision strategy as torch.autocast.
Note
Autocast is not supported in this release.
manual
To gain control of the different precision knobs, one can set the precision type to manual and control parameters like master_weights, fp32_grad_acc, xla_use_bf16, xla_downcast_bf16 and neuron_rt_stochastic_rounding_en.
parallel_layers_reduce_dtype
This config performs the reduce collectives (all-reduce and reduce-scatter) within the parallel layers in the specified precision. If the fp32 precision type is used, the reduce dtype is implicitly set to fp32; in all other cases it defaults to bf16 unless specified (see the sketch below).
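For example, to run the reduce collectives in fp32 while keeping the recommended mixed-precision strategy (an illustrative sketch):

```yaml
precision:
  type: 'mixed_precision'
  parallel_layers_reduce_dtype: 'fp32' # all-reduce/reduce-scatter in fp32 instead of the bf16 default
```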
Model Alignment Specific#
You can configure fine-tuning (SFT) or model alignment (DPO/ORPO) through the YAML file, along with parameter-efficient fine-tuning using LoRA.
```yaml
model_alignment_strategy:
  # DPO-specific config
  dpo:
    kl_beta: 0.01
    loss_type: sigmoid
    max_prompt_length: 2048
    precompute_ref_log_probs: True
    truncation_mode: keep_start

  # Alternatively, you can use the SFT-specific config
  sft:
    packing: True

  # Alternatively, you can use the ORPO-specific config
  orpo:
    beta: 0.01
    max_prompt_length: 2048
    truncation_mode: keep_start

  # Parameter-efficient fine-tuning - LoRA config
  peft:
    lora_rank: 16
    lora_alpha: 32
    lora_dropout: 0.05
    lora_bias: "none"
    lora_verbose: True
    target_modules: ["qkv_proj"]
```
model_alignment_strategy
Set this only when using fine-tuning-specific algorithms (SFT, DPO, ORPO, etc.) and their related hyperparameters.
DPO-specific parameters.
dpo
kl_beta
KL-divergence beta that controls the divergence of the policy model from the reference model.
- Type: float
- Default: 0.01
- Required: True
loss_type
Currently only the sigmoid version of the optimized DPO loss is supported.
- Type: str
- Default: sigmoid
- Required: True
max_prompt_length
Maximum length of the prompt in the concatenated prompt and (chosen/rejected) response input.
- Type: integer
- Required: True
precompute_ref_log_probs
Enables precomputation of the reference-model log probabilities using a pre-fit hook. False is not currently supported.
- Type: bool
- Required: True
truncation_mode
Defines how to truncate if the size of (prompt + response) exceeds seq_length. Options: ["keep_start", "keep_end"].
- Type: str
- Default: keep_start
- Required: True
SFT-specific parameters.
sft
packing
Packs multiple records into a single record up to the sequence length supported by the model; if False, pad tokens are used to reach the sequence length. Setting it to True increases throughput but might impact accuracy.
- Type: bool
- Default: False
- Required: False
Odds Ratio Preference Optimization (ORPO) specific parameters.
orpo
beta
KL-divergence beta that controls the divergence of the policy model from the reference model.
- Type: float
- Default: 0.01
- Required: True
max_prompt_length
Maximum length of the prompt in the concatenated prompt and (chosen/rejected) response input.
- Type: integer
- Required: True
truncation_mode
Defines how to truncate if the size of (prompt + response) exceeds seq_length. Options: ["keep_start", "keep_end"].
- Type: str
- Default: keep_start
- Required: True
peft
Configuration options for Parameter-Efficient Fine-Tuning (PEFT) methods, specifically LoRA settings.
lora_rank
Rank of LoRA; determines the number of trainable parameters. A higher rank allows more expressive adaptations but increases memory usage.
- Type: int
- Default: 16
- Required: True
lora_alpha
Scaling factor for LoRA updates; affects the magnitude of LoRA adaptations.
- Type: int
- Default: 32
- Required: True
lora_dropout
Dropout rate for LoRA layers to prevent overfitting.
- Type: float
- Default: 0.05
- Required: False
lora_bias
Bias type for LoRA. Determines which biases are trainable. Can be 'none', 'all' or 'lora_only'.
- Type: str
- Default: “none”
- Required: False
lora_verbose
Enables detailed LoRA-related logging during training.
- Type: bool
- Default: False
- Required: False
target_modules
List of model modules to which LoRA is applied.
- Type: list[str]
- Default: [“qkv_proj”] (for Llama)
- Required: True
This document is relevant for: Trn1, Trn2