
YAML Configuration Settings#

The library allows configuring a number of parameters in the YAML file to run large-scale training. The important categories and parameters are highlighted below. At the top level, we have the following keys:

name: # Name of the experiment
model_source: # Model source code, could be megatron or hf
seed: # Random seed to be used for the entire experiment
trainer: # Settings to configure the PyTorch Lightning trainer
exp_manager: # Settings to configure logging/checkpointing
distributed_strategy: # Settings to configure how the model is to be distributed across devices
data: # Settings to configure the dataset/dataloader
model: # Settings to configure the model architecture and the optimizer
precision: # Settings to configure the model precision
compiler_flags: # Neuron compiler flags to be used
compiler_cache_url: # Cache to be used to save the compiled artifacts
aync_exec_max_inflight_requests: # Used to configure the runtime queue
bucket_size_collectives: # Collectives are batched into tensors of this size (in MBs)
neuron_rt_exec_timeout: # Runtime timeout
neuron_experimental_compress_rg: # To use compress replica group

Trainer#

The Neuronx Distributed Training framework is built on top of PyTorch Lightning, and this key allows users to configure the trainer.

devices: 32
num_nodes: 1
max_epochs: -1
max_steps: 20000
log_every_n_steps: 1
val_check_interval: 20000
check_val_every_n_epoch: null
num_sanity_val_steps: 0
limit_val_batches: 1
limit_test_batches: 1
gradient_clip_val: 1.0
lnc: 2
sequential_move_factor: 11

Note

All of the above trainer parameters follow the exact same definitions as the PyTorch Lightning Trainer. More information about each of them can be found here.

devices

Number of devices to be used for training. If using torchrun, this is equal to nproc_per_node * num_nodes.
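For example, a hypothetical two-node run with 32 workers per node (the worker count is an assumption for illustration; it depends on the instance type) would be configured as:

trainer:
  devices: 64    # nproc_per_node (32, assumed) * num_nodes (2)
  num_nodes: 2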

lnc

Neuron-specific setting that specifies the logical-to-physical Neuron Core mapping ratio. This parameter determines the number of physical Neuron cores used for each logical Neuron Core.

Values: 1 or 2. For example, lnc: 2 maps each logical NeuronCore onto two physical NeuronCores, while lnc: 1 uses a one-to-one mapping.

num_nodes

Number of nodes to be used for training

max_epochs

Maximum number of epochs to run. A value of -1 means that the number of training steps would be inferred from max_steps

log_every_n_steps

How often to log loss values

val_check_interval

How often to run the validation step. Using this parameter, one can run the validation step after every X training steps.

check_val_every_n_epoch

Another parameter that controls the frequency of the validation step. Using this parameter, one can run the validation step after every X epochs.

num_sanity_val_steps

How many sanity validation steps to run. Setting it to 0 skips the validation step at the start of training.

limit_val_batches

Number of batches to run validation step on.
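Putting the validation settings together, a sketch that runs validation on 10 batches every 1000 training steps (the values are chosen purely for illustration) might look like:

trainer:
  val_check_interval: 1000       # run validation every 1000 training steps
  check_val_every_n_epoch: null  # disable epoch-based validation scheduling
  num_sanity_val_steps: 0        # skip the sanity validation run at startup
  limit_val_batches: 10          # validate on 10 batches each time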

gradient_clip_val

Float value to clip gradients at.

sequential_move_factor

Number of ranks/devices participating in initializing the model weights in parallel. Useful for reducing initialization time when using a TP-PP configuration. The value can be increased up to the number of trainer.devices being used.

Experiment Manager#

This setting is mainly for configuring different aspects of experiment management like checkpointing, experiment logging directory, which parameters to log and how often to log, etc.

log_local_rank_0_only: True
create_tensorboard_logger: True
explicit_log_dir: null
exp_dir: null
name: megatron_llama
resume_if_exists: True
resume_ignore_no_checkpoint: True
create_checkpoint_callback: True
checkpoint_callback_params:
  monitor: step
  save_top_k: 1
  mode: max
  save_last: False
  filename: 'megatron_llama--{step}-{consumed_samples}'
  every_n_train_steps: 200
  use_master_weights_in_ckpt: False
log_parameter_norm: True
log_gradient_norm: True
enable_recovery_time_instrumentation: False
save_xser: True
load_xser: True
async_checkpointing: False
resume_from_checkpoint: null

log_local_rank_0_only

Log only on rank 0. The recommended setting is True.

create_tensorboard_logger

Setting this to True logs the loss and other parameters to TensorBoard.

explicit_log_dir

Explicitly specifies the logging directory. Otherwise, the framework saves to the current directory by default.

resume_if_exists

Set this to True to resume from an existing checkpoint. This config will be useful when we want to auto-resume from a failed training job.

resume_ignore_no_checkpoint

The experiment manager errors out if resume_if_exists is True and no checkpoint is found. This behaviour can be disabled by setting resume_ignore_no_checkpoint to True, in which case exp_manager prints a message and continues without restoring.

checkpoint_callback_params.save_top_k

How many checkpoints to keep. For example, if set to 1, only one checkpoint is kept at any given time; the framework automatically deletes older checkpoints.

checkpoint_callback_params.every_n_train_steps

How often to checkpoint, in training steps.

checkpoint_callback_params.use_master_weights_in_ckpt

Whether or not to save master weights when checkpointing.
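As an illustrative sketch (the values are assumptions, not recommendations), a callback that keeps the three most recent step-based checkpoints and saves every 500 training steps could look like:

checkpoint_callback_params:
  monitor: step
  mode: max
  save_top_k: 3             # keep the 3 checkpoints with the highest step, i.e. the most recent
  save_last: False
  every_n_train_steps: 500
  use_master_weights_in_ckpt: False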

log_parameter_norm

Set this to log parameter norm across model parallel ranks.

log_gradient_norm

Set this to log gradient norm across model parallel ranks.

enable_recovery_time_instrumentation

Set this to True if you want detailed timing for recovery to be printed; by default it is not printed.

save_xser

Set this to save with torch-xla serialization, which reduces save time. It is recommended to enable xser for significantly faster save/load. Note that a checkpoint saved with xser can only be loaded with xser, and vice versa.

load_xser

Set this to load with torch-xla serialization. It is recommended to enable xser for significantly faster save/load. Note that a checkpoint saved with xser can only be loaded with xser, and vice versa.

async_checkpointing

Set this if you want to use async checkpointing. Under the hood, the library uses the async checkpointing feature provided by NeuronxDistributed's save API.

resume_from_checkpoint

Set this to the checkpoint file to load from. See the SFT/DPO/ORPO example configs under conf for how to use it.
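A minimal sketch, using a hypothetical checkpoint path (the path is an assumption for illustration only):

exp_manager:
  resume_from_checkpoint: /fsx/experiments/megatron_llama/checkpoints/my_checkpoint.ckpt  # hypothetical path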

ckpt_ptl_version

Set this only if your checkpoint does not contain the pytorch-lightning version in it. This version is the pytorch-lightning version the checkpoint was saved with.

Distributed Strategy#

tensor_model_parallel_size: 8
pipeline_model_parallel_size: 1
virtual_pipeline_model_parallel_size: 1
zero1: True
sequence_parallel: True
kv_replicator: 4

This setting allows users to configure the sharding strategy to be used for distributing the model across workers.

tensor_model_parallel_size

Tensor parallel degree to be used for sharding models.

pipeline_model_parallel_size

Pipeline parallel degree to be used for sharding models.

virtual_pipeline_model_parallel_size

Interleaved pipeline parallel degree. Use a value of 1 if no pipeline parallelism is used.

context_parallel_size

Context parallel degree to be used for sharding the sequence dimension. When context_parallel_size is greater than 1, fusions.ring_attention must be set to True.
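A minimal sketch of enabling context parallelism (the degree of 2 is an arbitrary illustrative choice, and fusions is assumed to live under the model section as in the examples later in this document):

distributed_strategy:
  context_parallel_size: 2

model:
  fusions:
    ring_attention: True   # required when context_parallel_size > 1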

zero1

Wraps the optimizer with zero1.

sequence_parallel

To shard along the sequence dimension. Sequence parallelism is always used in conjunction with tensor parallelism. The sequence dimension is sharded with the same degree as tensor_model_parallel_size.

kv_replicator

This parameter is used together with the qkv_linear parameter. It is used to configure the GQAQKVLinear module.
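As an illustrative sketch (the head count and parallel degrees are assumptions, not values taken from this document), kv_replicator is typically chosen so that the replicated KV heads can be sharded evenly across the tensor-parallel ranks:

distributed_strategy:
  tensor_model_parallel_size: 32
  kv_replicator: 4              # assumed 8 KV heads * 4 replicas = 32, matching the TP degree

model:
  qkv_linear: True              # enables the GQAQKVLinear module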

Data#

This is where we configure the dataset/dataloader. This config is dependent on the dataloader/dataset being used. Users can add custom keys in this config and read them inside the CustomDataModule using cfg.data. Currently the library adds support for 3 kinds of data modules: MegatronDataModule, ModelAlignmentDataModule and HFDataModule. To learn about the config parameters of MegatronDataModule, please check the megatron_llama_7B_config.yaml; for ModelAlignmentDataModule, check the megatron_llama2_7B_SFT_config.yaml; and for HFDataModule, refer to hf_llama3_8B_config.yaml.

The parameters that are common across all the configs are documented below.

micro_batch_size: 1
global_batch_size: 1024

micro_batch_size

The batch is distributed across multiple data-parallel ranks, and within each rank we accumulate gradients. The micro batch size is the batch size used for each of those gradient computation steps.

global_batch_size

This config, along with micro_batch_size and the data-parallel degree, automatically determines the number of gradient accumulation steps.
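As a worked sketch (the device counts and parallel degrees are assumptions for illustration), the number of gradient accumulation steps is global_batch_size / (micro_batch_size * data_parallel_size):

# Assume 64 devices with tensor_model_parallel_size: 8 and pipeline_model_parallel_size: 1,
# giving a data-parallel size of 64 / (8 * 1) = 8.
micro_batch_size: 1
global_batch_size: 1024
# gradient accumulation steps = 1024 / (1 * 8) = 128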

Model#

This is where we can configure the model architecture. When building custom models, this config can be used to parameterize the custom model. The below parameters are taken from an example of the Megatron model config. Depending on the model and required parameters, this config can change.

HF Model#

Let’s start with the config for the HF model:

model architecture

model_config: /home/ubuntu/config.json
encoder_seq_length: 4096
max_position_embeddings: ${.encoder_seq_length}
num_layers: 4
hidden_size: 4096
qkv_linear: False

Miscellaneous

use_cpu_initialization: True

Activation Checkpointing

activations_checkpoint_granularity: selective
activations_checkpoint_recompute: [CoreAttention]

fusions:
  softmax: True
  flash_attention: False

do_layer_norm_weight_decay: False

optim:
  name: adamw_fp32OptState
  lr: 3e-4
  weight_decay: 0.01
  capturable: False
  betas:
    - 0.9
    - 0.999
  sched:
    name: LinearAnnealingWithWarmUp
    warmup_steps: 100
    max_steps: ${trainer.max_steps}

model_config

Points to the config.json path required by the transformers model implementation. One such example of config.json is here.

encoder_seq_length

Setting the sequence length for the training job. This parameter is common for all models supported in the library.

num_layers

This config will override the number of layers inside the config.json in the model_config. This is exposed so that one can quickly increase/decrease the size of the model. This parameter is common for all models supported in the library.

hidden_size

This config will override the hidden_size inside the config.json in the model_config. This parameter is common for all models supported in the library.

qkv_linear

This needs to be set if users want to use the GQAQKVLinear module.

fuse_qkv

This is set if users want to use fused q, k, and v tensors in the GQAQKVLinear module. Using fuse_qkv can improve throughput. This parameter is True by default.

transpose_nki_inputs

This is set if users want to transpose the inputs to the NKI FlashAttention function. To be used only when fusions.flash_attention is True. Using transpose_nki_inputs with fusions.flash_attention can improve throughput. This parameter is True by default for all models, unless set otherwise.

pipeline_cuts

This is set as a list of layer names if users want to specify manual cut points for pipeline parallelism. One example is ['model.layers.10', 'model.layers.20'] in the case of PP=3.

Note

When using this param, the number of pipeline cuts should always be pipeline_model_parallel_size-1.
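A minimal sketch for a hypothetical 32-layer model split into 4 pipeline stages (the layer names assume a typical HF Llama naming scheme and are illustrative only):

distributed_strategy:
  pipeline_model_parallel_size: 4

model:
  # 4 pipeline stages require exactly 3 cut points
  pipeline_cuts: ['model.layers.7', 'model.layers.15', 'model.layers.23']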

use_cpu_initialization

Setting this flag to True will initialize the weights on CPU and then move to device. It is recommended to set this flag to True. This parameter is common for all models supported in the library.

activations_checkpoint_granularity

This flag controls which module needs to be recomputed during the backward pass.

Values: selective or full. When set to selective, only the modules specified in activations_checkpoint_recompute are recomputed.

More information on activation recompute can be found in this link. This parameter is common for all models supported in the library.

activations_checkpoint_recompute

This config specifies which modules to recompute when using selective activation checkpointing. It accepts a list of module names as strings, or null.
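For example, to recompute entire transformer layers rather than individual modules, the granularity can be set to full, in which case (as an assumption based on the description above) no module list is needed:

activations_checkpoint_granularity: full
activations_checkpoint_recompute: null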

fusions.softmax

Setting this flag to True will replace the torch.nn.Softmax with a fused custom Softmax operator. This parameter is common for all models supported in the library.

fusions.flash_attention

Setting this flag to True will insert the flash attention module for both forward and backward. This parameter is common for all models supported in the library.

fusions.ring_attention

Setting this flag to True will use the ring attention module for both forward and backward. This parameter must be True when context_parallel_size is greater than 1.

fusions.do_layer_norm_weight_decay

Setting this flag to True will add layer norm weight decay. This parameter is common for all models supported in the library.

optim

This is where the optimizers can be set. We can configure the optimizers supported by NeMo. All the optimizers can be configured according to the parameters specified here.

optim.sched

This is where the LR schedulers can be set. We can configure the schedulers supported by NeMo. All the schedulers can be configured according to the parameters specified here.

Megatron Model#

The library enables a megatron transformer model which can be configured from the YAML file. The different available parameters are documented below, after the following reference example.

model architecture

encoder_seq_length: 4096
max_position_embeddings: ${.encoder_seq_length}
num_layers: 32
hidden_size: 4096
ffn_hidden_size: 11008
num_attention_heads: 32
num_kv_heads: 32
init_method_std: 0.021
hidden_dropout: 0
attention_dropout: 0
ffn_dropout: 0
apply_query_key_layer_scaling: True
normalization: 'rmsnorm'
layernorm_epsilon: 1e-5
do_layer_norm_weight_decay: False # True means weight decay on all params
make_vocab_size_divisible_by: 8 # Pad the vocab size to be divisible by this value for computation efficiency.
persist_layer_norm: True # Use of persistent fused layer norm kernel.
share_embeddings_and_output_weights: False # Untie embedding and output layer weights.
position_embedding_type: 'rope' # Position embedding type. Options ['learned_absolute', 'rope']
rotary_percentage: 1 # If using position_embedding_type=rope, then the per head dim is multiplied by this.
activation: 'swiglu' # ['swiglu', 'gelu']
has_bias: False

Miscellaneous

use_cpu_initialization: True

Activation Checkpointing

activations_checkpoint_granularity: selective # 'selective' or 'full'

fusions:
  softmax: True
  flash_attention: False # Use NKI flash attention

optim:
  name: adamw
  lr: 3e-4
  weight_decay: 0.1
  capturable: True
  betas:
    - 0.9
    - 0.95
  sched:
    name: CosineAnnealing
    warmup_steps: 2000
    constant_steps: 0
    min_lr: 3.0e-5

Note

For common config, please refer to the HF Model section above.

ffn_hidden_size

Transformer FFN hidden size.

num_attention_heads

Number of Q attention heads.

num_kv_heads

Number of KV heads. This is where we can configure Q and KV differently to create GQA modules.
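For instance, a grouped-query attention layout in which every 4 query heads share one KV head could be sketched as follows (the head counts are illustrative assumptions):

num_attention_heads: 32
num_kv_heads: 8   # 32 Q heads / 8 KV heads = 4 query heads per KV head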

init_method_std

Standard deviation to use when initializing the layers of the transformer model.

hidden_dropout

Dropout probability for the transformer hidden states.

attention_dropout

Dropout probability in the attention layer.

ffn_dropout

Dropout probability in the feed-forward layer.

apply_query_key_layer_scaling

Scale Q * K^T by (1 / layer-number).

normalization

Normalization layer to use.

layernorm_epsilon

Epsilon value for layernorm.

share_embeddings_and_output_weights

Setting this parameter to True will tie the vocab embedding weight with the final MLP weight.

make_vocab_size_divisible_by

For example, if your vocab size is 31999 and you set this value to 4, the framework pads the vocab size so that it becomes divisible by 4; in this case, the closest divisible value is 32000.

position_embedding_type

Type of position embedding to be used.

rotary_percentage

If using position_embedding_type=rope, then the per head dim is multiplied by this factor.

activation

Users can specify the activation function to be used in the model.

has_bias

Setting this parameter to True will add bias to each of the linear layers in the model.

Precision#

This config can help to decide the dtype of the model/optimizer.

precision:
  type: 'mixed_precision' # ['bf16SR', 'fp32', 'autocast', 'mixed_precision', 'mixed_precisionSR', 'manual']
  # Set the following only if precision type is manual, otherwise they will be automatically set.
  master_weights: False
  fp32_grad_acc: False
  xla_use_bf16: '0'
  xla_downcast_bf16: '0'
  neuron_rt_stochastic_rounding_en: '0'
  parallel_layers_reduce_dtype: 'bf16'

Note

Only if the precision type is manual will master_weights, fp32_grad_acc, xla_use_bf16, xla_downcast_bf16, and neuron_rt_stochastic_rounding_en be picked up from the config. These parameters allow finer control of precision. It is recommended to use the mixed_precision config for better accuracy.

type

mixed_precision

The mixed_precision config uses the zero1 optimizer. It performs grad accumulation and grad collective communication in fp32, and keeps the master copy of the weights in fp32. It also sets the xla_downcast_bf16 environment variable to 1 and disables stochastic rounding.

mixed_precisionSR

mixed_precisionSR is a superset of the mixed_precision config with stochastic rounding enabled.

bf16SR

bf16SR config will perform all operations in bf16 and relies on the stochastic rounding feature for accuracy gains.

autocast

autocast config will follow the exact same precision strategy followed by torch.autocast.

Note

Autocast is not supported in this release.

manual

To gain control over the different precision knobs, one can set the precision type to manual and control parameters such as master_weights, fp32_grad_acc, xla_use_bf16, xla_downcast_bf16, and neuron_rt_stochastic_rounding_en.
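A minimal sketch of a manual precision configuration that keeps fp32 master weights and fp32 gradient accumulation while downcasting to bf16 (this particular combination is an illustrative assumption, not a recommended recipe):

precision:
  type: 'manual'
  master_weights: True
  fp32_grad_acc: True
  xla_use_bf16: '0'
  xla_downcast_bf16: '1'
  neuron_rt_stochastic_rounding_en: '0'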

parallel_layers_reduce_dtype

This config performs reduce collectives (all-reduce and reduce-scatter) within the parallel layers in the specified precision. If the fp32 precision type is used, the reduce dtype is implicitly set to fp32; otherwise it defaults to bf16 unless specified.

Model Alignment Specific#

You can configure fine-tuning (SFT) or model alignment (DPO/ORPO) through the YAML file, along with parameter-efficient fine-tuning using LoRA.

model_alignment_strategy:
  # DPO specific config
  dpo:
    kl_beta: 0.01
    loss_type: sigmoid
    max_prompt_length: 2048
    precompute_ref_log_probs: True
    truncation_mode: keep_start

  # Alternatively, you can also use SFT specific config
  sft:
    packing: True

  # Alternatively, can also use ORPO specific config
  orpo:
    beta: 0.01
    max_prompt_length: 2048
    truncation_mode: keep_start

  # Parameter-efficient finetuning - LoRA config
  peft:
    lora_rank: 16
    lora_alpha: 32
    lora_dropout: 0.05
    lora_bias: "none"
    lora_verbose: True
    target_modules: ["qkv_proj"]

model_alignment_strategy

Set this only when using fine-tuning-specific algorithms (SFT, DPO, ORPO) and their related hyperparameters.

The following are DPO-specific parameters.

dpo

kl_beta

KL-divergence beta, which controls the divergence of the policy model from the reference model

  • Type: float
  • Default: 0.01
  • Required: True

loss_type

Currently only the sigmoid version of the optimized DPO loss is supported

  • Type: str
  • Default: sigmoid
  • Required: True

max_prompt_length

Sets the maximum length of the prompt in the concatenated prompt and (chosen/rejected) response input

  • Type: integer
  • Required: True

precompute_ref_log_probs

Enables precomputation of reference model log probabilities using a pre-fit hook. False is not currently supported

  • Type: bool
  • Required: True

truncation_mode

Defines how to truncate when the size of (prompt + response) exceeds seq_length. Options: ["keep_start", "keep_end"]

  • Type: str
  • Default: keep_start
  • Required: True

SFT-specific parameters.

sft

packing

Packs multiple records into a single record up to the sequence length supported by the model; if False, pad tokens are used to reach the sequence length. Setting it to True increases throughput but might impact accuracy.

  • Type: bool
  • Default: False
  • Required: False

Odds Ratio Preference Optimization (ORPO) specific parameters.

orpo

beta

KL-divergence beta, which controls the divergence of the policy model from the reference model

  • Type: float
  • Default: 0.01
  • Required: True

max_prompt_length

Sets the maximum length of the prompt in the concatenated prompt and (chosen/rejected) response input

  • Type: integer
  • Required: True

truncation_mode

Defines how to truncate when the size of (prompt + response) exceeds seq_length. Options: ["keep_start", "keep_end"]

  • Type: str
  • Default: keep_start
  • Required: True

peft

Configuration options for Parameter-Efficient Fine-Tuning (PEFT) methods, specifically LoRA settings.

lora_rank

Rank of LoRA; determines the number of trainable parameters. A higher rank allows for more expressive adaptations but increases memory usage

  • Type: int
  • Default: 16
  • Required: True

lora_alpha

Scaling factor for LoRA updates; affects the magnitude of LoRA adaptations.

  • Type: int
  • Default: 32
  • Required: True

lora_dropout

Dropout rate for LoRA layers to prevent overfitting.

  • Type: float
  • Default: 0.05
  • Required: False

lora_bias

Bias type for LoRA. Determines which biases are trainable. Can be ‘none’, ‘all’ or ‘lora_only’

  • Type: str
  • Default: “none”
  • Required: False

lora_verbose

Enables detailed LoRA-related logging during training.

  • Type: bool
  • Default: False
  • Required: False

target_modules

List of model layers to apply LoRA.

  • Type: list[str]
  • Default: [“qkv_proj”] (for Llama)
  • Required: True

This document is relevant for: Trn1, Trn2