Trainer

The Trainer class provides an API for feature-complete training in PyTorch. It supports distributed training on multiple GPUs/TPUs and mixed precision on NVIDIA and AMD GPUs via torch.amp. Trainer goes hand-in-hand with the TrainingArguments class, which offers a wide range of options to customize how a model is trained. Together, these two classes provide a complete training API.
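For orientation, here is a minimal sketch of how the two classes are typically wired together. The checkpoint name is illustrative, and train_dataset / eval_dataset are assumed to be already-tokenized datasets you provide.

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Illustrative checkpoint; any sequence classification model works similarly.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

args = TrainingArguments(
    output_dir="out",                # where checkpoints and logs are written
    per_device_train_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",           # evaluate at the end of every epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # assumed: a tokenized torch / datasets.Dataset
    eval_dataset=eval_dataset,       # assumed: a tokenized torch / datasets.Dataset
    processing_class=tokenizer,      # saved alongside checkpoints
)
trainer.train()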

class transformers.Trainer

< source >

( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module, NoneType] = None args: TrainingArguments = None data_collator: typing.Optional[transformers.data.data_collator.DataCollator] = None train_dataset: typing.Union[torch.utils.data.dataset.Dataset, torch.utils.data.dataset.IterableDataset, ForwardRef('datasets.Dataset'), NoneType] = None eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, dict[str, torch.utils.data.dataset.Dataset], ForwardRef('datasets.Dataset'), NoneType] = None processing_class: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] = None model_init: typing.Optional[typing.Callable[[], transformers.modeling_utils.PreTrainedModel]] = None compute_loss_func: typing.Optional[typing.Callable] = None compute_metrics: typing.Optional[typing.Callable[[transformers.trainer_utils.EvalPrediction], dict]] = None callbacks: typing.Optional[list[transformers.trainer_callback.TrainerCallback]] = None optimizers: tuple = (None, None) optimizer_cls_and_kwargs: typing.Optional[tuple[type[torch.optim.optimizer.Optimizer], dict[str, typing.Any]]] = None preprocess_logits_for_metrics: typing.Optional[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None )

Parameters

Trainer is a simple but feature-complete training and eval loop for PyTorch, optimized for 🤗 Transformers.

Important attributes:

- model: Always points to the core model. If using a transformers model, it will be a PreTrainedModel subclass.
- model_wrapped: Always points to the most external model in case one or more other modules wrap the original model. This is the model that should be used for the forward pass.
- is_model_parallel: Whether or not the model has been switched to a model parallel mode (different from data parallelism, this means some of the model layers are split on different GPUs).
- place_model_on_device: Whether or not to automatically place the model on the device.
- is_in_train: Whether or not the model is currently running train (e.g. when evaluate is called during train).

add_callback

< source >

( callback )

Parameters

Add a callback to the current list of TrainerCallback.
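As an illustration, a small custom callback might be registered like this. The callback body is hypothetical, and trainer is assumed to be an existing Trainer instance.

from transformers import TrainerCallback

class LossPrinterCallback(TrainerCallback):
    """Hypothetical callback that prints the loss whenever the Trainer logs."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "loss" in logs:
            print(f"step {state.global_step}: loss = {logs['loss']:.4f}")

trainer.add_callback(LossPrinterCallback)       # passing the class: it is instantiated for you
# trainer.add_callback(LossPrinterCallback())   # or pass an instance directly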

autocast_smart_context_manager

< source >

( cache_enabled: typing.Optional[bool] = True )

A helper wrapper that creates an appropriate context manager for autocast while feeding it the desired arguments, depending on the situation.

compute_loss

< source >

( model inputs return_outputs = False num_items_in_batch = None )

How the loss is computed by Trainer. By default, all models return the loss in the first element.

Subclass and override for custom behavior.
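For example, a subclass might apply a class-weighted cross-entropy loss. This is only a sketch: the weights and the assumption that batches carry a "labels" key are specific to your task.

import torch
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.pop("labels")              # assumes the collator provides "labels"
        outputs = model(**inputs)
        logits = outputs.logits
        # Hypothetical weights for an imbalanced two-class problem.
        weights = torch.tensor([1.0, 3.0], device=logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss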

compute_loss_context_manager

A helper wrapper to group together context managers.

create_model_card

< source >

( language: typing.Optional[str] = None license: typing.Optional[str] = None tags: typing.Union[str, list[str], NoneType] = None model_name: typing.Optional[str] = None finetuned_from: typing.Optional[str] = None tasks: typing.Union[str, list[str], NoneType] = None dataset_tags: typing.Union[str, list[str], NoneType] = None dataset: typing.Union[str, list[str], NoneType] = None dataset_args: typing.Union[str, list[str], NoneType] = None )

Parameters

Creates a draft of a model card using the information available to the Trainer.

create_optimizer

Set up the optimizer.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or subclass and override this method.
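A sketch of the optimizers route, assuming model, args, and train_dataset are defined as in the earlier example:

import torch
from transformers import Trainer

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    optimizers=(optimizer, scheduler),   # (optimizer, lr_scheduler) pair used instead of the defaults
)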

create_optimizer_and_scheduler

< source >

( num_training_steps: int )

Set up the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the Trainer’s init through optimizers, or subclass and override this method (or create_optimizer and/or create_scheduler).

create_scheduler

< source >

( num_training_steps: int optimizer: Optimizer = None )

Parameters

Set up the scheduler. The optimizer of the trainer must have been set up either before this method is called or passed as an argument.

evaluate

< source >

( eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, dict[str, torch.utils.data.dataset.Dataset], NoneType] = None ignore_keys: typing.Optional[list[str]] = None metric_key_prefix: str = 'eval' )

Parameters

Run evaluation and return metrics.

The calling script will be responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument).

You can also subclass and override this method to inject custom behavior.
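A typical compute_metrics function receives an EvalPrediction and returns a dict of named metrics. A minimal sketch for classification, with the accuracy computed by hand to avoid extra dependencies:

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred.predictions holds the logits, eval_pred.label_ids the references.
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# trainer = Trainer(..., compute_metrics=compute_metrics)
# metrics = trainer.evaluate()   # metric keys come back prefixed, e.g. "eval_accuracy"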

evaluation_loop

< source >

( dataloader: DataLoader description: str prediction_loss_only: typing.Optional[bool] = None ignore_keys: typing.Optional[list[str]] = None metric_key_prefix: str = 'eval' )

Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with and without labels.

floating_point_ops

< source >

( inputs: dict ) → int

Parameters

Returns

int

The number of floating-point operations.

For models that inherit from PreTrainedModel, uses that method to compute the number of floating point operations for every backward + forward pass. If using another model, either implement such a method in the model or subclass and override this method.

get_decay_parameter_names

Get all parameter names that weight decay will be applied to.

This function filters out parameters in two ways:

  1. By layer type (instances of layers specified in ALL_LAYERNORM_LAYERS)
  2. By parameter name patterns (containing ‘bias’, ‘layernorm’, or ‘rmsnorm’)

get_eval_dataloader

< source >

( eval_dataset: typing.Union[str, torch.utils.data.dataset.Dataset, NoneType] = None )

Parameters

Returns the evaluation ~torch.utils.data.DataLoader.

Subclass and override this method if you want to inject some custom behavior.

get_learning_rates

Returns the learning rate of each parameter from self.optimizer.

get_num_trainable_parameters

Get the number of trainable parameters.

get_optimizer_cls_and_kwargs

< source >

( args: TrainingArguments model: typing.Optional[transformers.modeling_utils.PreTrainedModel] = None )

Parameters

Returns the optimizer class and optimizer parameters based on the training arguments.

get_optimizer_group

< source >

( param: typing.Union[str, torch.nn.parameter.Parameter, NoneType] = None )

Parameters

Returns the optimizer group for a parameter if one is given, otherwise returns all optimizer groups.

get_test_dataloader

< source >

( test_dataset: Dataset )

Parameters

Returns the test ~torch.utils.data.DataLoader.

Subclass and override this method if you want to inject some custom behavior.

get_train_dataloader

Returns the training ~torch.utils.data.DataLoader.

Will use no sampler if train_dataset does not implement __len__, a random sampler (adapted to distributed training if necessary) otherwise.

Subclass and override this method if you want to inject some custom behavior.

hyperparameter_search

< source >

( hp_space: typing.Optional[typing.Callable[[ForwardRef('optuna.Trial')], dict[str, float]]] = None compute_objective: typing.Optional[typing.Callable[[dict[str, float]], float]] = None n_trials: int = 20 direction: typing.Union[str, list[str]] = 'minimize' backend: typing.Union[ForwardRef('str'), transformers.trainer_utils.HPSearchBackend, NoneType] = None hp_name: typing.Optional[typing.Callable[[ForwardRef('optuna.Trial')], str]] = None **kwargs ) → [trainer_utils.BestRun or List[trainer_utils.BestRun]]

Parameters

Returns

[trainer_utils.BestRun or List[trainer_utils.BestRun]]

All the information about the best run or best runs for multi-objective optimization. Experiment summary can be found in run_summary attribute for Ray backend.

Launch a hyperparameter search using optuna, Ray Tune, or SigOpt. The optimized quantity is determined by compute_objective, which defaults to a function returning the evaluation loss when no metric is provided, and the sum of all metrics otherwise.

To use this method, you need to have provided a model_init when initializing your Trainer: we need to reinitialize the model at each new run. This is incompatible with the optimizers argument, so you need to subclass Trainer and override the method create_optimizer_and_scheduler() for custom optimizer/scheduler.
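A sketch of a search with the optuna backend (optuna must be installed; the search space, checkpoint, and datasets are illustrative):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init(trial=None):
    # Rebuilt at the start of every trial, as required by hyperparameter_search.
    return AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
    }

trainer = Trainer(
    model=None,
    model_init=model_init,
    args=TrainingArguments("hp_search_out", eval_strategy="epoch"),
    train_dataset=train_dataset,    # assumed: tokenized dataset
    eval_dataset=eval_dataset,      # assumed: tokenized dataset
)

best_run = trainer.hyperparameter_search(
    hp_space=optuna_hp_space,
    backend="optuna",
    n_trials=20,
    direction="minimize",           # minimize the evaluation loss (the default objective)
)
print(best_run.hyperparameters)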

init_hf_repo

< source >

( token: typing.Optional[str] = None )

Initializes a git repo in self.args.hub_model_id.

is_local_process_zero

Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on several machines) main process.

is_world_process_zero

Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is only going to be True for one process).

log

< source >

( logs: dict start_time: typing.Optional[float] = None )

Parameters

Log logs on the various objects watching training.

Subclass and override this method to inject custom behavior.

log_metrics

< source >

( split metrics )

Parameters

Log metrics in a specially formatted way.

Under distributed environment this is done only for a process with rank 0.

Notes on memory reports:

In order to get memory usage report you need to install psutil. You can do that with pip install psutil.

Now when this method is run, you will see a report that will include:

init_mem_cpu_alloc_delta = 1301MB
init_mem_cpu_peaked_delta = 154MB
init_mem_gpu_alloc_delta = 230MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 1345MB
train_mem_cpu_peaked_delta = 0MB
train_mem_gpu_alloc_delta = 693MB
train_mem_gpu_peaked_delta = 7MB

Understanding the reports:

The reporting happens only for the process of rank 0 and GPU 0 (if there is a GPU). Typically this is enough since the main process does the bulk of the work, but it may not be if model parallelism is used, in which case other GPUs may use a different amount of GPU memory. This is also not the same under DataParallel, where gpu0 may require much more memory than the rest since it stores the gradient and optimizer states for all participating GPUs. Perhaps in the future these reports will evolve to measure those too.

The CPU RAM metric measures RSS (Resident Set Size), which includes both the memory that is unique to the process and the memory shared with other processes. It is important to note that it does not include swapped-out memory, so the reports could be imprecise.

The CPU peak memory is measured using a sampling thread. Due to python’s GIL it may miss some of the peak memory if that thread didn’t get a chance to run when the highest memory was used. Therefore this report can be less than reality. Using tracemalloc would have reported the exact peak memory, but it doesn’t report memory allocations outside of python. So if some C++ CUDA extension allocated its own memory it won’t be reported. And therefore it was dropped in favor of the memory sampling approach, which reads the current process memory usage.

The GPU allocated and peak memory reporting is done with torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated(). This metric reports only “deltas” for pytorch-specific allocations, as the torch.cuda memory management system doesn’t track any memory allocated outside of pytorch. For example, the very first cuda call typically loads CUDA kernels, which may take from 0.5 to 2GB of GPU memory.

Note that this tracker doesn’t account for memory allocations outside of Trainer’s __init__, train, evaluate and predict calls.

Because evaluation calls may happen during train, we can’t handle nested invocations: torch.cuda.max_memory_allocated is a single counter, so if it gets reset by a nested eval call, train’s tracker will report incorrect info. If this pytorch issue gets resolved it will be possible to change this class to be re-entrant. Until then we will only track the outer level of the train, evaluate and predict methods. This means that if evaluate is called during train, train’s tracker will account for both its own memory usage and that of the nested evaluation.

This also means that if any other tool used alongside the Trainer calls torch.cuda.reset_peak_memory_stats, the GPU peak memory stats could be invalid. And the Trainer will disrupt the normal behavior of any such tools that rely on calling torch.cuda.reset_peak_memory_stats themselves.

For best performance you may want to consider turning the memory profiling off for production runs.

metrics_format

< source >

( metrics: dict ) → metrics (Dict[str, float])

Parameters

Returns

metrics (Dict[str, float])

The reformatted metrics

Reformat Trainer metrics values to a human-readable format.

num_examples

Helper to get the number of samples in a ~torch.utils.data.DataLoader by accessing its dataset. When dataloader.dataset does not exist or has no length, estimates as best it can.

num_tokens

< source >

( train_dl: DataLoader max_steps: typing.Optional[int] = None )

Helper to get number of tokens in a ~torch.utils.data.DataLoader by enumerating dataloader.

pop_callback

< source >

( callback ) → TrainerCallback

Parameters

Returns

TrainerCallback

The callback removed, if found.

Remove a callback from the current list of TrainerCallback and return it.

If the callback is not found, returns None (and no error is raised).

predict

< source >

( test_dataset: Dataset ignore_keys: typing.Optional[list[str]] = None metric_key_prefix: str = 'test' )

Parameters

Run prediction and return predictions and potential metrics.

Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

If your predictions or labels have different sequence length (for instance because you’re doing dynamic padding in a token classification task) the predictions will be padded (on the right) to allow for concatenation into one array. The padding index is -100.

Returns: A NamedTuple with the following keys:

- predictions (np.ndarray): The predictions on test_dataset.
- label_ids (np.ndarray, optional): The labels (if the dataset contained some).
- metrics (Dict[str, float], optional): The potential dictionary of metrics (if the dataset contained labels).
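Typical usage, assuming the Trainer was created with a compute_metrics function and test_dataset is a tokenized dataset:

output = trainer.predict(test_dataset)

print(output.metrics)                       # e.g. {"test_loss": ..., "test_accuracy": ...}
preds = output.predictions.argmax(-1)       # logits -> predicted class ids for a classification head
labels = output.label_ids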

prediction_loop

< source >

( dataloader: DataLoader description: str prediction_loss_only: typing.Optional[bool] = None ignore_keys: typing.Optional[list[str]] = None metric_key_prefix: str = 'eval' )

Prediction/evaluation loop, shared by Trainer.evaluate() and Trainer.predict().

Works both with and without labels.

prediction_step

< source >

( model: Module inputs: dict prediction_loss_only: bool ignore_keys: typing.Optional[list[str]] = None ) → Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]

Parameters

Returns

Tuple[Optional[torch.Tensor], Optional[torch.Tensor], Optional[torch.Tensor]]

A tuple with the loss, logits and labels (each being optional).

Perform an evaluation step on model using inputs.

Subclass and override to inject custom behavior.

propagate_args_to_deepspeed

< source >

( auto_find_batch_size = False )

Sets values in the DeepSpeed plugin based on the Trainer args.

push_to_hub

< source >

( commit_message: typing.Optional[str] = 'End of training' blocking: bool = True token: typing.Optional[str] = None revision: typing.Optional[str] = None **kwargs )

Parameters

Upload self.model and self.processing_class to the 🤗 model hub on the repo self.args.hub_model_id.

remove_callback

< source >

( callback )

Parameters

Remove a callback from the current list of TrainerCallback.

save_metrics

< source >

( split metrics combined = True )

Parameters

Save metrics into a json file for that split, e.g. train_results.json.

Under distributed environment this is done only for a process with rank 0.

To understand the metrics please read the docstring of log_metrics(). The only difference is that raw unformatted numbers are saved in the current method.
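The two methods are usually called together at the end of a run, as in the example scripts:

train_result = trainer.train()
metrics = train_result.metrics

trainer.log_metrics("train", metrics)    # formatted console report
trainer.save_metrics("train", metrics)   # writes train_results.json (and all_results.json when combined=True)
trainer.save_state()                     # writes trainer_state.json next to the checkpoints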

save_model

< source >

( output_dir: typing.Optional[str] = None _internal_call: bool = False )

Will save the model, so you can reload it using from_pretrained().

Will only save from the main process.

save_state

Saves the Trainer state, since Trainer.save_model saves only the tokenizer with the model.

Under distributed environment this is done only for a process with rank 0.

set_initial_training_values

< source >

( args: TrainingArguments dataloader: DataLoader total_train_batch_size: int )

Calculates and returns the core training-loop quantities (such as the total number of training steps, epochs, and samples) derived from the training arguments and the dataloader.

train

< source >

( resume_from_checkpoint: typing.Union[bool, str, NoneType] = None trial: typing.Union[ForwardRef('optuna.Trial'), dict[str, typing.Any], NoneType] = None ignore_keys_for_eval: typing.Optional[list[str]] = None **kwargs )

Parameters

Main training entry point.
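Resuming is driven by resume_from_checkpoint; the checkpoint path below is a placeholder:

trainer.train()                                             # train from scratch
trainer.train(resume_from_checkpoint=True)                  # resume from the latest checkpoint in args.output_dir
trainer.train(resume_from_checkpoint="out/checkpoint-500")  # resume from an explicit checkpoint directory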

training_step

< source >

( model: Module inputs: dict num_items_in_batch = None ) → torch.Tensor

Parameters

Returns

torch.Tensor

The tensor with training loss on this batch.

Perform a training step on a batch of inputs.

Subclass and override to inject custom behavior.

class transformers.Seq2SeqTrainer

< source >

( model: typing.Union[ForwardRef('PreTrainedModel'), torch.nn.modules.module.Module] = None args: TrainingArguments = None data_collator: typing.Optional[ForwardRef('DataCollator')] = None train_dataset: typing.Union[torch.utils.data.dataset.Dataset, ForwardRef('IterableDataset'), ForwardRef('datasets.Dataset'), NoneType] = None eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, dict[str, torch.utils.data.dataset.Dataset], NoneType] = None processing_class: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('BaseImageProcessor'), ForwardRef('FeatureExtractionMixin'), ForwardRef('ProcessorMixin'), NoneType] = None model_init: typing.Optional[typing.Callable[[], ForwardRef('PreTrainedModel')]] = None compute_loss_func: typing.Optional[typing.Callable] = None compute_metrics: typing.Optional[typing.Callable[[ForwardRef('EvalPrediction')], dict]] = None callbacks: typing.Optional[list['TrainerCallback']] = None optimizers: tuple = (None, None) preprocess_logits_for_metrics: typing.Optional[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = None )

evaluate

< source >

( eval_dataset: typing.Optional[torch.utils.data.dataset.Dataset] = None ignore_keys: typing.Optional[list[str]] = None metric_key_prefix: str = 'eval' **gen_kwargs )

Parameters

Run evaluation and return metrics.

The calling script will be responsible for providing a method to compute metrics, as they are task-dependent (pass it to the init compute_metrics argument).

You can also subclass and override this method to inject custom behavior.

predict

< source >

( test_dataset: Dataset ignore_keys: typing.Optional[list[str]] = None metric_key_prefix: str = 'test' **gen_kwargs )

Parameters

Run prediction and return predictions and potential metrics.

Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

If your predictions or labels have different sequence lengths (for instance because you’re doing dynamic padding in a token classification task) the predictions will be padded (on the right) to allow for concatenation into one array. The padding index is -100.

Returns: A NamedTuple with the following keys:

- predictions (np.ndarray): The predictions on test_dataset.
- label_ids (np.ndarray, optional): The labels (if the dataset contained some).
- metrics (Dict[str, float], optional): The potential dictionary of metrics (if the dataset contained labels).
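Generation arguments are forwarded through **gen_kwargs. A sketch, assuming a Seq2SeqTrainer built with Seq2SeqTrainingArguments(predict_with_generate=True) and a tokenized test_dataset:

metrics = trainer.evaluate(max_length=128, num_beams=4, metric_key_prefix="eval")
output = trainer.predict(test_dataset, max_length=128, num_beams=4)
generated_ids = output.predictions            # token ids produced by generate()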

class transformers.TrainingArguments

< source >

( output_dir: typing.Optional[str] = None overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: typing.Optional[float] = 0 torch_empty_cache_steps: typing.Optional[int] = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' lr_scheduler_kwargs: typing.Union[dict, str, NoneType] = warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: typing.Optional[str] = 'passive' log_level_replica: typing.Optional[str] = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps' save_steps: float = 500 save_total_limit: typing.Optional[int] = None save_safetensors: typing.Optional[bool] = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: typing.Optional[bool] = None local_rank: int = -1 ddp_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: typing.Union[str, list[transformers.debug_utils.DebugOption]] = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[float] = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: typing.Optional[int] = None past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = True label_names: typing.Optional[list[str]] = None load_best_model_at_end: typing.Optional[bool] = False metric_for_best_model: typing.Optional[str] = None greater_is_better: typing.Optional[bool] = None ignore_data_skip: bool = False fsdp: typing.Union[list[transformers.trainer_utils.FSDPOption], str, NoneType] = '' fsdp_min_num_params: int = 0 fsdp_config: typing.Union[dict, str, NoneType] = None tp_size: typing.Optional[int] = 0 fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None accelerator_config: typing.Union[dict, str, NoneType] = None deepspeed: typing.Union[dict, str, NoneType] = None label_smoothing_factor: float = 0.0 optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch' optim_args: typing.Optional[str] = None adafactor: bool = False group_by_length: bool = False length_column_name: typing.Optional[str] = 'length' report_to: typing.Union[NoneType, str, list[str]] = None 
ddp_find_unused_parameters: typing.Optional[bool] = None ddp_bucket_cap_mb: typing.Optional[int] = None ddp_broadcast_buffers: typing.Optional[bool] = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: typing.Optional[str] = None hub_model_id: typing.Optional[str] = None hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save' hub_token: typing.Optional[str] = None hub_private_repo: typing.Optional[bool] = None hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: typing.Union[dict, str, NoneType] = None include_inputs_for_metrics: bool = False include_for_metrics: list = eval_do_concat_batches: bool = True fp16_backend: str = 'auto' push_to_hub_model_id: typing.Optional[str] = None push_to_hub_organization: typing.Optional[str] = None push_to_hub_token: typing.Optional[str] = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: typing.Optional[str] = None ray_scope: typing.Optional[str] = 'last' ddp_timeout: typing.Optional[int] = 1800 torch_compile: bool = False torch_compile_backend: typing.Optional[str] = None torch_compile_mode: typing.Optional[str] = None include_tokens_per_second: typing.Optional[bool] = False include_num_input_tokens_seen: typing.Optional[bool] = False neftune_noise_alpha: typing.Optional[float] = None optim_target_modules: typing.Union[NoneType, str, list[str]] = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: typing.Optional[bool] = False eval_use_gather_object: typing.Optional[bool] = False average_tokens_across_devices: typing.Optional[bool] = False )

Parameters

TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.
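For instance, a small launcher script could parse TrainingArguments straight from the command line (a sketch; the script name is a placeholder):

from transformers import HfArgumentParser, TrainingArguments

parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

# python train.py --output_dir out --per_device_train_batch_size 16 --num_train_epochs 3
print(training_args.output_dir, training_args.num_train_epochs)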

get_process_log_level

Returns the log level to be used depending on whether this process is the main process of node 0, main process of a node other than 0, or a non-main process.

For the main process the log level defaults to the logging level set (logging.WARNING if you didn’t do anything) unless overridden by the log_level argument.

For the replica processes the log level defaults to logging.WARNING unless overridden by the log_level_replica argument.

The choice between the main and replica process settings is made according to the return value of should_log.

get_warmup_steps

< source >

( num_training_steps: int )

Get number of steps used for a linear warmup.

main_process_first

< source >

( local = True desc = 'work' )

Parameters

A context manager for a torch distributed environment where one needs to do something on the main process while blocking the replicas, releasing the replicas once it is finished.

One such use is the map feature of the datasets library which, to be efficient, should be run once on the main process; upon completion it saves a cached version of the results, which is then automatically loaded by the replicas.
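A sketch of that datasets use case; raw_dataset and preprocess are assumed to exist:

# Run the cached preprocessing once on the main process; replicas wait,
# then load the cached result when the context manager exits.
with training_args.main_process_first(desc="dataset map pre-processing"):
    tokenized_dataset = raw_dataset.map(preprocess, batched=True)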

set_dataloader

< source >

( train_batch_size: int = 8 eval_batch_size: int = 8 drop_last: bool = False num_workers: int = 0 pin_memory: bool = True persistent_workers: bool = False prefetch_factor: typing.Optional[int] = None auto_find_batch_size: bool = False ignore_data_skip: bool = False sampler_seed: typing.Optional[int] = None )

Parameters

A method that regroups all arguments linked to dataloader creation.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_dataloader(train_batch_size=16, eval_batch_size=64) args.per_device_train_batch_size 16

set_evaluate

< source >

( strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'no' steps: int = 500 batch_size: int = 8 accumulation_steps: typing.Optional[int] = None delay: typing.Optional[float] = None loss_only: bool = False jit_mode: bool = False )

Parameters

A method that regroups all arguments linked to evaluation.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_evaluate(strategy="steps", steps=100) args.eval_steps 100

set_logging

< source >

( strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'steps' steps: int = 500 report_to: typing.Union[str, list[str]] = 'none' level: str = 'passive' first_step: bool = False nan_inf_filter: bool = False on_each_node: bool = False replica_level: str = 'passive' )

Parameters

A method that regroups all arguments linked to logging.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_logging(strategy="steps", steps=100) args.logging_steps 100

set_lr_scheduler

< source >

( name: typing.Union[str, transformers.trainer_utils.SchedulerType] = 'linear' num_epochs: float = 3.0 max_steps: int = -1 warmup_ratio: float = 0 warmup_steps: int = 0 )

Parameters

A method that regroups all arguments linked to the learning rate scheduler and its hyperparameters.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_lr_scheduler(name="cosine", warmup_ratio=0.05) args.warmup_ratio 0.05

set_optimizer

< source >

( name: typing.Union[str, transformers.training_args.OptimizerNames] = 'adamw_torch' learning_rate: float = 5e-05 weight_decay: float = 0 beta1: float = 0.9 beta2: float = 0.999 epsilon: float = 1e-08 args: typing.Optional[str] = None )

Parameters

A method that regroups all arguments linked to the optimizer and its hyperparameters.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_optimizer(name="adamw_torch", beta1=0.8) args.optim 'adamw_torch'

set_push_to_hub

< source >

( model_id: str strategy: typing.Union[str, transformers.trainer_utils.HubStrategy] = 'every_save' token: typing.Optional[str] = None private_repo: typing.Optional[bool] = None always_push: bool = False )

Parameters

A method that regroups all arguments linked to synchronizing checkpoints with the Hub.

Calling this method will set self.push_to_hub to True, which means the output_dir will become a git directory synced with the repo (determined by model_id), and the content will be pushed each time a save is triggered (depending on your self.save_strategy). Calling save_model() will also trigger a push.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_push_to_hub("me/awesome-model") args.hub_model_id 'me/awesome-model'

set_save

< source >

( strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'steps' steps: int = 500 total_limit: typing.Optional[int] = None on_each_node: bool = False )

Parameters

A method that regroups all arguments linked to checkpoint saving.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_save(strategy="steps", steps=100) args.save_steps 100

set_testing

< source >

( batch_size: int = 8 loss_only: bool = False jit_mode: bool = False )

Parameters

A method that regroups all basic arguments linked to testing on a held-out dataset.

Calling this method will automatically set self.do_predict to True.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_testing(batch_size=32) args.per_device_eval_batch_size 32

set_training

< source >

( learning_rate: float = 5e-05 batch_size: int = 8 weight_decay: float = 0 num_epochs: float = 3 max_steps: int = -1 gradient_accumulation_steps: int = 1 seed: int = 42 gradient_checkpointing: bool = False )

Parameters

A method that regroups all basic arguments linked to the training.

Calling this method will automatically set self.do_train to True.

Example:

from transformers import TrainingArguments

args = TrainingArguments("working_dir") args = args.set_training(learning_rate=1e-4, batch_size=32) args.learning_rate 1e-4

to_dict

Serializes this instance while replacing Enum by their values (for JSON serialization support). It obfuscates the token values by removing their value.

to_json_string

Serializes this instance to a JSON string.

to_sanitized_dict

Sanitized serialization to use with TensorBoard’s hparams.

class transformers.Seq2SeqTrainingArguments

< source >

( output_dir: typing.Optional[str] = None overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: typing.Optional[float] = 0 torch_empty_cache_steps: typing.Optional[int] = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' lr_scheduler_kwargs: typing.Union[dict, str, NoneType] = warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: typing.Optional[str] = 'passive' log_level_replica: typing.Optional[str] = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: float = 500 logging_nan_inf_filter: bool = True save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps' save_steps: float = 500 save_total_limit: typing.Optional[int] = None save_safetensors: typing.Optional[bool] = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False use_ipex: bool = False bf16: bool = False fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: typing.Optional[bool] = None local_rank: int = -1 ddp_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: typing.Union[str, list[transformers.debug_utils.DebugOption]] = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[float] = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: typing.Optional[int] = None past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = True label_names: typing.Optional[list[str]] = None load_best_model_at_end: typing.Optional[bool] = False metric_for_best_model: typing.Optional[str] = None greater_is_better: typing.Optional[bool] = None ignore_data_skip: bool = False fsdp: typing.Union[list[transformers.trainer_utils.FSDPOption], str, NoneType] = '' fsdp_min_num_params: int = 0 fsdp_config: typing.Union[dict, str, NoneType] = None tp_size: typing.Optional[int] = 0 fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None accelerator_config: typing.Union[dict, str, NoneType] = None deepspeed: typing.Union[dict, str, NoneType] = None label_smoothing_factor: float = 0.0 optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch' optim_args: typing.Optional[str] = None adafactor: bool = False group_by_length: bool = False length_column_name: typing.Optional[str] = 'length' report_to: typing.Union[NoneType, str, list[str]] = None 
ddp_find_unused_parameters: typing.Optional[bool] = None ddp_bucket_cap_mb: typing.Optional[int] = None ddp_broadcast_buffers: typing.Optional[bool] = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: typing.Optional[str] = None hub_model_id: typing.Optional[str] = None hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save' hub_token: typing.Optional[str] = None hub_private_repo: typing.Optional[bool] = None hub_always_push: bool = False gradient_checkpointing: bool = False gradient_checkpointing_kwargs: typing.Union[dict, str, NoneType] = None include_inputs_for_metrics: bool = False include_for_metrics: list = eval_do_concat_batches: bool = True fp16_backend: str = 'auto' push_to_hub_model_id: typing.Optional[str] = None push_to_hub_organization: typing.Optional[str] = None push_to_hub_token: typing.Optional[str] = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: typing.Optional[str] = None ray_scope: typing.Optional[str] = 'last' ddp_timeout: typing.Optional[int] = 1800 torch_compile: bool = False torch_compile_backend: typing.Optional[str] = None torch_compile_mode: typing.Optional[str] = None include_tokens_per_second: typing.Optional[bool] = False include_num_input_tokens_seen: typing.Optional[bool] = False neftune_noise_alpha: typing.Optional[float] = None optim_target_modules: typing.Union[NoneType, str, list[str]] = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: typing.Optional[bool] = False eval_use_gather_object: typing.Optional[bool] = False average_tokens_across_devices: typing.Optional[bool] = False sortish_sampler: bool = False predict_with_generate: bool = False generation_max_length: typing.Optional[int] = None generation_num_beams: typing.Optional[int] = None generation_config: typing.Union[str, pathlib.Path, transformers.generation.configuration_utils.GenerationConfig, NoneType] = None )

Parameters

TrainingArguments is the subset of the arguments we use in our example scripts which relate to the training loop itself.

Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.

to_dict

Serializes this instance while replacing Enum by their values and GenerationConfig by dictionaries (for JSON serialization support). It obfuscates the token values by removing their value.