Training Arguments — Sentence Transformers documentation

SentenceTransformerTrainingArguments

class sentence_transformers.training_args.SentenceTransformerTrainingArguments(output_dir: str | None = None, overwrite_output_dir: bool = False, do_train: bool = False, do_eval: bool = False, do_predict: bool = False, eval_strategy: ~transformers.trainer_utils.IntervalStrategy | str = 'no', prediction_loss_only: bool = False, per_device_train_batch_size: int = 8, per_device_eval_batch_size: int = 8, per_gpu_train_batch_size: int | None = None, per_gpu_eval_batch_size: int | None = None, gradient_accumulation_steps: int = 1, eval_accumulation_steps: int | None = None, eval_delay: float | None = 0, torch_empty_cache_steps: int | None = None, learning_rate: float = 5e-05, weight_decay: float = 0.0, adam_beta1: float = 0.9, adam_beta2: float = 0.999, adam_epsilon: float = 1e-08, max_grad_norm: float = 1.0, num_train_epochs: float = 3.0, max_steps: int = -1, lr_scheduler_type: ~transformers.trainer_utils.SchedulerType | str = 'linear', lr_scheduler_kwargs: dict | str | None = , warmup_ratio: float | None = , warmup_steps: int = 0, log_level: str | None = 'passive', log_level_replica: str | None = 'warning', log_on_each_node: bool = True, logging_dir: str | None = None, logging_strategy: ~transformers.trainer_utils.IntervalStrategy | str = 'steps', logging_first_step: bool = False, logging_steps: float = 500, logging_nan_inf_filter: bool = True, save_strategy: ~transformers.trainer_utils.SaveStrategy | str = 'steps', save_steps: float = 500, save_total_limit: int | None = None, save_safetensors: bool | None = True, save_on_each_node: bool = False, save_only_model: bool = False, restore_callback_states_from_checkpoint: bool = False, no_cuda: bool = False, use_cpu: bool = False, use_mps_device: bool = False, seed: int = 42, data_seed: int | None = None, jit_mode_eval: bool = False, use_ipex: bool = False, bf16: bool = False, fp16: bool = False, fp16_opt_level: str = 'O1', half_precision_backend: str = 'auto', bf16_full_eval: bool = False, fp16_full_eval: bool = False, 
tf32: bool | None = None, local_rank: int = -1, ddp_backend: str | None = None, tpu_num_cores: int | None = None, tpu_metrics_debug: bool = False, debug: str | list[~transformers.debug_utils.DebugOption] = '', dataloader_drop_last: bool = False, eval_steps: float | None = None, dataloader_num_workers: int = 0, dataloader_prefetch_factor: int | None = None, past_index: int = -1, run_name: str | None = None, disable_tqdm: bool | None = None, remove_unused_columns: bool | None = True, label_names: list[str] | None = None, load_best_model_at_end: bool | None = False, metric_for_best_model: str | None = None, greater_is_better: bool | None = None, ignore_data_skip: bool = False, fsdp: list[~transformers.trainer_utils.FSDPOption] | str | None = '', fsdp_min_num_params: int = 0, fsdp_config: dict | str | None = None, tp_size: int | None = 0, fsdp_transformer_layer_cls_to_wrap: str | None = None, accelerator_config: dict | str | None = None, deepspeed: dict | str | None = None, label_smoothing_factor: float = 0.0, optim: ~transformers.training_args.OptimizerNames | str = 'adamw_torch', optim_args: str | None = None, adafactor: bool = False, group_by_length: bool = False, length_column_name: str | None = 'length', report_to: str | list[str] | None = None, ddp_find_unused_parameters: bool | None = None, ddp_bucket_cap_mb: int | None = None, ddp_broadcast_buffers: bool | None = None, dataloader_pin_memory: bool = True, dataloader_persistent_workers: bool = False, skip_memory_metrics: bool = True, use_legacy_prediction_loop: bool = False, push_to_hub: bool = False, resume_from_checkpoint: str | None = None, hub_model_id: str | None = None, hub_strategy: ~transformers.trainer_utils.HubStrategy | str = 'every_save', hub_token: str | None = None, hub_private_repo: bool | None = None, hub_always_push: bool = False, gradient_checkpointing: bool = False, gradient_checkpointing_kwargs: dict | str | None = None, include_inputs_for_metrics: bool = False, include_for_metrics: list[str] 
= , eval_do_concat_batches: bool = True, fp16_backend: str = 'auto', push_to_hub_model_id: str | None = None, push_to_hub_organization: str | None = None, push_to_hub_token: str | None = None, mp_parameters: str = '', auto_find_batch_size: bool = False, full_determinism: bool = False, torchdynamo: str | None = None, ray_scope: str | None = 'last', ddp_timeout: int | None = 1800, torch_compile: bool = False, torch_compile_backend: str | None = None, torch_compile_mode: str | None = None, include_tokens_per_second: bool | None = False, include_num_input_tokens_seen: bool | None = False, neftune_noise_alpha: float | None = None, optim_target_modules: str | list[str] | None = None, batch_eval_metrics: bool = False, eval_on_start: bool = False, use_liger_kernel: bool | None = False, eval_use_gather_object: bool | None = False, average_tokens_across_devices: bool | None = False, prompts: dict[str, dict[str, str]] | dict[str, str] | str | None = None, batch_sampler: ~sentence_transformers.training_args.BatchSamplers | str | ~sentence_transformers.sampler.DefaultBatchSampler | ~collections.abc.Callable[[...], ~sentence_transformers.sampler.DefaultBatchSampler] = BatchSamplers.BATCH_SAMPLER, multi_dataset_batch_sampler: ~sentence_transformers.training_args.MultiDatasetBatchSamplers | str | ~sentence_transformers.sampler.MultiDatasetDefaultBatchSampler | ~collections.abc.Callable[[...], ~sentence_transformers.sampler.MultiDatasetDefaultBatchSampler] = MultiDatasetBatchSamplers.PROPORTIONAL, router_mapping: str | None | dict[str, str] | dict[str, dict[str, str]] = , learning_rate_mapping: str | None | dict[str, float] = )[source]

SentenceTransformerTrainingArguments extends TrainingArguments with additional arguments specific to Sentence Transformers. See TrainingArguments for the complete list of available arguments.

Parameters:

property ddp_timeout_delta: timedelta

The actual timeout for torch.distributed.init_process_group since it expects a timedelta variable.

property device: device

The device used by this process.

property eval_batch_size: int

The actual batch size for evaluation (may differ from per_gpu_eval_batch_size in distributed training).
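The relationship between the per-device setting and this property can be sketched as follows (an illustrative stand-alone function, not the library's code):

```python
def effective_eval_batch_size(per_device_eval_batch_size: int, n_gpu: int) -> int:
    # Under DataParallel, one process drives all visible GPUs, so the
    # per-process batch is the per-device batch scaled by the GPU count.
    # Under distributed training each process sees n_gpu == 1, and on
    # CPU-only setups n_gpu == 0, so no scaling happens.
    return per_device_eval_batch_size * max(1, n_gpu)

print(effective_eval_batch_size(8, 4))  # 32 with 4 GPUs under DataParallel
print(effective_eval_batch_size(8, 0))  # 8 on CPU
```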

get_process_log_level()

Returns the log level to be used depending on whether this process is the main process of node 0, main process of node non-0, or a non-main process.

For the main process the log level defaults to the logging level set (logging.WARNING if you didn’t do anything) unless overridden by the log_level argument.

For the replica processes the log level defaults to logging.WARNING unless overridden by the log_level_replica argument.

The choice between the main and replica process settings is made according to the return value of should_log.

get_warmup_steps(num_training_steps: int)

Get number of steps used for a linear warmup.
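The resolution order can be sketched as follows: an explicit warmup_steps takes precedence, otherwise warmup_ratio is applied to the total step count (an illustrative re-implementation, not the library's code):

```python
import math

def get_warmup_steps(num_training_steps: int,
                     warmup_steps: int = 0,
                     warmup_ratio: float = 0.0) -> int:
    # An explicit warmup_steps wins; otherwise the ratio of the total
    # training steps is used, rounded up to a whole step.
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(num_training_steps * warmup_ratio)

print(get_warmup_steps(10_000, warmup_ratio=0.1))   # 1000
print(get_warmup_steps(10_000, warmup_steps=250))   # 250
```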

property local_process_index

The index of the local process used.

main_process_first(local=True, desc='work')

A context manager for torch distributed environments where one needs to do something on the main process while blocking the replicas, releasing them once it is finished.

One such use is the datasets map feature, which, to be efficient, should be run once on the main process; upon completion it saves a cached version of the results, which the replicas then automatically load.

Parameters:
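The barrier pattern behind this context manager can be sketched with plain threads standing in for distributed ranks (a hypothetical helper, not the library's implementation):

```python
import threading
from contextlib import contextmanager

@contextmanager
def main_process_first(is_main: bool, barrier: threading.Barrier):
    # Replicas block on the barrier until the main process has finished
    # the body; the main process hits the same barrier afterwards,
    # releasing everyone together.
    try:
        if not is_main:
            barrier.wait()
        yield
    finally:
        if is_main:
            barrier.wait()

# Demo: two threads stand in for two ranks; rank 0 always works first.
order = []
barrier = threading.Barrier(2)

def worker(rank: int) -> None:
    with main_process_first(is_main=(rank == 0), barrier=barrier):
        order.append(rank)

threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(order)  # [0, 1]
```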

property n_gpu

The number of GPUs used by this process.

Note

This will only be greater than one when you have multiple GPUs available but are not using distributed training. For distributed training, it will always be 1.

property parallel_mode

The current mode used for parallelism if multiple GPUs/TPU cores are available. One of:

property place_model_on_device

Can be subclassed and overridden for some specific integrations.

property process_index

The index of the current process used.

set_dataloader(train_batch_size: int = 8, eval_batch_size: int = 8, drop_last: bool = False, num_workers: int = 0, pin_memory: bool = True, persistent_workers: bool = False, prefetch_factor: int | None = None, auto_find_batch_size: bool = False, ignore_data_skip: bool = False, sampler_seed: int | None = None)

A method that regroups all arguments linked to the dataloaders creation.

Parameters:

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_dataloader(train_batch_size=16, eval_batch_size=64)
>>> args.per_device_train_batch_size
16
```


set_evaluate(strategy: str | IntervalStrategy = 'no', steps: int = 500, batch_size: int = 8, accumulation_steps: int | None = None, delay: float | None = None, loss_only: bool = False, jit_mode: bool = False)

A method that regroups all arguments linked to evaluation.

Parameters:

* **strategy** (str or [~trainer_utils.IntervalStrategy], _optional_, defaults to "no") – The evaluation strategy to adopt during training. Possible values are:
  * "no": No evaluation is done during training.
  * "steps": Evaluation is done (and logged) every steps.
  * "epoch": Evaluation is done at the end of each epoch.

  Setting a strategy different from "no" will set self.do_eval to True.
* **steps** (int, _optional_, defaults to 500) – Number of update steps between two evaluations if strategy="steps".
* **batch_size** (int, _optional_, defaults to 8) – The batch size per device (GPU/TPU core/CPU…) used for evaluation.
* **accumulation_steps** (int, _optional_) – Number of prediction steps to accumulate the output tensors for before moving the results to the CPU. If left unset, the whole predictions are accumulated on the GPU/TPU before being moved to the CPU (faster but requires more memory).
* **delay** (float, _optional_) – Number of epochs or steps to wait before the first evaluation can be performed, depending on the eval_strategy.
* **loss_only** (bool, _optional_, defaults to False) – Ignores all outputs except the loss.
* **jit_mode** (bool, _optional_) – Whether or not to use PyTorch JIT trace for inference.

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_evaluate(strategy="steps", steps=100)
>>> args.eval_steps
100
```

set_logging(strategy: str | IntervalStrategy = 'steps', steps: int = 500, report_to: str | list[str] = 'none', level: str = 'passive', first_step: bool = False, nan_inf_filter: bool = False, on_each_node: bool = False, replica_level: str = 'passive')

A method that regroups all arguments linked to logging.

Parameters:

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_logging(strategy="steps", steps=100)
>>> args.logging_steps
100
```


set_lr_scheduler(name: str | SchedulerType = 'linear', num_epochs: float = 3.0, max_steps: int = -1, warmup_ratio: float = 0, warmup_steps: int = 0)

A method that regroups all arguments linked to the learning rate scheduler and its hyperparameters.

Parameters:

* **name** (str or [SchedulerType], _optional_, defaults to "linear") – The scheduler type to use. See the documentation of [SchedulerType] for all possible values.
* **num_epochs** (float, _optional_, defaults to 3.0) – Total number of training epochs to perform (if not an integer, will perform the decimal part percents of the last epoch before stopping training).
* **max_steps** (int, _optional_, defaults to -1) – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached.
* **warmup_ratio** (float, _optional_, defaults to 0.0) – Ratio of total training steps used for a linear warmup from 0 to learning_rate.
* **warmup_steps** (int, _optional_, defaults to 0) – Number of steps used for a linear warmup from 0 to learning_rate. Overrides any effect of warmup_ratio.

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_lr_scheduler(name="cosine", warmup_ratio=0.05)
>>> args.warmup_ratio
0.05
```

set_optimizer(name: str | OptimizerNames = 'adamw_torch', learning_rate: float = 5e-05, weight_decay: float = 0, beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-08, args: str | None = None)

A method that regroups all arguments linked to the optimizer and its hyperparameters.

Parameters:

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_optimizer(name="adamw_torch", beta1=0.8)
>>> args.optim
'adamw_torch'
```


set_push_to_hub(model_id: str, strategy: str | HubStrategy = 'every_save', token: str | None = None, private_repo: bool | None = None, always_push: bool = False)

A method that regroups all arguments linked to synchronizing checkpoints with the Hub.

<Tip>

Calling this method will set self.push_to_hub to True, which means the output_dir will become a git directory synced with the repo (determined by model_id), and the content will be pushed each time a save is triggered (depending on self.save_strategy). Calling [~Trainer.save_model] will also trigger a push.

</Tip>

Parameters:

* **model_id** (str) – The name of the repository to keep in sync with the local _output_dir_. It can be a simple model ID, in which case the model will be pushed in your namespace. Otherwise it should be the whole repository name, for instance "user_name/model", which allows you to push to an organization you are a member of with "organization_name/model".
* **strategy** (str or [~trainer_utils.HubStrategy], _optional_, defaults to "every_save") – Defines the scope of what is pushed to the Hub and when. Possible values are:
  * "end": push the model, its configuration, the processing_class (e.g. tokenizer, if passed along to the [Trainer]) and a draft of a model card when the [~Trainer.save_model] method is called.
  * "every_save": push the model, its configuration, the processing_class (e.g. tokenizer, if passed along to the [Trainer]) and a draft of a model card each time there is a model save. The pushes are asynchronous to not block training, and in case the saves are very frequent, a new push is only attempted if the previous one is finished. A last push is made with the final model at the end of training.
  * "checkpoint": like "every_save", but the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to resume training easily with trainer.train(resume_from_checkpoint="last-checkpoint").
  * "all_checkpoints": like "checkpoint", but all checkpoints are pushed as they appear in the output folder (so you will get one checkpoint folder per folder in your final repository).
* **token** (str, _optional_) – The token to use to push the model to the Hub. Will default to the token in the cache folder obtained with huggingface-cli login.
* **private_repo** (bool, _optional_) – Whether to make the repo private. If None (default), the repo will be public unless the organization’s default is private. This value is ignored if the repo already exists.
* **always_push** (bool, _optional_, defaults to False) – Unless this is True, the Trainer will skip pushing a checkpoint when the previous push is not finished.

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_push_to_hub("me/awesome-model")
>>> args.hub_model_id
'me/awesome-model'
```

set_save(strategy: str | IntervalStrategy = 'steps', steps: int = 500, total_limit: int | None = None, on_each_node: bool = False)

A method that regroups all arguments linked to checkpoint saving.

Parameters:

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_save(strategy="steps", steps=100)
>>> args.save_steps
100
```


set_testing(batch_size: int = 8, loss_only: bool = False, jit_mode: bool = False)

A method that regroups all basic arguments linked to testing on a held-out dataset.

<Tip>

Calling this method will automatically set self.do\_predict to True.

</Tip>

Parameters:

* **batch_size** (int, _optional_, defaults to 8) – The batch size per device (GPU/TPU core/CPU…) used for testing.
* **loss_only** (bool, _optional_, defaults to False) – Ignores all outputs except the loss.
* **jit_mode** (bool, _optional_) – Whether or not to use PyTorch JIT trace for inference.

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_testing(batch_size=32)
>>> args.per_device_eval_batch_size
32
```

set_training(learning_rate: float = 5e-05, batch_size: int = 8, weight_decay: float = 0, num_epochs: float = 3, max_steps: int = -1, gradient_accumulation_steps: int = 1, seed: int = 42, gradient_checkpointing: bool = False)

A method that regroups all basic arguments linked to the training.

Calling this method will automatically set self.do_train to True.

Parameters:

Example:

```py
>>> from transformers import TrainingArguments
>>> args = TrainingArguments("working_dir")
>>> args = args.set_training(learning_rate=1e-4, batch_size=32)
>>> args.learning_rate
0.0001
```

property should_log

Whether or not the current process should produce logs.

property should_save

Whether or not the current process should write to disk, e.g., to save models and checkpoints.

to_dict()[source]

Serializes this instance while replacing Enum members by their values (for JSON serialization support). It obfuscates the token values by removing them.

to_json_string()

Serializes this instance to a JSON string.

to_sanitized_dict() → dict[str, Any]

Sanitized serialization to use with TensorBoard’s hparams.

property train_batch_size: int

The actual batch size for training (may differ from per_gpu_train_batch_size in distributed training).

property world_size

The number of processes used in parallel.