Trainer — PyTorch Lightning 2.5.1.post0 documentation

class lightning.pytorch.trainer.trainer.Trainer(*, accelerator='auto', strategy='auto', devices='auto', num_nodes=1, precision=None, logger=None, callbacks=None, fast_dev_run=False, max_epochs=None, min_epochs=None, max_steps=-1, min_steps=None, max_time=None, limit_train_batches=None, limit_val_batches=None, limit_test_batches=None, limit_predict_batches=None, overfit_batches=0.0, val_check_interval=None, check_val_every_n_epoch=1, num_sanity_val_steps=None, log_every_n_steps=None, enable_checkpointing=None, enable_progress_bar=None, enable_model_summary=None, accumulate_grad_batches=1, gradient_clip_val=None, gradient_clip_algorithm=None, deterministic=None, benchmark=None, inference_mode=True, use_distributed_sampler=True, profiler=None, detect_anomaly=False, barebones=False, plugins=None, sync_batchnorm=False, reload_dataloaders_every_n_epochs=0, default_root_dir=None, model_registry=None)

Bases: object

Customize every aspect of training via flags.

Parameters:

Raises:

fit(model, train_dataloaders=None, val_dataloaders=None, datamodule=None, ckpt_path=None)

Runs the full optimization routine.

Parameters:

Raises:

TypeError – If model is not a LightningModule (for torch versions below 2.0.0), or is neither a LightningModule nor a torch._dynamo.OptimizedModule (for torch versions 2.0.0 and above).

For more information about multiple dataloaders, see this section.

Return type:

None

init_module(empty_init=None)

Tensors that you instantiate under this context manager are created directly on the target device, with the data type determined by the precision setting in the Trainer.

Because parameters and tensors are allocated in their final location and data type from the start, no memory is wasted on unnecessary intermediate allocations.

Parameters:

empty_init (Optional[bool]) – Whether to initialize the model with empty weights (uninitialized memory). If None, the strategy will decide. Some strategies may not support all options. Set this to True if you are loading a checkpoint into a large model.

Return type:

Generator

predict(model=None, dataloaders=None, datamodule=None, return_predictions=None, ckpt_path=None)

Run inference on your data. This will call the model forward function to compute predictions. Useful to perform distributed and batched predictions. Logging is disabled in the predict hooks.

Parameters:

For more information about multiple dataloaders, see this section.

Return type:

Union[list[Any], list[list[Any]], None]

Returns:

Returns a list of predictions with one entry per batch, or, when several dataloaders are provided, a list of such lists with one per dataloader.

Raises:

See Lightning inference section for more.

print(*args, **kwargs)

Print something only on the first process. If running on multiple machines, it will print from the first process in each machine.

Arguments passed to this method are forwarded to the Python built-in print() function.

Return type:

None

save_checkpoint(filepath, weights_only=False, storage_options=None)

Runs routine to create a checkpoint.

This method needs to be called on all processes in case the selected strategy is handling distributed checkpointing.

Parameters:

Raises:

AttributeError – If the model is not attached to the Trainer before calling this method.

Return type:

None

test(model=None, dataloaders=None, ckpt_path=None, verbose=True, datamodule=None)

Perform one evaluation epoch over the test set. It’s separated from fit to make sure you never run on your test set until you want to.

Parameters:

For more information about multiple dataloaders, see this section.

Return type:

list[Mapping[str, float]]

Returns:

List of dictionaries with metrics logged during the test phase, e.g., in model- or callback hooks like test_step() etc. The length of the list corresponds to the number of test dataloaders used.

Raises:

validate(model=None, dataloaders=None, ckpt_path=None, verbose=True, datamodule=None)

Perform one evaluation epoch over the validation set.

Parameters:

For more information about multiple dataloaders, see this section.

Return type:

list[Mapping[str, float]]

Returns:

List of dictionaries with metrics logged during the validation phase, e.g., in model- or callback hooks like validation_step() etc. The length of the list corresponds to the number of validation dataloaders used.

Raises:

property callback_metrics: dict[str, torch.Tensor]

The metrics available to callbacks.

    def training_step(self, batch, batch_idx):
        self.log("a_val", 2.0)


    callback_metrics = trainer.callback_metrics
    assert callback_metrics["a_val"] == 2.0

property checkpoint_callback: Optional[Checkpoint]

The first ModelCheckpoint callback in the Trainer.callbacks list, or None if it doesn’t exist.

property checkpoint_callbacks: list[lightning.pytorch.callbacks.checkpoint.Checkpoint]

A list of all instances of ModelCheckpoint found in the Trainer.callbacks list.

property ckpt_path: Optional[Union[str, Path]]

Set to the path/URL of a checkpoint loaded via fit(), validate(), test(), or predict().

None otherwise.

property current_epoch: int

The current epoch, updated after the epoch end hooks are run.

property default_root_dir: str

The default location to save artifacts of loggers, checkpoints etc.

It is used as a fallback if logger or checkpoint callback do not define specific save paths.

property device_ids: list[int]

List of device indexes per node.

property early_stopping_callback: Optional[EarlyStopping]

The first EarlyStopping callback in the Trainer.callbacks list, or None if it doesn’t exist.

property early_stopping_callbacks: list[lightning.pytorch.callbacks.early_stopping.EarlyStopping]

A list of all instances of EarlyStopping found in the Trainer.callbacks list.

property enable_validation: bool

Check if we should run validation during training.

property estimated_stepping_batches: Union[int, float]

The estimated number of batches that will trigger optimizer.step() during training.

This accounts for gradient accumulation and the current trainer configuration. Accessing it may set up your training dataloader if that has not happened already.

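    def configure_optimizers(self):
        optimizer = ...
        stepping_batches = self.trainer.estimated_stepping_batches
        scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=stepping_batches)
        # OneCycleLR counts optimizer steps, so step the scheduler every batch
        # rather than once per epoch (the default interval).
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }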

Raises:

MisconfigurationException – If estimated stepping batches cannot be computed due to different accumulate_grad_batches at different epochs.
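The arithmetic behind this estimate can be approximated in plain Python. This is a rough sketch, not Lightning's implementation: the helper name and its parameters are hypothetical, and it assumes a finite dataloader and a constant accumulate_grad_batches.

```python
import math


def estimate_stepping_batches(
    num_training_batches, accumulate_grad_batches=1, max_epochs=1, max_steps=-1
):
    """Approximate how many batches end in an optimizer step (hypothetical helper)."""
    # With accumulation, only every N-th batch triggers optimizer.step();
    # a trailing partial group still produces one step.
    steps_per_epoch = math.ceil(num_training_batches / accumulate_grad_batches)
    estimate = steps_per_epoch * max_epochs
    # max_steps=-1 means "no step limit", mirroring the Trainer default.
    if max_steps != -1:
        estimate = min(estimate, max_steps)
    return estimate


# 100 batches per epoch, accumulating 4 batches per step, for 3 epochs.
print(estimate_stepping_batches(100, accumulate_grad_batches=4, max_epochs=3))
```

For training code, the property itself is the value to rely on; the helper only makes the role of gradient accumulation and max_steps easier to see.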

property global_step: int

The number of optimizer steps taken (does not reset each epoch).

This includes multiple optimizers (if enabled).

property is_global_zero: bool

Whether this process is the global zero in multi-node training.

    def training_step(self, batch, batch_idx):
        if self.trainer.is_global_zero:
            print("in node 0, accelerator 0")

property is_last_batch: bool

Whether trainer is executing the last batch.

property log_dir: Optional[str]

The directory for the current experiment. Use it to save images and other artifacts.

Note

You must call this on all processes. Failing to do so will cause your program to stall forever.

    def training_step(self, batch, batch_idx):
        img = ...
        save_img(img, self.trainer.log_dir)

property logged_metrics: dict[str, torch.Tensor]

The metrics sent to the loggers.

This includes metrics logged via log() with the logger argument set.

property logger: Optional[Logger]

The first Logger being used.

property loggers: list[lightning.pytorch.loggers.logger.Logger]

The list of Logger instances in use.

    for logger in trainer.loggers:
        logger.log_metrics({"foo": 1.0})

property model: Optional[Module]

The LightningModule, but possibly wrapped into DataParallel or DistributedDataParallel.

To access the pure LightningModule, use the lightning_module property instead.

property num_devices: int

Number of devices the trainer uses per node.

property num_predict_batches: list[Union[int, float]]

The number of prediction batches that will be used during trainer.predict().

property num_sanity_val_batches: list[Union[int, float]]

The number of validation batches that will be used during the sanity-checking part of trainer.fit().

property num_test_batches: list[Union[int, float]]

The number of test batches that will be used during trainer.test().

property num_training_batches: Union[int, float]

The number of training batches that will be used during trainer.fit().

property num_val_batches: list[Union[int, float]]

The number of validation batches that will be used during trainer.fit() or trainer.validate().

property predict_dataloaders: Optional[Any]

The prediction dataloader(s) used during trainer.predict().

property progress_bar_callback: Optional[ProgressBar]

An instance of ProgressBar found in the Trainer.callbacks list, or None if one doesn’t exist.

property progress_bar_metrics: dict[str, float]

The metrics sent to the progress bar.

This includes metrics logged via log() with the prog_bar argument set.

property received_sigterm: bool

Whether a signal.SIGTERM signal was received.

For example, this can be checked to exit gracefully.

property sanity_checking: bool

Whether sanity checking is running.

Useful to disable some hooks, logging or callbacks during the sanity checking.

property test_dataloaders: Optional[Any]

The test dataloader(s) used during trainer.test().

property train_dataloader: Optional[Any]

The training dataloader(s) used during trainer.fit().

property val_dataloaders: Optional[Any]

The validation dataloader(s) used during trainer.fit() or trainer.validate().