Trainer

class composer.Trainer(*, model, train_dataloader=None, train_dataloader_label='train', train_subset_num_batches=-1, spin_dataloaders=True, max_duration=None, algorithms=None, algorithm_passes=None, optimizers=None, schedulers=None, scale_schedule_ratio=1.0, step_schedulers_every_batch=None, eval_dataloader=None, eval_interval=1, eval_subset_num_batches=-1, callbacks=None, loggers=None, run_name=None, progress_bar=True, log_to_console=False, console_stream='stderr', console_log_interval='1ba', log_traces=False, auto_log_hparams=False, load_path=None, load_object_store=None, load_weights_only=False, load_strict_model_weights=True, load_progress_bar=True, load_ignore_keys=None, load_exclude_algorithms=None, save_folder=None, save_filename='ep{epoch}-ba{batch}-rank{rank}.pt', save_latest_filename='latest-rank{rank}.pt', save_overwrite=False, save_interval='1ep', save_weights_only=False, save_ignore_keys=None, save_num_checkpoints_to_keep=-1, save_metrics=False, autoresume=False, deepspeed_config=None, fsdp_config=None, fsdp_auto_wrap=True, parallelism_config=None, device=None, precision=None, precision_config=None, device_train_microbatch_size=None, accumulate_train_batch_on_tokens=False, seed=None, deterministic_mode=False, dist_timeout=300.0, ddp_sync_strategy=None, profiler=None, python_log_level=None, compile_config=None)[source]#

Train models with Composer algorithms.

The trainer supports models with ComposerModel instances. The Trainer is highly customizable and can support a wide variety of workloads. See the training guide for more information.

Example

Train a model and save a checkpoint:

import os
from composer import Trainer

# Create a trainer
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    schedulers=scheduler,
    device="cpu",
    eval_interval="1ep",
    save_folder="checkpoints",
    save_filename="ep{epoch}.pt",
    save_interval="1ep",
    save_overwrite=True,
)

# Fit and run evaluation for 1 epoch.
# Save a checkpoint after 1 epoch as specified during trainer creation.
trainer.fit()

Load the checkpoint and resume training:

# Get the saved checkpoint filepath
checkpoint_path = trainer.saved_checkpoints.pop()

# Create a new trainer with the load_path argument set to the checkpoint path.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    schedulers=scheduler,
    device="cpu",
    eval_interval="1ep",
    load_path=checkpoint_path,
)

# Continue training and running evaluation where the previous trainer left off
# until the new max_duration is reached.
# In this case it will be one additional epoch to reach 2 epochs total.
trainer.fit()

Parameters

For example, a checkpoint can be loaded from object storage by combining load_path with load_object_store:
from composer.utils import LibcloudObjectStore

# Create the object store provider with the specified credentials
creds = {
    "key": "object_store_key",
    "secret": "object_store_secret",
}
store = LibcloudObjectStore(
    provider="s3",
    container="my_container",
    provider_kwargs=creds,
)
checkpoint_path = "./path_to_the_checkpoint_in_object_store"

# Create a trainer which will load a checkpoint from the specified object store
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="10ep",
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    schedulers=scheduler,
    device="cpu",
    eval_interval="1ep",
    load_path=checkpoint_path,
    load_object_store=store,
)

state#

The State object used to store training state.

Type

State

evaluators#

The Evaluator objects to use for validation during training.

Type

list[Evaluator]

logger#

The Logger used for logging.

Type

Logger

engine#

The Engine used for running callbacks and algorithms.

Type

Engine
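
As a quick orientation, these attributes can be accessed directly on a constructed trainer; the metric name below is only a placeholder for illustration:

# Inspect training state and log a custom metric through the trainer's logger.
print(trainer.state.max_duration)
trainer.logger.log_metrics({'custom/placeholder_metric': 1.0})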

close()[source]#

Shut down the trainer.

eval(eval_dataloader=None, subset_num_batches=-1)[source]#

Run evaluation loop.

Results are stored in trainer.state.eval_metrics. The eval_dataloader can be provided either to the eval() method or to the Trainer during init().

Examples:

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    device="cpu",
)

trainer.fit()

# run eval
trainer.eval(
    eval_dataloader=eval_dataloader,
)

Or, if the eval_dataloader is provided during init:

trainer = Trainer(
    model=model,
    eval_dataloader=eval_dataloader,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    device="cpu",
)

trainer.fit()

# eval_dataloader already provided:
trainer.eval()

For multiple metrics or dataloaders, use Evaluator to provide identifier names. For example, to run the GLUE task:

from composer.core import Evaluator
from composer.models.nlp_metrics import BinaryF1Score

glue_mrpc_task = Evaluator(
    label='glue_mrpc',
    dataloader=mrpc_dataloader,
    metric_names=['BinaryF1Score', 'MulticlassAccuracy'],
)

glue_mnli_task = Evaluator(
    label='glue_mnli',
    dataloader=mnli_dataloader,
    metric_names=['MulticlassAccuracy'],
)

trainer = Trainer(
    ...,
    eval_dataloader=[glue_mrpc_task, glue_mnli_task],
    ...,
)

The metrics used are defined in your model’s get_metrics() method. For more information, see 📊 Evaluation.
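
As a rough sketch of what that looks like (the model class, number of classes, and batch format below are assumptions for illustration, not part of the Trainer API), a ComposerModel typically wires its metrics up along these lines:

import torch.nn.functional as F
from torchmetrics.classification import MulticlassAccuracy
from composer.models import ComposerModel

class MyClassifier(ComposerModel):
    # Hypothetical classifier; only the metric wiring matters here.
    def __init__(self, module, num_classes=10):
        super().__init__()
        self.module = module
        self.train_acc = MulticlassAccuracy(num_classes=num_classes, average='micro')
        self.val_acc = MulticlassAccuracy(num_classes=num_classes, average='micro')

    def forward(self, batch):
        inputs, _ = batch
        return self.module(inputs)

    def loss(self, outputs, batch):
        _, targets = batch
        return F.cross_entropy(outputs, targets)

    def get_metrics(self, is_train=False):
        # The dict keys are the names an Evaluator refers to via metric_names.
        return {'MulticlassAccuracy': self.train_acc if is_train else self.val_acc}

    def update_metric(self, batch, outputs, metric):
        _, targets = batch
        metric.update(outputs, targets)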

Note

If evaluating with multiple GPUs using a DistributedSampler with drop_last=False, the last batch may contain duplicate samples, which can affect metrics. Composer will correctly drop these duplicate samples as long as the dataset passed to the DistributedSampler has a defined length.
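
For instance, a multi-GPU eval dataloader might be built like this (eval_dataset and the batch size are placeholders; the sampler is standard PyTorch):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# eval_dataset is assumed to define __len__, which is what lets Composer
# detect and drop the padding duplicates added by the sampler.
eval_sampler = DistributedSampler(eval_dataset, shuffle=False, drop_last=False)
eval_dataloader = DataLoader(eval_dataset, batch_size=128, sampler=eval_sampler)

trainer.eval(eval_dataloader=eval_dataloader)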

Parameters

export_for_inference(save_format, save_path, save_object_store=None, sample_input=None, transforms=None, input_names=None, output_names=None)[source]#

Export a model for inference.

Parameters

Returns

None
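
A minimal sketch of a call, assuming an ONNX export of a vision-style model; the file name and the sample input shape are placeholders, and the sample input must match whatever the model's forward() expects:

import torch

# Example input used for tracing; structure and shape are assumptions.
sample_input = (torch.randn(1, 3, 224, 224),)

trainer.export_for_inference(
    save_format='onnx',
    save_path='model.onnx',
    sample_input=sample_input,
)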

fit(*, train_dataloader=None, train_dataloader_label='train', train_subset_num_batches=None, spin_dataloaders=None, duration=None, reset_time=False, schedulers=None, scale_schedule_ratio=1.0, step_schedulers_every_batch=None, eval_dataloader=None, eval_subset_num_batches=-1, eval_interval=1, device_train_microbatch_size=None, precision=None)[source]#

Train the model.

The Composer Trainer supports multiple calls to fit(). Any arguments specified during the call to fit() will override the values specified when constructing the Trainer. All arguments are optional, with two exceptions: train_dataloader must be specified here if it was not passed to the constructor, and duration must be specified here if max_duration was not passed to the constructor or if this is a subsequent call to fit().

For example, the following are equivalent:

# The train_dataloader and duration can be specified
# when constructing the Trainer
trainer_1 = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",
)
trainer_1.fit()

# Or, these arguments can be specified on fit()
trainer_2 = Trainer(model)
trainer_2.fit(
    train_dataloader=train_dataloader,
    duration="1ep",
)

When invoking fit() a subsequent time, either reset_time or duration must be specified. Otherwise, it is ambiguous how long to train.

For example:

# Construct the trainer
trainer = Trainer(max_duration="1ep")

# Train for 1 epoch
trainer.fit()
assert trainer.state.timestamp.epoch == "1ep"

# Reset the time to 0, then train for 1 epoch
trainer.fit(reset_time=True)
assert trainer.state.timestamp.epoch == "1ep"

# Train for another epoch (2 epochs total)
trainer.fit(duration="1ep")
assert trainer.state.timestamp.epoch == "2ep"

# Train for another batch (2 epochs + 1 batch total)
# It's OK to switch time units!
trainer.fit(duration="1ba")
assert trainer.state.timestamp.epoch == "2ep"
assert trainer.state.timestamp.batch_in_epoch == "1ba"

# Reset the time, then train for 3 epochs
trainer.fit(reset_time=True, duration="3ep")
assert trainer.state.timestamp.epoch == "3ep"

Parameters

predict(dataloader, subset_num_batches=-1, *, return_outputs=True)[source]#

Output model prediction on the provided data.

There are two ways to access the prediction outputs.

  1. With return_outputs set to True, the batch predictions will be collected into a list and returned.
  2. Via a custom callback, which can be used with return_outputs set to False.
    This technique can be useful if collecting all the outputs from the dataloader would exceed available memory, and you want to write outputs directly to files. For example:
    import os
    import torch
    from torch.utils.data import DataLoader
    from composer import Trainer, Callback
    from composer.core import State
    from composer.loggers import Logger

    class PredictionSaver(Callback):
        def __init__(self, folder: str):
            self.folder = folder
            os.makedirs(self.folder, exist_ok=True)

        def predict_batch_end(self, state: State, logger: Logger) -> None:
            name = f'batch_{int(state.predict_timestamp.batch)}.pt'
            filepath = os.path.join(self.folder, name)
            torch.save(state.outputs, filepath)

            # Also upload the files
            logger.upload_file(remote_file_name=name, file_path=filepath)

    trainer = Trainer(
        ...,
        callbacks=PredictionSaver('./predict_outputs'),
    )

    trainer.predict(predict_dl, return_outputs=False)

    print(sorted(os.listdir('./predict_outputs')))

Parameters

Returns

list – A list of batch outputs, if return_outputs is True. Otherwise, an empty list.
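
As a counterpart to the callback approach in option 2 above, the in-memory path from option 1 is simply the following (predict_dl is a placeholder dataloader):

# Collect every batch's outputs into a list; assumes they fit in memory.
outputs = trainer.predict(predict_dl, return_outputs=True)
print(len(outputs))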

save_checkpoint(name='ep{epoch}-ba{batch}-rank{rank}', *, weights_only=False)[source]#

Checkpoint the training State.

Parameters

Returns

str or None – See save_checkpoint().
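
A minimal sketch of a manual checkpoint, assuming a trainer that has already been constructed; the name template below simply mirrors the default pattern:

# Save a one-off checkpoint; placeholders in the name are filled from the
# current timestamp and rank, and the resulting path (or None) is returned.
checkpoint_path = trainer.save_checkpoint(name='manual-ep{epoch}-ba{batch}-rank{rank}')
print(checkpoint_path)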

save_checkpoint_to_save_folder()[source]#

Checkpoints the training State using a CheckpointSaver if it exists.

Raises

ValueError – If _checkpoint_saver does not exist.

Returns

None
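
A sketch of the intended usage, assuming save_folder was set when the trainer was constructed (otherwise this call raises ValueError because no CheckpointSaver exists):

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",
    save_folder="checkpoints",
)
trainer.fit()

# Write an extra checkpoint into the configured save_folder.
trainer.save_checkpoint_to_save_folder()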

property saved_checkpoints#

Returns list of saved checkpoints.

Note

For DeepSpeed, which saves a file on every rank, only the files corresponding to the current process's rank will be shown.