CheckpointSaver

class composer.callbacks.CheckpointSaver(folder='{run_name}/checkpoints', filename='ep{epoch}-ba{batch}-rank{rank}.pt', remote_file_name='{run_name}/checkpoints/ep{epoch}-ba{batch}-rank{rank}.pt', latest_filename='latest-rank{rank}.pt', latest_remote_file_name='{run_name}/checkpoints/latest-rank{rank}.pt', save_interval='1ep', *, overwrite=False, num_checkpoints_to_keep=-1, weights_only=False, ignore_keys=None, num_concurrent_uploads=1, upload_timeout_in_seconds=3600)

Callback to save checkpoints.
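The default names in the signature above are templates with placeholders such as {epoch}, {batch}, and {rank}. The sketch below illustrates how such a template expands, using plain Python str.format as a stand-in. It is an assumption for illustration only: Composer's own name formatting is not shown in this page and may support additional fields. The helper name expand and the values passed to it are hypothetical.

```python
# Default templates copied from the class signature above.
FILENAME = "ep{epoch}-ba{batch}-rank{rank}.pt"
LATEST_FILENAME = "latest-rank{rank}.pt"


def expand(template: str, **fields: object) -> str:
    """Fill a name template with str.format (a stand-in, not Composer's formatter)."""
    return template.format(**fields)


# Hypothetical values: epoch 3, batch 1500, written by global rank 0.
example_name = expand(FILENAME, epoch=3, batch=1500, rank=0)
latest_name = expand(LATEST_FILENAME, rank=0)

print(example_name)  # ep3-ba1500-rank0.pt
print(latest_name)   # latest-rank0.pt
```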

Note

If the folder argument is specified when constructing the Trainer, then the CheckpointSaver callback need not be constructed manually. However, for advanced checkpointing use cases (such as saving a weights-only checkpoint at one interval and the full training state at another interval), one or more instances of this CheckpointSaver callback can be passed in the callbacks argument of the Trainer, as shown in the example below.

Example

    trainer = Trainer(..., callbacks=[
        CheckpointSaver(
            folder='{run_name}/checkpoints',
            filename="ep{epoch}-ba{batch}-rank{rank}",
            latest_filename="latest-rank{rank}",
            save_interval="1ep",
            weights_only=False,
        )
    ])
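The advanced use case mentioned in the note (a weights-only checkpoint at one interval and the full training state at another) could be configured roughly as follows. This is a minimal sketch that only builds keyword-argument dictionaries, so it runs without Composer installed. The parameter names come from the signature above; the folder names and the "5ep" interval are illustrative assumptions.

```python
# Keyword arguments for two hypothetical CheckpointSaver instances.
# Folder names and intervals are assumptions chosen for illustration.
weights_only_saver_kwargs = {
    "folder": "{run_name}/checkpoints/weights",
    "filename": "ep{epoch}-ba{batch}-rank{rank}.pt",
    "save_interval": "1ep",   # weights only, every epoch
    "weights_only": True,
}
full_state_saver_kwargs = {
    "folder": "{run_name}/checkpoints/full",
    "filename": "ep{epoch}-ba{batch}-rank{rank}.pt",
    "save_interval": "5ep",   # full training state, every fifth epoch
    "weights_only": False,
}

# With Composer installed, these would be passed along like this (not executed here):
#   from composer import Trainer
#   from composer.callbacks import CheckpointSaver
#   trainer = Trainer(..., callbacks=[
#       CheckpointSaver(**weights_only_saver_kwargs),
#       CheckpointSaver(**full_state_saver_kwargs),
#   ])
```

Using a separate folder for each saver keeps the two sets of files, and their latest-rank{rank}.pt entries, from colliding.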

Parameters

saved_checkpoints

The checkpoint timestamps and filepaths.

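Each entry in this list is a tuple of the save timestamp and the list of checkpoint filepaths written at that time. When num_checkpoints_to_keep is non-negative, the list will have at most num_checkpoints_to_keep entries; the default of -1 keeps all checkpoints. The latest checkpoint will be at the end.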

Note

When using DeepSpeed, the index of a filepath in each list corresponds to the global rank of the process that wrote that file. Each filepath is valid only on the process’s (rank’s) node.

When not using DeepSpeed, each sub-list will contain only one filepath, since only rank zero saves checkpoints.

Type

list[tuple[Timestamp, list[Path]]]
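To make the shape of saved_checkpoints concrete, the sketch below reads the most recent entry from a list with the same structure. It uses stand-in data so it runs without Composer: plain strings replace Timestamp objects, the paths are made up, and latest_checkpoint is a hypothetical helper, not part of the library.

```python
from pathlib import Path

# Stand-in data shaped like list[tuple[Timestamp, list[Path]]].
# Plain strings replace Composer's Timestamp objects in this sketch.
saved_checkpoints = [
    ("ep1-ba100", [Path("run/checkpoints/ep1-ba100-rank0.pt")]),
    ("ep2-ba200", [Path("run/checkpoints/ep2-ba200-rank0.pt")]),
]


def latest_checkpoint(entries, rank=0):
    """Return (timestamp, path) for the newest entry, or None if nothing was saved."""
    if not entries:
        return None
    timestamp, paths = entries[-1]  # the latest checkpoint is at the end
    # Without DeepSpeed there is a single path; with DeepSpeed the index
    # of a path corresponds to the global rank that wrote it.
    index = rank if len(paths) > 1 else 0
    return timestamp, paths[index]


latest = latest_checkpoint(saved_checkpoints)
print(latest)
```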