CheckpointHook — mmengine 0.10.7 documentation
class mmengine.hooks.CheckpointHook(interval=-1, by_epoch=True, save_optimizer=True, save_param_scheduler=True, out_dir=None, max_keep_ckpts=-1, save_last=True, save_best=None, rule=None, greater_keys=None, less_keys=None, file_client_args=None, filename_tmpl=None, backend_args=None, published_keys=None, save_begin=0, **kwargs)[source]¶
Save checkpoints periodically.
Parameters:
- interval (int) – The saving period. If by_epoch=True, interval indicates epochs; otherwise it indicates iterations. Defaults to -1, which means "never".
- by_epoch (bool) – Saving checkpoints by epoch or by iteration. Defaults to True.
- save_optimizer (bool) – Whether to save optimizer state_dict in the checkpoint. It is usually used for resuming experiments. Defaults to True.
- save_param_scheduler (bool) – Whether to save param_scheduler state_dict in the checkpoint. It is usually used for resuming experiments. Defaults to True.
- out_dir (str, Path, optional) – The root directory to save checkpoints. If not specified, runner.work_dir will be used by default. If specified, out_dir will be the concatenation of out_dir and the last-level directory of runner.work_dir. For example, if out_dir is ./tmp and runner.work_dir is ./work_dir/cur_exp, then the checkpoint will be saved in ./tmp/cur_exp. Defaults to None.
- max_keep_ckpts (int) – The maximum number of checkpoints to keep. In some cases we want only the latest few checkpoints and would like to delete old ones to save disk space. Defaults to -1, which means unlimited.
- save_last (bool) – Whether to force the last checkpoint to be saved regardless of interval. Defaults to True.
- save_best (str, List[str], optional) – If a metric is specified, it will measure the best checkpoint during evaluation. If a list of metrics is passed, it will measure a group of best checkpoints corresponding to the passed metrics. The information about the best checkpoint(s) is saved in runner.message_hub to keep the best score value and the best checkpoint path, and is also loaded when resuming from a checkpoint. Options are the evaluation metrics on the test dataset, e.g., bbox_mAP and segm_mAP for bbox detection and instance segmentation, or AR@100 for proposal recall. If save_best is auto, the first key of the returned OrderedDict result will be used. Defaults to None.
- rule (str, List[str], optional) – Comparison rule for the best score. If set to None, a reasonable rule is inferred: keys such as 'acc' and 'top' are inferred as 'greater', while keys containing 'loss' are inferred as 'less'. If save_best is a list of metrics and rule is a str, all metrics in save_best share the same comparison rule. If save_best and rule are both lists, their lengths must match, and each metric in save_best uses the corresponding rule in rule. Options are 'greater', 'less', None, and a list containing 'greater' and 'less'. Defaults to None.
- greater_keys (List[str], optional) – Metric keys that will be inferred by the 'greater' comparison rule. If None, _default_greater_keys will be used. Defaults to None.
- less_keys (List[str], optional) – Metric keys that will be inferred by the 'less' comparison rule. If None, _default_less_keys will be used. Defaults to None.
- file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Defaults to None. It will be deprecated in the future; please use backend_args instead.
- filename_tmpl (str, optional) – String template for the checkpoint name. If specified, it must contain exactly one "{}", which is replaced with epoch + 1 if by_epoch=True, else iteration + 1. Defaults to None, which means "epoch_{}.pth" or "iter_{}.pth" accordingly.
- backend_args (dict, optional) – Arguments to instantiate the backend corresponding to the URI prefix. Defaults to None. New in version 0.2.0.
- published_keys (str, List[str], optional) – If save_last is True or save_best is not None, the model will automatically be published with the keys in the list after training. Defaults to None. New in version 0.7.1.
- save_begin (int) – The epoch number or iteration number at which checkpoint saving begins. Defaults to 0, which means saving from the beginning. New in version 0.8.3.
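The out_dir concatenation described above can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the hook's actual implementation; resolve_out_dir is a hypothetical helper name.

```python
import os.path as osp

def resolve_out_dir(out_dir, work_dir):
    """Mimic the documented out_dir behavior: when out_dir is given,
    checkpoints go to out_dir joined with the last-level directory of
    runner.work_dir; otherwise work_dir is used as-is."""
    if out_dir is None:
        return work_dir
    basename = osp.basename(osp.normpath(work_dir))
    return osp.join(out_dir, basename)

print(resolve_out_dir('./tmp', './work_dir/cur_exp'))  # ./tmp/cur_exp
print(resolve_out_dir(None, './work_dir/cur_exp'))     # ./work_dir/cur_exp
```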
Examples
Save best based on single metric
CheckpointHook(interval=2, by_epoch=True, save_best='acc', rule='less')
Save best based on multi metrics with the same comparison rule
CheckpointHook(interval=2, by_epoch=True, save_best=['acc', 'mIoU'], rule='greater')
Save best based on multi metrics with different comparison rule
CheckpointHook(interval=2, by_epoch=True, save_best=['FID', 'IS'], rule=['less', 'greater'])
Save best based on single metric and publish model after training
CheckpointHook(interval=2, by_epoch=True, save_best='acc', rule='less', published_keys=['meta', 'state_dict'])
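The rule-inference behavior described for the rule parameter can be sketched as follows. This is a simplified stand-in for the hook's internal logic; infer_rule is a hypothetical helper, and the key tuples below are illustrative subsets, not mmengine's actual _default_greater_keys / _default_less_keys.

```python
# Illustrative subsets; mmengine's real default key lists contain more entries.
GREATER_KEYS = ('acc', 'top', 'mAP', 'AR@', 'precision', 'recall')
LESS_KEYS = ('loss',)

def infer_rule(key):
    """Mimic the documented inference: 'acc'/'top'-like keys map to
    'greater', keys containing 'loss' map to 'less'."""
    lowered = key.lower()
    if any(k.lower() in lowered for k in GREATER_KEYS):
        return 'greater'
    if any(k.lower() in lowered for k in LESS_KEYS):
        return 'less'
    raise ValueError(f'Cannot infer comparison rule for key: {key}')

print(infer_rule('bbox_mAP'))  # greater
print(infer_rule('val_loss'))  # less
```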
after_train(runner)[source]¶
Publish the checkpoint after training.
Parameters:
runner (Runner) – The runner of the training process.
Return type:
None
after_train_epoch(runner)[source]¶
Save the checkpoint and synchronize buffers after each epoch.
Parameters:
runner (Runner) – The runner of the training process.
Return type:
None
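As a rough sketch of when a periodic save triggers under by_epoch=True, the interaction of interval, save_begin, and save_last can be written as a small predicate. should_save_epoch is a hypothetical helper that assumes the check runs against the 0-based index of the epoch that just finished; the real hook's bookkeeping differs in details.

```python
def should_save_epoch(epoch, max_epochs, interval, save_begin=0, save_last=True):
    """Decide whether to save after `epoch` (0-based index of the epoch
    that just finished) out of `max_epochs` total epochs."""
    is_last = (epoch + 1 == max_epochs)
    if save_last and is_last:
        return True   # the final checkpoint is saved regardless of interval
    if interval <= 0:
        return False  # interval=-1 means never save periodically
    if epoch < save_begin:
        return False  # saving has not begun yet
    return (epoch + 1 - save_begin) % interval == 0

print(should_save_epoch(1, 10, interval=2))   # True  (end of 2nd epoch)
print(should_save_epoch(0, 10, interval=2))   # False
print(should_save_epoch(9, 10, interval=-1))  # True  (save_last)
```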
after_train_iter(runner, batch_idx, data_batch=None, outputs=None)[source]¶
Save the checkpoint and synchronize buffers after each iteration.
Parameters:
- runner (Runner) – The runner of the training process.
- batch_idx (int) – The index of the current batch in the train loop.
- data_batch (dict or tuple or list, optional) – Data from dataloader.
- outputs (dict, optional) – Outputs from model.
Return type:
None
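The default checkpoint naming driven by filename_tmpl and by_epoch can be illustrated like this. checkpoint_name is a hypothetical helper mimicking the documented template behavior, not the hook's own method.

```python
def checkpoint_name(step, by_epoch=True, filename_tmpl=None):
    """Format a checkpoint filename. `step` is the 0-based epoch or
    iteration index; the template's "{}" is filled with step + 1,
    matching the documented behavior."""
    if filename_tmpl is None:
        filename_tmpl = 'epoch_{}.pth' if by_epoch else 'iter_{}.pth'
    return filename_tmpl.format(step + 1)

print(checkpoint_name(4))                               # epoch_5.pth
print(checkpoint_name(999, by_epoch=False))             # iter_1000.pth
print(checkpoint_name(4, filename_tmpl='ckpt_{}.pth'))  # ckpt_5.pth
```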
after_val_epoch(runner, metrics)[source]¶
Save the checkpoint and synchronize buffers after each evaluation epoch.
Parameters:
- runner (Runner) – The runner of the training process.
- metrics (dict) – Evaluation results of all metrics.
before_train(runner)[source]¶
Finish all operations related to checkpoints.
This function will get the appropriate file client, and the directory to save these checkpoints of the model.
Parameters:
runner (Runner) – The runner of the training process.
Return type:
None
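Finally, the max_keep_ckpts behavior (keep only the newest checkpoints and delete older ones) can be sketched as a pure function over the list of saved checkpoint paths. This is illustrative only; the real hook removes the files through its file backend. prune_checkpoints is a hypothetical helper.

```python
def prune_checkpoints(saved, max_keep_ckpts):
    """`saved` lists checkpoint paths from oldest to newest. Returns
    (kept, to_delete); max_keep_ckpts=-1 means keep everything."""
    if max_keep_ckpts < 0:
        return list(saved), []
    if max_keep_ckpts == 0:
        return [], list(saved)
    return saved[-max_keep_ckpts:], saved[:-max_keep_ckpts]

ckpts = ['epoch_1.pth', 'epoch_2.pth', 'epoch_3.pth', 'epoch_4.pth']
print(prune_checkpoints(ckpts, 2))
# (['epoch_3.pth', 'epoch_4.pth'], ['epoch_1.pth', 'epoch_2.pth'])
```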