CheckpointHook — mmengine 0.10.7 documentation
class mmengine.hooks.CheckpointHook(interval=-1, by_epoch=True, save_optimizer=True, save_param_scheduler=True, out_dir=None, max_keep_ckpts=-1, save_last=True, save_best=None, rule=None, greater_keys=None, less_keys=None, file_client_args=None, filename_tmpl=None, backend_args=None, published_keys=None, save_begin=0, **kwargs)[source]¶
Save checkpoints periodically.
Parameters:
- interval (int) – The saving period. If by_epoch=True, interval indicates epochs; otherwise it indicates iterations. Defaults to -1, which means "never".
- by_epoch (bool) – Saving checkpoints by epoch or by iteration. Defaults to True.
- save_optimizer (bool) – Whether to save optimizer state_dict in the checkpoint. It is usually used for resuming experiments. Defaults to True.
- save_param_scheduler (bool) – Whether to save param_scheduler state_dict in the checkpoint. It is usually used for resuming experiments. Defaults to True.
- out_dir (str, Path, optional) – The root directory to save checkpoints. If not specified, runner.work_dir will be used by default. If specified, out_dir will be the concatenation of out_dir and the last-level directory of runner.work_dir. For example, if out_dir is ./tmp and runner.work_dir is ./work_dir/cur_exp, then the checkpoint will be saved in ./tmp/cur_exp. Defaults to None.
- max_keep_ckpts (int) – The maximum number of checkpoints to keep. In some cases we want only the latest few checkpoints and would like to delete old ones to save disk space. Defaults to -1, which means unlimited.
- save_last (bool) – Whether to force the last checkpoint to be saved regardless of interval. Defaults to True.
- save_best (str, List[str], optional) – If a metric is specified, it will measure the best checkpoint during evaluation. If a list of metrics is passed, it will measure a group of best checkpoints corresponding to the passed metrics. The information about the best checkpoint(s) is saved in runner.message_hub to keep the best score value and the best checkpoint path, and is also loaded when resuming from a checkpoint. Options are the evaluation metrics on the test dataset, e.g., bbox_mAP and segm_mAP for bbox detection and instance segmentation, or AR@100 for proposal recall. If save_best is auto, the first key of the returned OrderedDict result will be used. Defaults to None.
- rule (str, List[str], optional) – Comparison rule for the best score. If set to None, a reasonable rule is inferred: keys such as 'acc' and 'top' are inferred as 'greater', while keys containing 'loss' are inferred as 'less'. If save_best is a list of metrics and rule is a str, all metrics in save_best share the same comparison rule. If save_best and rule are both lists, their lengths must match, and each metric in save_best uses the corresponding rule in rule. Options are 'greater', 'less', None, and a list containing 'greater' and 'less'. Defaults to None.
- greater_keys (List[str], optional) – Metric keys that will be inferred by the 'greater' comparison rule. If None, _default_greater_keys will be used. Defaults to None.
- less_keys (List[str], optional) – Metric keys that will be inferred by the 'less' comparison rule. If None, _default_less_keys will be used. Defaults to None.
- file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmengine.fileio.FileClient for details. Defaults to None. It will be deprecated in the future; please use backend_args instead.
- filename_tmpl (str, optional) – String template for the checkpoint name. If specified, it must contain exactly one "{}", which is replaced with epoch + 1 if by_epoch=True, else iteration + 1. Defaults to None, which means "epoch_{}.pth" or "iter_{}.pth" accordingly.
- backend_args (dict, optional) – Arguments to instantiate the backend corresponding to the URI prefix. Defaults to None. New in version 0.2.0.
- published_keys (str, List[str], optional) – If save_last is True or save_best is not None, the model will automatically be published with the keys in the list after training. Defaults to None. New in version 0.7.1.
- save_begin (int) – The epoch number or iteration number at which checkpoint saving begins. Defaults to 0, which means saving from the beginning. New in version 0.8.3.
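The out_dir concatenation described above can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the hook's actual implementation; resolve_out_dir is a hypothetical helper name.

```python
import os.path as osp

def resolve_out_dir(out_dir, work_dir):
    """Mimic the documented out_dir behavior: when out_dir is given,
    checkpoints go to out_dir joined with the last-level directory of
    runner.work_dir; otherwise work_dir is used as-is."""
    if out_dir is None:
        return work_dir
    basename = osp.basename(osp.normpath(work_dir))
    return osp.join(out_dir, basename)

print(resolve_out_dir('./tmp', './work_dir/cur_exp'))  # ./tmp/cur_exp
print(resolve_out_dir(None, './work_dir/cur_exp'))     # ./work_dir/cur_exp
```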
Examples
Save best based on single metric
CheckpointHook(interval=2, by_epoch=True, save_best='acc', rule='less')
Save best based on multi metrics with the same comparison rule
CheckpointHook(interval=2, by_epoch=True, save_best=['acc', 'mIoU'], rule='greater')
Save best based on multi metrics with different comparison rule
CheckpointHook(interval=2, by_epoch=True, save_best=['FID', 'IS'], rule=['less', 'greater'])
Save best based on single metric and publish model after training
CheckpointHook(interval=2, by_epoch=True, save_best='acc', rule='less', published_keys=['meta', 'state_dict'])
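The rule-inference behavior described for the rule parameter can be sketched as follows. This is a simplified stand-in for the hook's internal logic; infer_rule is a hypothetical helper, and the key tuples below are illustrative subsets, not mmengine's actual _default_greater_keys / _default_less_keys.

```python
# Illustrative subsets; mmengine's real default key lists contain more entries.
GREATER_KEYS = ('acc', 'top', 'mAP', 'AR@', 'precision', 'recall')
LESS_KEYS = ('loss',)

def infer_rule(key):
    """Mimic the documented inference: 'acc'/'top'-like keys map to
    'greater', keys containing 'loss' map to 'less'."""
    lowered = key.lower()
    if any(k.lower() in lowered for k in GREATER_KEYS):
        return 'greater'
    if any(k.lower() in lowered for k in LESS_KEYS):
        return 'less'
    raise ValueError(f'Cannot infer comparison rule for key: {key}')

print(infer_rule('bbox_mAP'))  # greater
print(infer_rule('val_loss'))  # less
```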
after_train(runner)[source]¶
Publish the checkpoint after training.
Parameters:
runner (Runner) – The runner of the training process.
Return type:
None
after_train_epoch(runner)[source]¶
Save the checkpoint and synchronize buffers after each epoch.
Parameters:
runner (Runner) – The runner of the training process.
Return type:
None
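As a rough sketch of when a periodic save triggers under by_epoch=True, the interaction of interval, save_begin, and save_last can be written as a small predicate. should_save_epoch is a hypothetical helper that assumes the check runs against the 0-based index of the epoch that just finished; the real hook's bookkeeping differs in details.

```python
def should_save_epoch(epoch, max_epochs, interval, save_begin=0, save_last=True):
    """Decide whether to save after `epoch` (0-based index of the epoch
    that just finished) out of `max_epochs` total epochs."""
    is_last = (epoch + 1 == max_epochs)
    if save_last and is_last:
        return True   # the final checkpoint is saved regardless of interval
    if interval <= 0:
        return False  # interval=-1 means never save periodically
    if epoch < save_begin:
        return False  # saving has not begun yet
    return (epoch + 1 - save_begin) % interval == 0

print(should_save_epoch(1, 10, interval=2))   # True  (end of 2nd epoch)
print(should_save_epoch(0, 10, interval=2))   # False
print(should_save_epoch(9, 10, interval=-1))  # True  (save_last)
```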
after_train_iter(runner, batch_idx, data_batch=None, outputs=None)[source]¶
Save the checkpoint and synchronize buffers after each iteration.
Parameters:
- runner (Runner) – The runner of the training process.
- batch_idx (int) – The index of the current batch in the train loop.
- data_batch (dict or tuple or list, optional) – Data from dataloader.
- outputs (dict, optional) – Outputs from model.
Return type:
None
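The default checkpoint naming driven by filename_tmpl and by_epoch can be illustrated like this. checkpoint_name is a hypothetical helper mimicking the documented template behavior, not the hook's own method.

```python
def checkpoint_name(step, by_epoch=True, filename_tmpl=None):
    """Format a checkpoint filename. `step` is the 0-based epoch or
    iteration index; the template's "{}" is filled with step + 1,
    matching the documented behavior."""
    if filename_tmpl is None:
        filename_tmpl = 'epoch_{}.pth' if by_epoch else 'iter_{}.pth'
    return filename_tmpl.format(step + 1)

print(checkpoint_name(4))                               # epoch_5.pth
print(checkpoint_name(999, by_epoch=False))             # iter_1000.pth
print(checkpoint_name(4, filename_tmpl='ckpt_{}.pth'))  # ckpt_5.pth
```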
after_val_epoch(runner, metrics)[source]¶
Save the checkpoint and synchronize buffers after each evaluation epoch.
Parameters:
- runner (Runner) – The runner of the training process.
- metrics (dict) – Evaluation results of all metrics.
before_train(runner)[source]¶
Finish all operations related to checkpoints.
This function will get the appropriate file client, and the directory to save these checkpoints of the model.
Parameters:
runner (Runner) – The runner of the training process.
Return type:
None
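Finally, the max_keep_ckpts behavior (keep only the newest checkpoints and delete older ones) can be sketched as a pure function over the list of saved checkpoint paths. This is illustrative only; the real hook removes the files through its file backend. prune_checkpoints is a hypothetical helper.

```python
def prune_checkpoints(saved, max_keep_ckpts):
    """`saved` lists checkpoint paths from oldest to newest. Returns
    (kept, to_delete); max_keep_ckpts=-1 means keep everything."""
    if max_keep_ckpts < 0:
        return list(saved), []
    if max_keep_ckpts == 0:
        return [], list(saved)
    return saved[-max_keep_ckpts:], saved[:-max_keep_ckpts]

ckpts = ['epoch_1.pth', 'epoch_2.pth', 'epoch_3.pth', 'epoch_4.pth']
print(prune_checkpoints(ckpts, 2))
# (['epoch_3.pth', 'epoch_4.pth'], ['epoch_1.pth', 'epoch_2.pth'])
```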