Debugger - sagemaker 2.247.0 documentation

Amazon SageMaker Debugger provides full visibility into training jobs of state-of-the-art machine learning models. This SageMaker Debugger module provides high-level methods to set up Debugger configurations to monitor, profile, and debug your training job. Configure the Debugger-specific parameters when constructing a SageMaker estimator to gain visibility and insights into your training job.

Debugger Rule APIs

sagemaker.debugger.get_rule_container_image_uri(name, region)

Return the Debugger rule image URI for the given AWS Region.

For a full list of rule image URIs, see Use Debugger Docker Images for Built-in or Custom Rules.

Parameters:

region (str) – The name of the AWS Region. For example, 'us-east-1'.

Returns:

Formatted image URI for the given AWS Region and the rule container type.

Return type:

str
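As a rough illustration of what the returned string looks like, the sketch below assembles an Amazon ECR image URI from its parts. The helper `format_ecr_image_uri` is hypothetical, the account ID `123456789012` is a placeholder, and the repository name and tag are assumptions made for illustration. The actual values for each Region are in the list of rule image URIs referenced above.

```python
def format_ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Assemble an image URI in the usual ECR form: registry/repository:tag."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID and assumed repository name, for illustration only.
uri = format_ecr_image_uri("123456789012", "us-east-1", "sagemaker-debugger-rules")
print(uri)
```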

sagemaker.debugger.get_default_profiler_processing_job(instance_type=None, volume_size_in_gb=None)

Return the default profiler processing job (a rule) with a unique name.

Returns:

The instance of the built-in ProfilerRule.

Return type:

sagemaker.debugger.ProfilerRule

class sagemaker.debugger.rule_configs

A helper module to configure the SageMaker Debugger built-in rules with the Rule classmethods and the ProfilerRule classmethods.

For a full list of built-in rules, see List of Debugger Built-in Rules.

This module is imported from the Debugger client library for rule configuration. For more information, see Amazon SageMaker Debugger RulesConfig.

class sagemaker.debugger.RuleBase(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters)

Bases: ABC

The SageMaker Debugger rule base class that cannot be instantiated directly.

Tip

Debugger rule classes inheriting this RuleBase class are Rule and ProfilerRule. Do not directly use the rule base class to instantiate a SageMaker Debugger rule. Use the Rule classmethods for debugging and the ProfilerRule classmethods for profiling.

name

The name of the rule.

Type:

str

image_uri

The image URI to use for the rule.

Type:

str

instance_type

The type of EC2 instance to use. For example, 'ml.c4.xlarge'.

Type:

str

container_local_output_path

The local path to store the Rule output.

Type:

str

s3_output_path

The location in S3 to store the output.

Type:

str

volume_size_in_gb

Size in GB of the EBS volume to use for storing data.

Type:

int

rule_parameters

A dictionary of parameters for the rule.

Type:

dict

Method generated by attrs for class RuleBase.

class sagemaker.debugger.Rule(name, image_uri, instance_type, container_local_output_path, s3_output_path, volume_size_in_gb, rule_parameters, collections_to_save, actions=None)

Bases: RuleBase

The SageMaker Debugger Rule class configures debugging rules to debug your training job.

The debugging rules analyze tensor outputs from your training job and monitor conditions that are critical for the success of the training job.

SageMaker Debugger comes pre-packaged with built-in debugging rules. For example, the debugging rules can detect whether gradients are getting too large or too small, or if a model is overfitting. For a full list of built-in rules for debugging, see List of Debugger Built-in Rules. You can also write your own rules using the custom rule classmethod.

Configure the debugging rules using the following classmethods.

Tip

Use the following Rule.sagemaker class method for built-in debugging rules or the Rule.custom class method for custom debugging rules. Do not directly use the Rule initialization method.

classmethod sagemaker(base_config, name=None, container_local_output_path=None, s3_output_path=None, other_trials_s3_input_paths=None, rule_parameters=None, collections_to_save=None, actions=None)

Initialize a Rule object for a built-in debugging rule.

Parameters:

Returns:

An instance of the built-in rule.

Return type:

Rule

Example of how to create a built-in rule instance:

from sagemaker.debugger import Rule, rule_configs

built_in_rules = [
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_in_pysdk_format_n())
]

Replace each built_in_rule_name_in_pysdk_format_* placeholder with the name of a built-in rule. You can find the rule names in the List of Debugger Built-in Rules.

Example of creating a built-in rule instance with adjusted parameter values:

from sagemaker.debugger import CollectionConfig, Rule, rule_configs

built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name_in_pysdk_format(),
        rule_parameters={
            "key": "value"
        },
        collections_to_save=[
            CollectionConfig(
                name="tensor_collection_name",
                parameters={
                    "key": "value"
                }
            )
        ]
    )
]

For more information about setting up the rule_parameters parameter, see List of Debugger Built-in Rules.

For more information about setting up the collections_to_save parameter, see the CollectionConfig class.
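One way to picture the effect of rule_parameters is as a dictionary merge in which the values you pass override the defaults carried by the base configuration. The sketch below models that idea with plain dictionaries. The helper `merge_rule_parameters`, the layout of `base_config`, the rule name, and the threshold values are all assumptions for illustration and are not the SDK's actual implementation.

```python
from typing import Optional


def merge_rule_parameters(base_config: dict, rule_parameters: Optional[dict] = None) -> dict:
    """Return the base configuration's default parameters overridden by user values."""
    defaults = base_config.get("DebugRuleConfiguration", {}).get("RuleParameters", {})
    merged = dict(defaults)  # copy so the base configuration is not mutated
    merged.update(rule_parameters or {})
    return merged


# Hypothetical base configuration, shaped like what a rule_configs helper might return.
base_config = {
    "DebugRuleConfiguration": {
        "RuleConfigurationName": "ExampleRule",
        "RuleParameters": {"rule_to_invoke": "ExampleRule", "threshold": "0.5"},
    }
}

merged = merge_rule_parameters(base_config, {"threshold": "0.1"})
print(merged)
```

In this model, any key that is not overridden keeps its default value, which is why only the parameters you want to change need to be listed.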

classmethod custom(name, image_uri, instance_type, volume_size_in_gb, source=None, rule_to_invoke=None, container_local_output_path=None, s3_output_path=None, other_trials_s3_input_paths=None, rule_parameters=None, collections_to_save=None, actions=None)

Initialize a Rule object for a custom debugging rule.

You can create a custom rule that analyzes tensors emitted during the training of a model and monitors conditions that are critical for the success of a training job. For more information, see Create Debugger Custom Rules for Training Job Analysis.

Parameters:

Returns:

The instance of the custom rule.

Return type:

Rule

prepare_actions(training_job_name)

Prepare actions for Debugger Rule.

Parameters:

training_job_name (str) – The training job name. It is set as the default training job prefix for the StopTraining action if that action is specified.

to_debugger_rule_config_dict()

Generates a request dictionary using the parameters provided when initializing the object.

Returns:

A portion of an API request as a dictionary.

Return type:

dict
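A minimal sketch of what such a request portion might look like. It assumes the field names of the DebugRuleConfiguration structure in the SageMaker CreateTrainingJob API and assumes that unset optional fields are omitted. The function `rule_config_dict` is a hypothetical stand-in for illustration, not the SDK's implementation, and the image URI is a placeholder.

```python
from typing import Any, Dict, Optional


def rule_config_dict(
    name: str,
    image_uri: str,
    instance_type: Optional[str] = None,
    volume_size_in_gb: Optional[int] = None,
    container_local_output_path: Optional[str] = None,
    s3_output_path: Optional[str] = None,
    rule_parameters: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Map rule attributes to request fields, dropping fields that are unset."""
    candidate = {
        "RuleConfigurationName": name,
        "RuleEvaluatorImage": image_uri,
        "InstanceType": instance_type,
        "VolumeSizeInGB": volume_size_in_gb,
        "LocalPath": container_local_output_path,
        "S3OutputPath": s3_output_path,
        "RuleParameters": rule_parameters,
    }
    return {key: value for key, value in candidate.items() if value is not None}


request_part = rule_config_dict(
    name="ExampleRule",                      # hypothetical rule name
    image_uri="<rule-image-uri>",            # placeholder, not a real URI
    rule_parameters={"rule_to_invoke": "ExampleRule"},
)
print(request_part)
```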

Debugger Configuration APIs

class sagemaker.debugger.CollectionConfig(name, parameters=None)

Bases: object

Creates tensor collections for SageMaker Debugger.

Constructor for collection configuration.

Parameters:

Example of creating a CollectionConfig object:

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(name="tensor_collection_1"),
    CollectionConfig(name="tensor_collection_2"),
    ...
    CollectionConfig(name="tensor_collection_n")
]

For a full list of Debugger built-in collections, see Debugger Built in Collections.

Example of creating a CollectionConfig object with parameter adjustment:

You can use the following CollectionConfig template in two ways: (1) to adjust the parameters of the built-in tensor collections, and (2) to create custom tensor collections.

If you pass a built-in collection name to the name parameter, CollectionConfig matches it to the built-in collection and adjusts its parameters. If you pass a new name to the name parameter, CollectionConfig creates a new tensor collection, and you must use the include_regex parameter to specify regex patterns for the tensors you want to collect.

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="tensor_collection",
        parameters={
            "key_1": "value_1",
            "key_2": "value_2",
            ...
            "key_n": "value_n"
        }
    )
]
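To see how include_regex might select tensors for a custom collection, the sketch below filters a list of hypothetical tensor names against a list of patterns. It assumes a tensor is kept when any pattern matches somewhere in its name (re.search semantics). The helper `select_tensors` and the tensor names are made up for illustration, so check the Debugger client library for the exact matching rule.

```python
import re
from typing import Iterable, List


def select_tensors(tensor_names: Iterable[str], include_regex: Iterable[str]) -> List[str]:
    """Keep the tensor names that match at least one of the given patterns."""
    patterns = [re.compile(pattern) for pattern in include_regex]
    return [name for name in tensor_names if any(p.search(name) for p in patterns)]


# Hypothetical tensor names, for illustration only.
tensor_names = ["conv1/weight", "conv1/bias", "fc/weight", "relu_output_0"]
selected = select_tensors(tensor_names, [r".*weight", r"relu"])
print(selected)
```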

The following list shows the available CollectionConfig parameters.

Parameter Key – Description
include_regex – Specify a list of regex patterns of tensors to save. Tensors whose names match these patterns will be saved.
save_histogram – Set to True if you want to save histogram output data for TensorFlow visualization.
reductions – Specify certain reduction values of tensors. This helps reduce the amount of data saved and increase training speed. Available values are min, max, median, mean, std, variance, sum, and prod.
save_interval, train.save_interval, eval.save_interval, predict.save_interval, global.save_interval – Specify how often to save tensors in steps. You can also specify the save intervals in TRAIN, EVAL, PREDICT, and GLOBAL modes. The default value is 500 steps.
save_steps, train.save_steps, eval.save_steps, predict.save_steps, global.save_steps – Specify the exact step numbers to save tensors. You can also specify the save steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.
start_step, train.start_step, eval.start_step, predict.start_step, global.start_step – Specify the exact start step to save tensors. You can also specify the start steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.
end_step, train.end_step, eval.end_step, predict.end_step, global.end_step – Specify the exact end step to save tensors. You can also specify the end steps in TRAIN, EVAL, PREDICT, and GLOBAL modes.
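The step-related parameters can be read as a filter over training steps. The sketch below is a simplified model of how they might interact, not the Debugger client library's actual logic. It assumes that explicitly listed save_steps are always saved, that start_step is inclusive, and that end_step is exclusive. The function `should_save` is hypothetical.

```python
from typing import Optional, Sequence


def should_save(
    step: int,
    save_interval: int = 500,          # default interval stated in the parameter list
    start_step: int = 0,
    end_step: Optional[int] = None,
    save_steps: Optional[Sequence[int]] = None,
) -> bool:
    """Decide whether tensors would be saved at the given step (simplified model)."""
    if save_steps and step in save_steps:
        return True
    in_window = step >= start_step and (end_step is None or step < end_step)
    return in_window and step % save_interval == 0


saved = [
    step
    for step in range(0, 1001)
    if should_save(step, save_interval=100, start_step=200, end_step=600, save_steps=[5])
]
print(saved)
```

Under these assumptions, step 5 is saved because it is listed explicitly, and the interval only applies between steps 200 and 599.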

For example, the following code shows how to control the save_interval parameters of the built-in losses tensor collection. With the following collection configuration, Debugger collects loss values every 100 steps from training loops and every 10 steps from evaluation loops.

from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={
            "train.save_interval": "100",
            "eval.save_interval": "10"
        }
    )
]
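When several modes need different values, the mode-prefixed keys can be generated rather than typed by hand. The helper below, `per_mode_params`, is a hypothetical convenience and not part of the SDK. It builds the parameters dictionary from a mapping of mode to value and writes values as strings, following the example above.

```python
from typing import Dict, Mapping

ALLOWED_MODES = {"train", "eval", "predict", "global"}


def per_mode_params(key: str, by_mode: Mapping[str, int]) -> Dict[str, str]:
    """Build entries such as 'train.save_interval' from a mode-to-value mapping."""
    unknown = set(by_mode) - ALLOWED_MODES
    if unknown:
        raise ValueError(f"Unknown mode(s): {sorted(unknown)}")
    return {f"{mode}.{key}": str(value) for mode, value in by_mode.items()}


params = per_mode_params("save_interval", {"train": 100, "eval": 10})
print(params)

# The result could then be passed on, for example:
# CollectionConfig(name="losses", parameters=params)
```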

class sagemaker.debugger.DebuggerHookConfig(s3_output_path=None, container_local_output_path=None, hook_parameters=None, collection_configs=None)

Bases: object

Create a Debugger hook configuration object to save tensors for debugging.

DebuggerHookConfig provides options to customize how debugging information is emitted and saved. This high-level DebuggerHookConfig class is built on the smdebug.SaveConfig class.

Initialize the DebuggerHookConfig instance.

Parameters:

Example of creating a DebuggerHookConfig object:

from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

collection_configs=[
    CollectionConfig(name="tensor_collection_1"),
    CollectionConfig(name="tensor_collection_2"),
    ...
    CollectionConfig(name="tensor_collection_n")
]

hook_config = DebuggerHookConfig(
    collection_configs=collection_configs
)

class sagemaker.debugger.TensorBoardOutputConfig(s3_output_path, container_local_output_path=None)

Bases: object

Create a tensor output configuration object for debugging visualizations on TensorBoard.

Initialize the TensorBoardOutputConfig instance.

Parameters: