Processing — sagemaker 2.199.0 documentation


This module contains code related to the Processor class, which is used for Amazon SageMaker Processing Jobs. These jobs let users perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation and interpretation on Amazon SageMaker.

class sagemaker.processing.Processor(role=None, image_uri=None, instance_count=None, instance_type=None, entrypoint=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: object

Handles Amazon SageMaker Processing tasks.

Initializes a Processor instance.

The Processor handles Amazon SageMaker Processing tasks.

Parameters

JOB_CLASS_NAME = 'processing-job'

run(inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)

Runs a processing job.

Parameters

Returns

None or pipeline step arguments in case the Processor instance is built with PipelineSession

Raises

ValueError – if logs is True but wait is False.
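A minimal usage sketch (the role ARN, image URI, and S3 paths below are placeholders; running it requires a real AWS account and a container image whose entrypoint performs the processing):

```python
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

processor = Processor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-processing-image:latest",  # placeholder image
    instance_count=1,
    instance_type="ml.m5.xlarge",
    entrypoint=["python3", "/opt/program/process.py"],  # hypothetical entrypoint baked into the image
)

processor.run(
    inputs=[ProcessingInput(source="s3://my-bucket/raw-data", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://my-bucket/processed")],
    wait=True,
    logs=True,
)
```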

class sagemaker.processing.ScriptProcessor(role=None, image_uri=None, command=None, instance_count=None, instance_type=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.processing.Processor

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.

Initializes a ScriptProcessor instance.

The ScriptProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for providing a script to be run as part of the Processing Job.

Parameters

get_run_args(code, inputs=None, outputs=None, arguments=None)

Returns a RunArgs object.

For processors (PySparkProcessor, SparkJarProcessor) that have special run() arguments, this object contains the normalized arguments for passing to ProcessingStep.

Parameters

run(code, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)

Runs a processing job.

Parameters

Returns

None or pipeline step arguments in case the Processor instance is built with PipelineSession
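For example, a sketch that runs a local script with a custom image (the role, image, and S3 paths are placeholders; preprocess.py is a hypothetical script that the SDK uploads to S3 and mounts into the container):

```python
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

script_processor = ScriptProcessor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-python-image:latest",  # placeholder image with Python
    command=["python3"],  # how the container should invoke the script
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

script_processor.run(
    code="preprocess.py",  # local path; uploaded and passed to the command above
    inputs=[ProcessingInput(source="s3://my-bucket/raw-data", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://my-bucket/processed")],
    arguments=["--train-test-split", "0.2"],  # hypothetical script arguments
)
```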

class sagemaker.processing.ProcessingJob(sagemaker_session, job_name, inputs, outputs, output_kms_key=None)

Bases: sagemaker.job._Job

Provides functionality to start, describe, and stop processing jobs.

Initializes a Processing job.

Parameters

classmethod start_new(processor, inputs, outputs, experiment_config)

Starts a new processing job using the provided inputs and outputs.

Parameters

Returns

The instance of ProcessingJob created using the Processor.

Return type

ProcessingJob

classmethod from_processing_name(sagemaker_session, processing_job_name)

Initializes a ProcessingJob from a processing job name.

Parameters

Returns

The instance of ProcessingJob created from the job name.

Return type

ProcessingJob

classmethod from_processing_arn(sagemaker_session, processing_job_arn)

Initializes a ProcessingJob from a Processing ARN.

Parameters

Returns

The instance of ProcessingJob created from the processing job’s ARN.

Return type

ProcessingJob

wait(logs=True)

Waits for the processing job to complete.

Parameters

logs (bool) – Whether to show the logs produced by the job (default: True).

describe()

Prints out a response from the DescribeProcessingJob API call.

stop()

Stops the processing job.

static prepare_app_specification(container_arguments, container_entrypoint, image_uri)

Prepares a dict that represents a ProcessingJob’s AppSpecification.

Parameters

Returns

Represents AppSpecification which configures the processing job to run a specified Docker container image.

Return type

dict

static prepare_output_config(kms_key_id, outputs)

Prepares a dict that represents a ProcessingOutputConfig.

Parameters

Returns

Represents output configuration for the processing job.

Return type

dict

static prepare_processing_resources(instance_count, instance_type, volume_kms_key_id, volume_size_in_gb)

Prepares a dict that represents the ProcessingResources.

Parameters

Returns

Represents ProcessingResources which identifies the resources, ML compute instances, and ML storage volumes to deploy for a processing job.

Return type

dict

static prepare_stopping_condition(max_runtime_in_seconds)

Prepares a dict that represents the job’s StoppingCondition.

Parameters

max_runtime_in_seconds (int) – Specifies the maximum runtime in seconds.

Returns

dict

class sagemaker.processing.ProcessingInput(source=None, destination=None, input_name=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type='None', s3_input=None, dataset_definition=None, app_managed=False)

Bases: object

Accepts parameters that specify an Amazon S3 input for a processing job.

Also provides a method to turn those parameters into a dictionary.

Initializes a ProcessingInput instance.

ProcessingInput accepts parameters that specify an Amazon S3 input for a processing job and provides a method to turn those parameters into a dictionary.

Parameters

class sagemaker.processing.ProcessingOutput(source=None, destination=None, output_name=None, s3_upload_mode='EndOfJob', app_managed=False, feature_store_output=None)

Bases: object

Accepts parameters that specify an Amazon S3 output for a processing job.

It also provides a method to turn those parameters into a dictionary.

Initializes a ProcessingOutput instance.

ProcessingOutput accepts parameters that specify an Amazon S3 output for a processing job and provides a method to turn those parameters into a dictionary.

Parameters

class sagemaker.processing.RunArgs(code, inputs=None, outputs=None, arguments=None)

Bases: object

Accepts parameters that correspond to ScriptProcessors.

An instance of this class is returned from the get_run_args() method on processors, and is used for normalizing the arguments so that they can be passed to ProcessingStep.

Parameters

Return type

None

Method generated by attrs for class RunArgs.
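A sketch of the pipeline pattern, assuming script_processor is a ScriptProcessor like the one constructed earlier and that the step is later added to a SageMaker Pipeline:

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

run_args = script_processor.get_run_args(
    code="preprocess.py",  # hypothetical local script
    inputs=[ProcessingInput(source="s3://my-bucket/raw-data", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination="s3://my-bucket/processed")],
    arguments=["--train-test-split", "0.2"],
)

# Reuse the normalized arguments when defining the pipeline step.
step = ProcessingStep(
    name="PreprocessData",
    processor=script_processor,
    code=run_args.code,
    inputs=run_args.inputs,
    outputs=run_args.outputs,
    job_arguments=run_args.arguments,
)
```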

class sagemaker.processing.FeatureStoreOutput(**kwargs)

Bases: sagemaker.apiutils._base_types.ApiObject

Configuration for processing job outputs in Amazon SageMaker Feature Store.

Init ApiObject.

feature_group_name = None

class sagemaker.processing.FrameworkProcessor(estimator_cls, framework_version, role=None, instance_count=None, instance_type=None, py_version='py3', image_uri=None, command=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, code_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.processing.ScriptProcessor

Handles Amazon SageMaker processing tasks for jobs using a machine learning framework.

Initializes a FrameworkProcessor instance.

The FrameworkProcessor handles Amazon SageMaker Processing tasks for jobs using a machine learning framework, which allows for a set of Python scripts to be run as part of the Processing Job.

Parameters

framework_entrypoint_command = ['/bin/bash']

get_run_args(code, source_dir=None, dependencies=None, git_config=None, inputs=None, outputs=None, arguments=None, job_name=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a FrameworkProcessor in a ProcessingStep.

Parameters

run(code, source_dir=None, dependencies=None, git_config=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, kms_key=None)

Runs a processing job.

Parameters

Returns

None or pipeline step arguments in case the Processor instance is built with PipelineSession
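A sketch using scikit-learn as the framework (the role and S3 paths are placeholders; pick a framework_version that is actually offered in your region):

```python
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn

processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version="1.2-1",  # assumption: a scikit-learn container version available in your region
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="evaluate.py",  # entry point inside source_dir
    source_dir="src",    # hypothetical local directory; it is packaged and uploaded for you
    inputs=[ProcessingInput(source="s3://my-bucket/model", destination="/opt/ml/processing/model")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/evaluation", destination="s3://my-bucket/evaluation")],
)
```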

This module is the entry point for running Spark processing scripts.

This module contains code related to Spark Processors, which are used for Processing jobs. These jobs let customers perform data pre-processing, post-processing, feature engineering, data validation, and model evaluation on SageMaker using Spark and PySpark.

class sagemaker.spark.processing.PySparkProcessor(role=None, instance_type=None, instance_count=None, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, configuration_location=None, dependency_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.spark.processing._SparkProcessorBase

Handles Amazon SageMaker processing tasks for jobs using PySpark.

Initializes a PySparkProcessor instance.

The PySparkProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker PySpark.

Parameters

get_run_args(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a PySparkProcessor in a ProcessingStep.

Parameters

run(submit_app, submit_py_files=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)

Runs a processing job.

Parameters
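A sketch of a PySpark job (the role and S3 paths are placeholders; pick a framework_version that matches a Spark container available in your region):

```python
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",  # assumption: a Spark container version available in your region
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

spark_processor.run(
    submit_app="./code/preprocess.py",  # hypothetical local PySpark script
    arguments=["--input", "s3://my-bucket/raw", "--output", "s3://my-bucket/prepared"],  # hypothetical args
    spark_event_logs_s3_uri="s3://my-bucket/spark-event-logs",  # optional: persist the Spark event logs
)
```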

class sagemaker.spark.processing.SparkJarProcessor(role=None, instance_type=None, instance_count=None, framework_version=None, py_version=None, container_version=None, image_uri=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, configuration_location=None, dependency_location=None, max_runtime_in_seconds=None, base_job_name=None, sagemaker_session=None, env=None, tags=None, network_config=None)

Bases: sagemaker.spark.processing._SparkProcessorBase

Handles Amazon SageMaker processing tasks for jobs using Spark with Java or Scala Jars.

Initializes a SparkJarProcessor instance.

The SparkJarProcessor handles Amazon SageMaker processing tasks for jobs using SageMaker Spark.

Parameters

get_run_args(submit_app, submit_class=None, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, job_name=None, configuration=None, spark_event_logs_s3_uri=None)

Returns a RunArgs object.

This object contains the normalized inputs, outputs and arguments needed when using a SparkJarProcessor in a ProcessingStep.

Parameters

run(submit_app, submit_class, submit_jars=None, submit_files=None, inputs=None, outputs=None, arguments=None, wait=True, logs=True, job_name=None, experiment_config=None, configuration=None, spark_event_logs_s3_uri=None, kms_key=None)

Runs a processing job.

Parameters
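A sketch for a Spark application packaged as a jar (the jar path and main class are hypothetical; the role is a placeholder):

```python
from sagemaker.spark.processing import SparkJarProcessor

jar_processor = SparkJarProcessor(
    base_job_name="sm-spark-java",
    framework_version="3.1",  # assumption: a Spark container version available in your region
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

jar_processor.run(
    submit_app="./code/spark-preprocessor.jar",    # hypothetical path to the application jar
    submit_class="com.example.SparkPreprocessor",  # hypothetical main class inside the jar
    arguments=["--input", "s3://my-bucket/raw", "--output", "s3://my-bucket/prepared"],
)
```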

class sagemaker.spark.processing.FileType(value)

Bases: enum.Enum

Enum of file types.

JAR = 1

PYTHON = 2

FILE = 3

class sagemaker.spark.processing.SparkConfigUtils

Bases: object

Utility class for Spark configurations.

static validate_configuration(configuration)

Validates the user-provided Hadoop/Spark/Hive configuration.

This ensures that the list or dictionary the user provides will serialize to JSON matching the schema of EMR’s application configuration

Parameters

configuration (Dict) – A dict that contains the configuration overrides to the default values. For more information, please visit: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
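For example, a configuration following the EMR classification/properties schema (the property values are purely illustrative):

```python
from sagemaker.spark.processing import SparkConfigUtils

configuration = {
    "Classification": "spark-defaults",
    "Properties": {"spark.executor.memory": "2g", "spark.executor.cores": "1"},
}

# Raises ValueError if the dict does not match the expected schema.
SparkConfigUtils.validate_configuration(configuration)
```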

static validate_s3_uri(spark_output_s3_path)

Validate whether the URI uses an S3 scheme.

In the future, this validation will perform deeper S3 validation.

Parameters

spark_output_s3_path (str) – The URI of the Spark output S3 Path.

This module configures the SageMaker Clarify bias and model explainability processor jobs.

SageMaker Clarify

class sagemaker.clarify.DatasetType(value)

Bases: enum.Enum

Enum to store different dataset types supported in the Analysis config file

TEXT_CSV = 'text/csv'

JSONLINES = 'application/jsonlines'

JSON = 'application/json'

PARQUET = 'application/x-parquet'

IMAGE = 'application/x-image'

class sagemaker.clarify.SegmentationConfig(name_or_index, segments, config_name=None, display_aliases=None)

Bases: object

Config object that defines segment(s) of the dataset on which metrics are computed.

Initializes a segmentation configuration for a dataset column.

Parameters

Raises

ValueError – when the name_or_index is None, segments is invalid, or a wrong number of display_aliases are specified.

to_dict()

Returns SegmentationConfig as a dict.

Return type

Dict[str, Any]
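A hypothetical sketch that segments a categorical column named "country" (the column, segment values, and aliases are all assumptions made for illustration):

```python
from sagemaker.clarify import SegmentationConfig

segment_config = SegmentationConfig(
    name_or_index="country",          # hypothetical column to segment on
    segments=[["US"], ["UK", "DE"]],  # each inner list defines one segment
    config_name="by_country",
    display_aliases=["us_only", "uk_and_de"],
)
print(segment_config.to_dict())
```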

class sagemaker.clarify.DataConfig(s3_data_input_path, s3_output_path, s3_analysis_config_output_path=None, label=None, headers=None, features=None, dataset_type='text/csv', s3_compression_type='None', joinsource=None, facet_dataset_uri=None, facet_headers=None, predicted_label_dataset_uri=None, predicted_label_headers=None, predicted_label=None, excluded_columns=None, segmentation_config=None)

Bases: object

Config object related to configurations of the input and output dataset.

Initializes a configuration of both input and output datasets.

Parameters

Raises

ValueError – when the dataset_type is invalid, predicted label dataset parameters are used with an unsupported dataset_type, or facet dataset parameters are used with an unsupported dataset_type.

get_config()

Returns part of an analysis config dictionary.
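A minimal sketch for a CSV dataset (the S3 URIs and column names are placeholders):

```python
from sagemaker.clarify import DataConfig

data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/train.csv",   # placeholder input dataset
    s3_output_path="s3://my-bucket/clarify/output",          # where the analysis results are written
    label="target",                                          # name of the label column
    headers=["feature_0", "feature_1", "gender", "target"],  # all column names, including the label
    dataset_type="text/csv",
)
```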

class sagemaker.clarify.BiasConfig(label_values_or_threshold, facet_name, facet_values_or_threshold=None, group_name=None)

Bases: object

Config object with user-defined bias configurations of the input dataset.

Initializes a configuration of the sensitive groups in the dataset.

Parameters

Raises

ValueError – If the number of facet names doesn’t equal the number of facet values.

get_config()

Returns a dictionary of bias detection configurations, part of the analysis config
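A minimal sketch, assuming a binary label where 1 is the favorable outcome and a hypothetical sensitive column named "gender":

```python
from sagemaker.clarify import BiasConfig

bias_config = BiasConfig(
    label_values_or_threshold=[1],  # label value(s) considered the favorable outcome
    facet_name="gender",            # hypothetical sensitive attribute column
    facet_values_or_threshold=[0],  # value(s) identifying the group to check for bias
)
```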

class sagemaker.clarify.ModelConfig(model_name=None, instance_count=None, instance_type=None, accept_type=None, content_type=None, content_template=None, record_template=None, custom_attributes=None, accelerator_type=None, endpoint_name_prefix=None, target_model=None, endpoint_name=None)

Bases: object

Config object related to a model and its endpoint to be created.

Initializes a configuration of a model and the endpoint to be created for it.

Parameters

Raises

ValueError – when any of the following is true:

- endpoint_name_prefix is invalid,
- accept_type is invalid,
- content_type is invalid,
- content_template has no placeholder “features”,
- both [endpoint_name] AND [model_name, instance_count, instance_type] are set,
- both [endpoint_name] AND [endpoint_name_prefix] are set

get_predictor_config()

Returns part of the predictor dictionary of the analysis config.
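A minimal sketch pointing Clarify at an existing SageMaker model (the model name is a placeholder); Clarify creates and tears down a temporary endpoint from this configuration:

```python
from sagemaker.clarify import ModelConfig

model_config = ModelConfig(
    model_name="my-trained-model",  # placeholder: name of an existing SageMaker Model
    instance_type="ml.m5.xlarge",
    instance_count=1,
    content_type="text/csv",        # format of the requests sent to the endpoint
    accept_type="text/csv",         # format of the responses expected back
)
```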

class sagemaker.clarify.ModelPredictedLabelConfig(label=None, probability=None, probability_threshold=None, label_headers=None)

Bases: object

Config object to extract a predicted label from the model output.

Initializes a model output config to extract the predicted label or predicted score(s).

The following examples show different parameter configurations depending on the endpoint:
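For instance, a sketch under the assumption of two common output formats (the JSON keys below are hypothetical):

```python
from sagemaker.clarify import ModelPredictedLabelConfig

# Endpoint that returns a single probability score per record:
# threshold it at 0.5 to turn the score into a predicted label.
score_config = ModelPredictedLabelConfig(probability_threshold=0.5)

# Endpoint that returns JSON Lines such as {"predicted_label": 1, "probability": 0.73}
# (hypothetical output format): point Clarify at the relevant keys instead.
jsonlines_config = ModelPredictedLabelConfig(
    label="predicted_label",
    probability="probability",
)
```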

Parameters

Raises

TypeError – when the probability_threshold cannot be cast to a float

get_predictor_config()

Returns probability_threshold and predictor config dictionary.

class sagemaker.clarify.ExplainabilityConfig

Bases: abc.ABC

Abstract config class to configure an explainability method.

abstract get_explainability_config()

Returns config.

class sagemaker.clarify.PDPConfig(features=None, grid_resolution=15, top_k_features=10)

Bases: sagemaker.clarify.ExplainabilityConfig

Config class for Partial Dependence Plots (PDP).

PDPs show the marginal effect (the dependence) a subset of features has on the predicted outcome of an ML model.

When PDP is requested (by passing in a PDPConfig to the explainability_config parameter of SageMakerClarifyProcessor), the Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.

Initializes PDP config.

Parameters

get_explainability_config()

Returns PDP config dictionary.
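A minimal sketch (the feature names are placeholders; if features is omitted, the top_k_features most important features are selected instead):

```python
from sagemaker.clarify import PDPConfig

pdp_config = PDPConfig(
    features=["age", "income"],  # hypothetical feature names to plot
    grid_resolution=20,          # number of buckets each feature's range is split into
)
```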

class sagemaker.clarify.TextConfig(granularity, language)

Bases: object

Config object to handle text features for text explainability

SHAP analysis breaks down longer text into chunks (e.g. tokens, sentences, or paragraphs) and replaces them with the strings specified in the baseline for that feature. The SHAP value of a chunk then captures how much replacing it affects the prediction.

Initializes a text configuration.

Parameters

Raises

ValueError – when granularity is not in list of supported values or language is not in list of supported values

get_text_config()

Returns a text config dictionary, part of the analysis config dictionary.
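For example, to explain predictions sentence by sentence for an English-language text feature:

```python
from sagemaker.clarify import TextConfig

text_config = TextConfig(granularity="sentence", language="english")
```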

class sagemaker.clarify.ImageConfig(model_type, num_segments=None, feature_extraction_method=None, segment_compactness=None, max_objects=None, iou_threshold=None, context=None)

Bases: object

Config object for handling images

Initializes a config object for Computer Vision (CV) Image explainability.

SHAP for CV explainability generates heat maps that visualize feature attributions for input images. These heat maps highlight an image’s features according to how much they contribute to the CV model prediction.

"IMAGE_CLASSIFICATION" and "OBJECT_DETECTION" are the two supported CV use cases.

Parameters

get_image_config()

Returns the image config part of an analysis config dictionary.

class sagemaker.clarify.SHAPConfig(baseline=None, num_samples=None, agg_method=None, use_logit=False, save_local_shap_values=True, seed=None, num_clusters=None, text_config=None, image_config=None, features_to_explain=None)

Bases: sagemaker.clarify.ExplainabilityConfig

Config class for SHAP.

The SHAP algorithm calculates feature attributions by computing the contribution of each feature to the prediction outcome, using the concept of Shapley values.

These attributions can be provided for specific predictions (locally) and at a global level for the model as a whole.

Initializes config for SHAP analysis.

Parameters

Raises

ValueError – when agg_method is invalid, baseline and num_clusters are provided together, or features_to_explain is specified when text_config or image_config is provided.

get_explainability_config()

Returns a shap config dictionary.
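A minimal sketch (the baseline row is illustrative and must have the same feature columns as the dataset; if baseline is omitted, a baseline can instead be computed by clustering via num_clusters):

```python
from sagemaker.clarify import SHAPConfig

shap_config = SHAPConfig(
    baseline=[[0.5, 1.0, 0]],  # one baseline record with the same feature columns as the dataset
    num_samples=100,           # number of synthetic samples generated per record
    agg_method="mean_abs",     # aggregate per-record attributions into global importance
    save_local_shap_values=True,
)
```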

class sagemaker.clarify.SageMakerClarifyProcessor(role=None, instance_count=None, instance_type=None, volume_size_in_gb=30, volume_kms_key=None, output_kms_key=None, max_runtime_in_seconds=None, sagemaker_session=None, env=None, tags=None, network_config=None, job_name_prefix=None, version=None, skip_early_validation=False)

Bases: sagemaker.processing.Processor

Handles SageMaker Processing tasks to compute bias metrics and model explanations.

Initializes a SageMakerClarifyProcessor to compute bias metrics and model explanations.

The resulting processor is an instance of Processor.

Parameters

run(**_)

Overriding the base class method but deferring to specific run_* methods.

run_pre_training_bias(data_config, data_bias_config, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute pre-training bias methods

Computes the requested methods on the input data. The methods compare metrics (e.g. fraction of examples) for the sensitive group(s) vs. the other examples.

Parameters

run_post_training_bias(data_config, data_bias_config, model_config=None, model_predicted_label_config=None, methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute post-training bias methods.

Spins up a model endpoint and runs inference over the input dataset in the s3_data_input_path (from the DataConfig) to obtain predicted labels. Using model predictions, computes the requested post-training bias methods that compare metrics (e.g. accuracy, precision, recall) for the sensitive group(s) versus the other examples.

Parameters

run_bias(data_config, bias_config, model_config=None, model_predicted_label_config=None, pre_training_methods='all', post_training_methods='all', wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob to compute the requested bias methods

Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in ‘s3_data_input_path’ (from the DataConfig) to obtain predicted labels.

Parameters
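Tying the earlier config sketches together (the role is a placeholder; data_config, bias_config, model_config, and score_config refer to the objects built in the sketches above):

```python
from sagemaker.clarify import SageMakerClarifyProcessor

clarify_processor = SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

clarify_processor.run_bias(
    data_config=data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=score_config,
    pre_training_methods="all",
    post_training_methods="all",
)
```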

run_explainability(data_config, model_config, explainability_config, model_scores=None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob computing feature attributions.

Spins up a model endpoint.

Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.

When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model’s prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.

When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.

Parameters
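Continuing the sketch above, both methods can be requested in one job by passing a list of explainability configs (shap_config and pdp_config are the objects built earlier):

```python
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=[shap_config, pdp_config],  # request SHAP and PDP in the same job
)
```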

run_bias_and_explainability(data_config, model_config, explainability_config, bias_config, pre_training_methods='all', post_training_methods='all', model_predicted_label_config=None, wait=True, logs=True, job_name=None, kms_key=None, experiment_config=None)

Runs a ProcessingJob computing feature attributions.

For bias: Computes metrics for both the pre-training and the post-training methods. To calculate post-training methods, it spins up a model endpoint and runs inference over the input examples in ‘s3_data_input_path’ (from the DataConfig) to obtain predicted labels.

For Explainability: Spins up a model endpoint.

Currently, only SHAP and Partial Dependence Plots (PDP) are supported as explainability methods. You can request both methods or one at a time with the explainability_config parameter.

When SHAP is requested in the explainability_config, the SHAP algorithm calculates the feature importance for each input example in the s3_data_input_path of the DataConfig, by creating num_samples copies of the example with a subset of features replaced with values from the baseline. It then runs model inference to see how the model’s prediction changes with the replaced features. If the model output returns multiple scores, importance is computed for each score. Across examples, feature importance is aggregated using agg_method.

When PDP is requested in the explainability_config, the PDP algorithm calculates the dependence of the target response on the input features and marginalizes over the values of all other input features. The Partial Dependence Plots are included in the output report and the corresponding values are included in the analysis output.

Parameters

class sagemaker.clarify.ProcessingOutputHandler

Bases: object

Class to handle the parameters for the SageMaker Processor’s ProcessingOutput.

class S3UploadMode(value)

Bases: enum.Enum

Enum values for different upload modes to the S3 bucket.

CONTINUOUS = 'Continuous'

ENDOFJOB = 'EndOfJob'

classmethod get_s3_upload_mode(analysis_config)

Fetches the s3_upload_mode based on the shap_config values.

Parameters

analysis_config (dict) – Config dictionary following the analysis_config.json format.

Returns

The s3_upload_mode type for the processing output.

Return type

str