K-means (sagemaker 2.199.0 documentation)

The Amazon SageMaker K-means algorithm.

class sagemaker.KMeans(role=None, instance_count=None, instance_type=None, k=None, init_method=None, max_iterations=None, tol=None, num_trials=None, local_init_method=None, half_life_time_size=None, epochs=None, center_factor=None, eval_metrics=None, **kwargs)

Bases: sagemaker.amazon.amazon_estimator.AmazonAlgorithmEstimatorBase

An unsupervised learning algorithm that attempts to find discrete groupings within data.

As the result of KMeans, members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity.

A k-means clustering estimator derived from AmazonAlgorithmEstimatorBase.

Finds k clusters of data in an unlabeled dataset.
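As a rough illustration of what "finding k clusters" means, the following is a minimal sketch of the classic Lloyd's k-means iteration in plain Python. It is not the SageMaker implementation, which the AWS documentation describes separately; the function names (squared_distance, nearest, lloyd_kmeans) and the seed parameter are assumptions made for this example only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def lloyd_kmeans(points, k, max_iterations=100, seed=0):
    """Hypothetical helper: cluster points into k groups with Lloyd's algorithm.

    Returns (centers, assignments), where assignments[i] is the cluster
    index of points[i].
    """
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        new_assignments = [nearest(p, centers) for p in points]
        if new_assignments == assignments:
            break  # No point changed cluster, so the result is stable.
        assignments = new_assignments
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments
```

Calling lloyd_kmeans(points, 2) on two well-separated groups of points should place each group in its own cluster; which group receives index 0 depends on the random initialization.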

This Estimator may be fit via calls to fit_ndarray() or fit(). The former allows a KMeans model to be fit on a 2-dimensional numpy array. The latter requires Amazon Record protobuf-serialized data to be stored in S3.

To learn more about the Amazon protobuf Record class and how to prepare bulk data in this format, please consult AWS technical documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html.

After this Estimator is fit, model data is stored in S3. The model may be deployed to an Amazon SageMaker Endpoint by invoking deploy(). As well as deploying an Endpoint, deploy() returns a KMeansPredictor object that can be used to make k-means cluster assignments, using the trained k-means model hosted in the SageMaker Endpoint.

KMeans Estimators can be configured by setting hyperparameters. The available hyperparameters for KMeans are documented below. For further information on the AWS KMeans algorithm, please consult AWS technical documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html.
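The workflow described above (configure, fit, deploy) might be sketched as follows. This is a hedged example, not an official recipe: the wrapper name train_and_deploy_kmeans and the default instance types are assumptions for illustration, and running it requires the sagemaker package, AWS credentials, and an IAM role with SageMaker permissions. It uses record_set() and fit() as documented on this page.

```python
def train_and_deploy_kmeans(train, role, k,
                            train_instance_type="ml.c4.xlarge",
                            host_instance_type="ml.m4.xlarge"):
    """Hypothetical wrapper: train a KMeans estimator and deploy it.

    train is assumed to be a 2D numpy float32 array with one row per
    sample, and role an IAM role ARN. The instance types are example
    values only.
    """
    # Imported lazily so this function can be defined without the SDK installed.
    from sagemaker import KMeans

    estimator = KMeans(
        role=role,
        instance_count=1,
        instance_type=train_instance_type,
        k=k,  # number of clusters to find
    )
    # Serialize the array to protobuf Records in S3, then train on them.
    records = estimator.record_set(train)
    estimator.fit(records)
    # Host the trained model on an endpoint and return the predictor.
    return estimator.deploy(
        initial_instance_count=1,
        instance_type=host_instance_type,
    )
```

The returned predictor is backed by a running endpoint that incurs charges until it is deleted.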

Parameters

Tip

You can find additional parameters for initializing this class at AmazonAlgorithmEstimatorBase and EstimatorBase.

repo_name: str = 'kmeans'

repo_version: str = '1'

CONTAINER_CODE_CHANNEL_SOURCEDIR_PATH = '/opt/ml/input/data/code/sourcedir.tar.gz'

DEFAULT_MINI_BATCH_SIZE = None

INSTANCE_TYPE = 'sagemaker_instance_type'

JOB_CLASS_NAME = 'training-job'

LAUNCH_MPI_ENV_NAME = 'sagemaker_mpi_enabled'

LAUNCH_MWMS_ENV_NAME = 'sagemaker_multi_worker_mirrored_strategy_enabled'

LAUNCH_PS_ENV_NAME = 'sagemaker_parameter_server_enabled'

LAUNCH_PT_XLA_ENV_NAME = 'sagemaker_pytorch_xla_multi_worker_enabled'

LAUNCH_SM_DDP_ENV_NAME = 'sagemaker_distributed_dataparallel_enabled'

MPI_CUSTOM_MPI_OPTIONS = 'sagemaker_mpi_custom_mpi_options'

MPI_NUM_PROCESSES_PER_HOST = 'sagemaker_mpi_num_of_processes_per_host'

SM_DDP_CUSTOM_MPI_OPTIONS = 'sagemaker_distributed_dataparallel_custom_mpi_options'

classmethod attach(training_job_name, sagemaker_session=None, model_channel_name='model')

Attach to an existing training job.

Create an Estimator bound to an existing training job. Each subclass is responsible for implementing _prepare_init_params_from_job_description(), because this method delegates the conversion of a training job description to the arguments that the class constructor expects. After attaching, if the training job has a Complete status, it can be deployed with deploy() to create a SageMaker Endpoint and return a Predictor.

If the training job is in progress, attach will block until the training job completes, but the logs of the training job will not be displayed. To see the logs, call logs().

Examples

    my_estimator.fit(wait=False)
    training_job_name = my_estimator.latest_training_job.name

Later on:

    attached_estimator = Estimator.attach(training_job_name)
    attached_estimator.logs()
    attached_estimator.deploy()

Parameters

Returns

Instance of the calling Estimator Class with the attached training job.

compile_model(target_instance_family, input_shape, output_path, framework=None, framework_version=None, compile_max_run=900, tags=None, target_platform_os=None, target_platform_arch=None, target_platform_accelerator=None, compiler_options=None, **kwargs)

Compile a Neo model using the input model.

Parameters

Returns

A SageMaker Model object. See Model() for full details.

Return type

sagemaker.model.Model

property data_location

Placeholder docstring

delete_endpoint(**kwargs)

deploy(initial_instance_count=None, instance_type=None, serializer=None, deserializer=None, accelerator_type=None, endpoint_name=None, use_compiled_model=False, wait=True, model_name=None, kms_key=None, data_capture_config=None, tags=None, serverless_inference_config=None, async_inference_config=None, volume_size=None, model_data_download_timeout=None, container_startup_health_check_timeout=None, inference_recommendation_id=None, explainer_config=None, **kwargs)

Deploy the trained model to an Amazon SageMaker endpoint and return a sagemaker.Predictor object.

More information: http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

Parameters

Returns

A predictor that provides a predict() method, which can be used to send requests to the Amazon SageMaker endpoint and obtain inferences.

Return type

sagemaker.predictor.Predictor

disable_profiling()

Update the current training job in progress to disable profiling.

Debugger stops collecting the system and framework metrics and turns off the Debugger built-in monitoring and profiling rules.

enable_default_profiling()

Update training job to enable Debugger monitoring.

This method enables Debugger monitoring with the default profiler_config parameter to collect system metrics and the default built-in profiler_report rule. Framework metrics won’t be saved. To update the training job to emit framework metrics, use the update_profiler method and specify the framework metrics you want to enable.

This method is callable when the training job is in progress while Debugger monitoring is disabled.

enable_network_isolation()

Return True if this Estimator will need network isolation to run.

Returns

Whether this Estimator needs network isolation or not.

Return type

bool

fit(records, mini_batch_size=None, wait=True, logs=True, job_name=None, experiment_config=None)

Fit this Estimator on serialized Record objects, stored in S3.

records should be an instance of RecordSet. This defines a collection of S3 data files to train this Estimator on.

Training data is expected to be encoded as dense or sparse vectors in the “values” feature on each Record. If the data is labeled, the label is expected to be encoded as a list of scalars in the “values” feature of the Record label.

More information on the Amazon Record format is available at: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html

See record_set() to construct a RecordSet object from ndarray arrays.

Parameters

get_app_url(app_type, open_in_default_web_browser=True, create_presigned_domain_url=False, domain_id=None, user_profile_name=None, optional_create_presigned_url_kwargs=None)

Generate a URL to help access the specified app hosted in Amazon SageMaker Studio.

Parameters

Returns

A URL for the requested app in SageMaker Studio.

Return type

str

get_vpc_config(vpc_config_override='VPC_CONFIG_DEFAULT')

Return a VpcConfig dict built from this Estimator’s subnets and security groups, or else validate and return an optional override value.

Parameters

vpc_config_override

latest_job_debugger_artifacts_path()

Gets the path to the DebuggerHookConfig output artifacts.

Returns

An S3 path to the output artifacts.

Return type

str

latest_job_profiler_artifacts_path()

Gets the path to the profiling output artifacts.

Returns

An S3 path to the output artifacts.

Return type

str

latest_job_tensorboard_artifacts_path()

Gets the path to the TensorBoardOutputConfig output artifacts.

Returns

An S3 path to the output artifacts.

Return type

str

logs()

Display the logs for Estimator’s training job.

If the output is a tty or a Jupyter cell, it will be color-coded based on which instance the log entry is from.

property model_data

The model location in S3. Only set if Estimator has been fit().

Type

str or dict

prepare_workflow_for_training(records=None, mini_batch_size=None, job_name=None)

Calls _prepare_for_training. Used when setting up a workflow.

Parameters

record_set(train, labels=None, channel='train', encrypt=False)

Build a RecordSet from a numpy ndarray matrix and label vector.

For the 2D ndarray train, each row is converted to a Record object. The vector is stored in the “values” entry of the features property of each Record. If labels is not None, each corresponding label is assigned to the “values” entry of the labels property of each Record.

The collection of Record objects is protobuf serialized and uploaded to new S3 locations. A manifest file containing the list of objects created is generated and also stored in S3.

The number of S3 objects created is controlled by the instance_count property on this Estimator. One S3 object is created per training instance.
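To make the "one S3 object per training instance" rule concrete, the sketch below splits a list of rows into contiguous, roughly equal shards, one per instance. It only illustrates the idea: split_into_shards is a hypothetical helper, and the SDK's actual partitioning and upload logic may differ.

```python
def split_into_shards(rows, num_shards):
    """Hypothetical helper: split rows into num_shards contiguous pieces.

    Shard sizes differ by at most one row; earlier shards take the extras.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(len(rows), num_shards)
    shards = []
    start = 0
    for index in range(num_shards):
        # The first `extra` shards each hold one additional row.
        size = base + (1 if index < extra else 0)
        shards.append(rows[start:start + size])
        start += size
    return shards
```

With ten rows and instance_count set to 3, this produces shards of four, three, and three rows.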

Parameters

Returns

A RecordSet referencing the encoded, uploaded training and label data.

Return type

RecordSet

register(content_types=None, response_types=None, inference_instances=None, transform_instances=None, image_uri=None, model_package_name=None, model_package_group_name=None, model_metrics=None, metadata_properties=None, marketplace_cert=False, approval_status=None, description=None, compile_model_family=None, model_name=None, drift_check_baselines=None, customer_metadata_properties=None, domain=None, sample_payload_url=None, task=None, framework=None, framework_version=None, nearest_model_name=None, data_input_configuration=None, skip_model_validation=None, **kwargs)

Creates a model package for creating SageMaker models or listing on Marketplace.

Parameters

Returns

A string of SageMaker Model Package ARN.

Return type

str

training_image_uri()

Placeholder docstring

property training_job_analytics

Return a TrainingJobAnalytics object for the current training job.

transformer(instance_count, instance_type, strategy=None, assemble_with=None, output_path=None, output_kms_key=None, accept=None, env=None, max_concurrent_transforms=None, max_payload=None, tags=None, role=None, volume_kms_key=None, vpc_config_override='VPC_CONFIG_DEFAULT', enable_network_isolation=None, model_name=None)

Return a Transformer that uses a SageMaker Model based on the training job.

It reuses the SageMaker Session and base job name used by the Estimator.

Parameters

update_profiler(rules=None, system_monitor_interval_millis=None, s3_output_path=None, framework_profile_params=None, disable_framework_metrics=False)

Update training jobs to enable profiling.

This method updates the profiler_config parameter and initiates Debugger built-in rules for profiling.

Parameters

Attention

Updating the profiling configuration for TensorFlow dataloader profiling is currently not available. If you started a TensorFlow training job only with monitoring and want to enable profiling while the training job is running, the dataloader profiling cannot be updated.

eval_metrics: sagemaker.amazon.hyperparameter.Hyperparameter

An algorithm hyperparameter with optional validation.

Implemented as a python descriptor object.

create_model(vpc_config_override='VPC_CONFIG_DEFAULT', **kwargs)

Return a KMeansModel.

It references the latest s3 model data produced by this Estimator.

Parameters

hyperparameters()

Return the SageMaker hyperparameters for training this KMeans Estimator.

class sagemaker.KMeansModel(model_data, role=None, sagemaker_session=None, **kwargs)

Bases: sagemaker.model.Model

Reference KMeans s3 model data.

Calling deploy() creates an Endpoint and returns a Predictor that performs k-means cluster assignment.

Initialization for KMeansModel class.

Parameters

class sagemaker.KMeansPredictor(endpoint_name, sagemaker_session=None, serializer=<sagemaker.amazon.common.RecordSerializer object>, deserializer=<sagemaker.amazon.common.RecordDeserializer object>, component_name=None)

Bases: sagemaker.base_predictor.Predictor

Assigns input vectors to their closest cluster in a KMeans model.

The implementation of predict() in this Predictor requires a numpy ndarray as input. The array should contain the same number of columns as the feature-dimension of the data used to fit the model this Predictor performs inference on.

predict() returns a list of Record objects (assuming the default recordio-protobuf deserializer is used), one for each row in the input ndarray. The nearest cluster is stored in the closest_cluster key of the Record.label field.
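Reading the cluster assignments out of the returned Record objects might look like the sketch below. It assumes the default deserializer and that each label entry exposes its value through a float32_tensor.values sequence, as is usual for Amazon Record protobuf messages; closest_clusters is a hypothetical helper name.

```python
def closest_clusters(records):
    """Hypothetical helper: extract one cluster index per returned Record.

    Assumes each record has a label mapping whose "closest_cluster" entry
    stores its value in float32_tensor.values.
    """
    return [
        int(record.label["closest_cluster"].float32_tensor.values[0])
        for record in records
    ]
```

For example, closest_clusters(predictor.predict(batch)) would give one integer per row of batch, where predictor and batch are a deployed KMeansPredictor and a numpy ndarray supplied by the caller.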

Initialization for KMeansPredictor class.

Parameters