Inputs — sagemaker 2.247.0 documentation

Amazon SageMaker channel configurations for S3 data sources and file system data sources

class sagemaker.inputs.TrainingInput(s3_data, distribution=None, compression=None, content_type=None, record_wrapping=None, s3_data_type='S3Prefix', instance_groups=None, input_mode=None, attribute_names=None, target_attribute_name=None, shuffle_config=None, hub_access_config=None, model_access_config=None)

Bases: object

Amazon SageMaker channel configurations for S3 data sources.

config

A SageMaker DataSource referencing a SageMaker S3DataSource.

Type:

dict[str, dict]

Create a definition for input data used by a SageMaker training job.

See AWS documentation on the CreateTrainingJob API for more details on the parameters.

Parameters:

s3_data (str or PipelineVariable) – Defines the location of S3 data to train on.
distribution (str or PipelineVariable) – Valid values: 'FullyReplicated', 'ShardedByS3Key' (default: 'FullyReplicated').
compression (str or PipelineVariable) – Valid values: 'Gzip', None (default: None). This is used only in Pipe input mode.
content_type (str or PipelineVariable) – MIME type of the input data (default: None).
record_wrapping (str or PipelineVariable) – Valid values: 'RecordIO' (default: None).
s3_data_type (str or PipelineVariable) – Valid values: 'S3Prefix', 'ManifestFile', 'AugmentedManifestFile'. If 'S3Prefix', s3_data defines a prefix of S3 objects to train on. All objects with S3 keys beginning with s3_data will be used to train. If 'ManifestFile' or 'AugmentedManifestFile', then s3_data defines a single S3 manifest file or augmented manifest file respectively, listing the S3 data to train on. Both the ManifestFile and AugmentedManifestFile formats are described at S3DataSource in the Amazon SageMaker API reference.
instance_groups (list[str] or list[PipelineVariable]) – Optional. A list of instance group names in string format that you specified while configuring a heterogeneous cluster using sagemaker.instance_group.InstanceGroup. S3 data will be sent to all instance groups in the specified list. For instructions on how to use InstanceGroup objects to configure a heterogeneous cluster through the SageMaker generic and framework estimator classes, see Train Using a Heterogeneous Cluster in the Amazon SageMaker developer guide. (default: None)
input_mode (str or PipelineVariable) –
Optional override for this channel’s input mode (default: None). By default, channels will use the input mode defined on sagemaker.estimator.EstimatorBase.input_mode, but they will ignore that setting if this parameter is set.
- None - Amazon SageMaker will use the input mode specified in the Estimator.
- 'File' - Amazon SageMaker copies the training dataset from the S3 location to
  a local directory.
- 'Pipe' - Amazon SageMaker streams data directly from S3 to the container via
  a Unix named pipe.
- 'FastFile' - Amazon SageMaker streams data from S3 on demand instead of
  downloading the entire dataset before training begins.
attribute_names (list[str] or list[PipelineVariable]) – A list of one or more attribute names to use that are found in a specified AugmentedManifestFile.
target_attribute_name (str or PipelineVariable) – The name of the attribute that will be predicted (classified) in a SageMaker AutoML job. It is required if the input is for a SageMaker AutoML job.
shuffle_config (sagemaker.inputs.ShuffleConfig) – If specified, this configuration enables shuffling on this channel. See the SageMaker API documentation for more info: https://docs.aws.amazon.com/sagemaker/latest/dg/API_ShuffleConfig.html
hub_access_config (dict) – Specify the HubAccessConfig of a Model Reference for which a training job is being created.
model_access_config (dict) – For models that require a Model Access Config, specify True or False to indicate whether model terms of use have been accepted. The accept_eula value must be explicitly defined as True in order to accept the end-user license agreement (EULA) that some models require. (default: None)
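
To make the mapping from these parameters to a training channel concrete, here is a minimal sketch in plain Python. It does not import the SDK. The helper name s3_channel and the bucket are hypothetical, and the field names are taken from the CreateTrainingJob API (Channel and S3DataSource); it is assumed, not confirmed here, that TrainingInput stores a dictionary of this general shape in its config attribute.

```python
from typing import Any, Dict, List, Optional


def s3_channel(
    s3_data: str,
    distribution: str = "FullyReplicated",
    s3_data_type: str = "S3Prefix",
    content_type: Optional[str] = None,
    compression: Optional[str] = None,
    input_mode: Optional[str] = None,
    attribute_names: Optional[List[str]] = None,
    shuffle_seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a CreateTrainingJob-style channel dict."""
    if s3_data_type not in ("S3Prefix", "ManifestFile", "AugmentedManifestFile"):
        raise ValueError(f"unsupported s3_data_type: {s3_data_type!r}")
    if input_mode not in (None, "File", "Pipe", "FastFile"):
        raise ValueError(f"unsupported input_mode: {input_mode!r}")

    s3_source: Dict[str, Any] = {
        "S3DataType": s3_data_type,
        "S3Uri": s3_data,
        "S3DataDistributionType": distribution,
    }
    if attribute_names is not None:
        s3_source["AttributeNames"] = attribute_names

    channel: Dict[str, Any] = {"DataSource": {"S3DataSource": s3_source}}
    # Optional fields are emitted only when set, so unset values fall back
    # to the estimator or service defaults.
    if content_type is not None:
        channel["ContentType"] = content_type
    if compression is not None:
        channel["CompressionType"] = compression
    if input_mode is not None:
        channel["InputMode"] = input_mode
    if shuffle_seed is not None:
        channel["ShuffleConfig"] = {"Seed": shuffle_seed}
    return channel


train_channel = s3_channel(
    "s3://example-bucket/train/",  # placeholder bucket and prefix
    content_type="text/csv",
    input_mode="FastFile",
)
```

Leaving input_mode unset omits the InputMode key entirely, which corresponds to the None case above where the estimator's input mode applies.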

add_hub_access_config(hub_access_config=None)

Add Hub Access Config to the channel’s configuration.

Parameters:

hub_access_config (dict) – The HubAccessConfig to be added to the channel's configuration.

add_model_access_config(model_access_config=None)

Add Model Access Config to the channel’s configuration.

Parameters:

model_access_config (dict) – Whether model terms of use have been accepted.

class sagemaker.inputs.ShuffleConfig(seed)

Bases: object

For configuring channel shuffling using a seed.

For more detail, see the AWS documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/API_ShuffleConfig.html

Create a ShuffleConfig.

Parameters:

seed (long) – the long value used to seed the shuffled sequence.
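
As an illustration of why a seed is useful, the sketch below shows that a seeded shuffle is reproducible: the same seed gives the same order on every run. This is only an analogy. Python's random module stands in for the service, the helper shuffled_keys and the key names are hypothetical, and nothing here describes how SageMaker shuffles a channel internally.

```python
import random
from typing import List


def shuffled_keys(keys: List[str], seed: int) -> List[str]:
    """Return a shuffled copy of keys, using a generator seeded with seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    order = list(keys)
    rng.shuffle(order)
    return order


keys = [f"train/part-{i:03d}.csv" for i in range(8)]  # placeholder object keys
first = shuffled_keys(keys, seed=1234)
second = shuffled_keys(keys, seed=1234)
```

Both calls return the same permutation of the original keys, which is the property that makes a seeded run repeatable.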

class sagemaker.inputs.CreateModelInput(instance_type=None, accelerator_type=None)

Bases: object

A class containing parameters which can be used to create a SageMaker Model

Parameters:

instance_type (str) – The type of EC2 instance that will be used for model deployment.
accelerator_type (str) – elastic inference accelerator type.

Method generated by attrs for class CreateModelInput.

instance_type: str

accelerator_type: str

class sagemaker.inputs.TransformInput(data, data_type='S3Prefix', content_type=None, compression_type=None, split_type=None, input_filter=None, output_filter=None, join_source=None, model_client_config=None, batch_data_capture_config=None)

Bases: object

Creates a class containing parameters for configuring input data for a batch transform job.

It can be used when calling sagemaker.transformer.Transformer.transform()

Parameters:

data (str) – The S3 location of the input data that the model can consume.
data_type (str) – The data type for a batch transform job. (default: 'S3Prefix')
content_type (str) – The Multipurpose Internet Mail Extensions (MIME) type of the data. (default: None)
compression_type (str) – If your transform data is compressed, specify the compression type. Valid values: 'Gzip', None (default: None)
split_type (str) – The method to use to split the transform job’s data files into smaller batches. Valid values: 'Line', 'RecordIO', 'TFRecord', None (default: None)
input_filter (str) – A JSONPath expression for selecting a portion of the input data to pass to the algorithm. For example, you can use this parameter to exclude fields, such as an ID column, from the input. If you want SageMaker to pass the entire input dataset to the algorithm, accept the default value $. For more information on batch transform data processing, input, join, and output, see Associate Prediction Results with Input Records in the Amazon SageMaker developer guide. Example value: $. For more information about valid values for this parameter, see JSONPath Operators in the Amazon SageMaker developer guide. (default: $)
output_filter (str) –
A JSONPath expression for selecting a portion of the joined dataset to save in the output file for a batch transform job. If you want SageMaker to store the entire input dataset in the output file, leave the default value, $. If you specify indexes that aren’t within the dimension size of the joined dataset, you get an error. Example value: $. For more information about valid values for this parameter, see JSONPath Operators in the Amazon SageMaker developer guide. (default: $)
join_source (str) – Specifies the source of the data to join with the transformed data. The default value is None, which specifies not to join the input with the transformed data. If you want the batch transform job to join the original input data with the transformed data, set to Input. Valid values: None, Input (default: None)
model_client_config (dict) –
Configures the timeout and maximum number of retries for processing a transform job invocation.
- 'InvocationsTimeoutInSeconds' (int) - The timeout value in seconds for an invocation request. The default value is 600.
- 'InvocationsMaxRetries' (int) - The maximum number of retries when invocation requests are failing.
  (default: {600,3})
batch_data_capture_config (dict) – The dict is an object of BatchDataCaptureConfig and specifies configuration related to batch transform job for use with Amazon SageMaker Model Monitoring. For more information, see Capture data from batch transform job in the Amazon SageMaker developer guide. (default: None)
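
The interaction of input_filter, join_source, and output_filter is easier to follow on a single record. The sketch below emulates one CSV row passing through a job configured with input_filter '$[1:]' (drop the ID column), join_source 'Input', and output_filter '$[0,-1]' (keep the ID and the prediction). Plain list slicing stands in for JSONPath evaluation; the function names and the toy model are hypothetical, and this is not how SageMaker evaluates the expressions internally.

```python
from typing import Callable, List


def emulate_batch_record(
    record: List[str],
    predict: Callable[[List[str]], str],
) -> List[str]:
    """Emulate input_filter='$[1:]', join_source='Input', output_filter='$[0,-1]'."""
    model_input = record[1:]           # input_filter "$[1:]": drop the ID column
    prediction = predict(model_input)  # the model sees only the filtered features
    joined = record + [prediction]     # join_source "Input": original row + prediction
    return [joined[0], joined[-1]]     # output_filter "$[0,-1]": ID and prediction


def fake_model(features: List[str]) -> str:
    """Hypothetical stand-in for a deployed model."""
    return "1" if float(features[0]) > 0.5 else "0"


row = ["id-42", "0.9", "0.1"]  # made-up record: ID followed by two features
result = emulate_batch_record(row, fake_model)
```

The ID never reaches the model, yet it survives into the output because the join happens against the original, unfiltered record.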

Method generated by attrs for class TransformInput.

data: str

data_type: str

content_type: str

compression_type: str

split_type: str

input_filter: str

output_filter: str

join_source: str

model_client_config: dict

batch_data_capture_config: dict
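
The following sketch groups TransformInput-style arguments under the request sections they belong to in the CreateTransformJob API (TransformInput, DataProcessing, ModelClientConfig). It is plain Python with no SDK import; the helper name transform_request_fields and the S3 location are hypothetical, and the defaults of 600 seconds and 3 retries are the ones stated for model_client_config above.

```python
from typing import Any, Dict, Optional


def transform_request_fields(
    data: str,
    data_type: str = "S3Prefix",
    content_type: Optional[str] = None,
    split_type: Optional[str] = None,
    input_filter: Optional[str] = None,
    output_filter: Optional[str] = None,
    join_source: Optional[str] = None,
    invocations_timeout_in_seconds: int = 600,
    invocations_max_retries: int = 3,
) -> Dict[str, Any]:
    """Hypothetical helper: group transform arguments by API request section."""
    if split_type not in (None, "Line", "RecordIO", "TFRecord"):
        raise ValueError(f"unsupported split_type: {split_type!r}")
    if join_source not in (None, "Input"):
        raise ValueError(f"unsupported join_source: {join_source!r}")

    transform_input: Dict[str, Any] = {
        "DataSource": {"S3DataSource": {"S3DataType": data_type, "S3Uri": data}}
    }
    if content_type is not None:
        transform_input["ContentType"] = content_type
    if split_type is not None:
        transform_input["SplitType"] = split_type

    # Include only the data-processing keys that were actually set.
    data_processing = {
        key: value
        for key, value in (
            ("InputFilter", input_filter),
            ("OutputFilter", output_filter),
            ("JoinSource", join_source),
        )
        if value is not None
    }
    return {
        "TransformInput": transform_input,
        "DataProcessing": data_processing,
        "ModelClientConfig": {
            "InvocationsTimeoutInSeconds": invocations_timeout_in_seconds,
            "InvocationsMaxRetries": invocations_max_retries,
        },
    }


fields = transform_request_fields(
    "s3://example-bucket/batch-input/",  # placeholder location
    content_type="text/csv",
    split_type="Line",
    input_filter="$[1:]",
    join_source="Input",
)
```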

class sagemaker.inputs.FileSystemInput(file_system_id, file_system_type, directory_path, file_system_access_mode='ro', content_type=None)

Bases: object

Amazon SageMaker channel configurations for file system data sources.

config

A SageMaker File System DataSource.

Type:

dict[str, dict]

Create a new file system input used by a SageMaker training job.

Parameters:

file_system_id (str) – An Amazon file system ID starting with 'fs-'.
file_system_type (str) – The type of file system used for the input. Valid values: ‘EFS’, ‘FSxLustre’.
directory_path (str) – Absolute or normalized path to the root directory (mount point) in the file system. Reference: https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html and https://docs.aws.amazon.com/fsx/latest/LustreGuide/mount-fs-auto-mount-onreboot.html
file_system_access_mode (str) – Permissions for read and write. Valid values: ‘ro’ or ‘rw’. Defaults to ‘ro’.
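
The constraints listed above (the 'fs-' prefix, the two file system types, and the two access modes) can be checked before a job is submitted. The sketch below is a plain-Python illustration with no SDK import; the helper name file_system_channel, the file system ID, and the directory are hypothetical, and the field names follow FileSystemDataSource in the CreateTrainingJob API. Whether the SDK itself performs the same checks is not stated here.

```python
from typing import Dict


def file_system_channel(
    file_system_id: str,
    file_system_type: str,
    directory_path: str,
    file_system_access_mode: str = "ro",
) -> Dict[str, dict]:
    """Hypothetical helper: validate arguments and build a file system data source."""
    if not file_system_id.startswith("fs-"):
        raise ValueError("file_system_id must start with 'fs-'")
    if file_system_type not in ("EFS", "FSxLustre"):
        raise ValueError(f"unsupported file_system_type: {file_system_type!r}")
    if file_system_access_mode not in ("ro", "rw"):
        raise ValueError(f"unsupported access mode: {file_system_access_mode!r}")
    return {
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "DirectoryPath": directory_path,
                "FileSystemAccessMode": file_system_access_mode,
            }
        }
    }


efs_channel = file_system_channel(
    "fs-0123456789abcdef0",  # made-up file system ID
    "EFS",
    "/training-data",        # made-up directory
)
```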

class sagemaker.inputs.BatchDataCaptureConfig(destination_s3_uri, kms_key_id=None, generate_inference_id=None)

Bases: object

Configuration object passed in when creating a batch transform job.

Specifies configuration related to batch transform job data capture for use with Amazon SageMaker Model Monitoring.

Create a new BatchDataCaptureConfig.

Parameters:

destination_s3_uri (str) – S3 Location to store the captured data
kms_key_id (str) – The KMS key to use when writing to S3. KmsKeyId can be an ID of a KMS key, ARN of a KMS key, alias of a KMS key, or ARN of a KMS key alias. The KmsKeyId is applied to all outputs. (default: None)
generate_inference_id (bool) – Flag to generate an inference id (default: None)
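
A minimal sketch of how these three parameters line up with the data capture section of a CreateTransformJob request. It is plain Python without the SDK; the helper name batch_data_capture_request and the S3 URI are hypothetical, while the field names DestinationS3Uri, KmsKeyId, and GenerateInferenceId come from the API's BatchDataCaptureConfig type.

```python
from typing import Any, Dict, Optional


def batch_data_capture_request(
    destination_s3_uri: str,
    kms_key_id: Optional[str] = None,
    generate_inference_id: Optional[bool] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a BatchDataCaptureConfig-style request dict."""
    if not destination_s3_uri.startswith("s3://"):
        raise ValueError("destination_s3_uri must be an S3 URI")
    request: Dict[str, Any] = {"DestinationS3Uri": destination_s3_uri}
    # Optional settings are left out when unset so the service default applies.
    if kms_key_id is not None:
        request["KmsKeyId"] = kms_key_id
    if generate_inference_id is not None:
        request["GenerateInferenceId"] = generate_inference_id
    return request


capture = batch_data_capture_request(
    "s3://example-bucket/data-capture/",  # placeholder destination
    generate_inference_id=True,
)
```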

The input configs for DatasetDefinition.

DatasetDefinition supports data sources, such as S3, that can be queried via Athena and Redshift. It provides a mechanism for customers to generate datasets from Athena or Redshift queries and to retrieve the data using Processing jobs, so that it is available to other downstream processes.

class sagemaker.dataset_definition.inputs.RedshiftDatasetDefinition(cluster_id=None, database=None, db_user=None, query_string=None, cluster_role_arn=None, output_s3_uri=None, kms_key_id=None, output_format=None, output_compression=None)

Bases: ApiObject

DatasetDefinition for Redshift.

With this input, SQL queries will be executed using Redshift to generate datasets to S3.

Initialize RedshiftDatasetDefinition.

Parameters:

cluster_id (str, default=None) – The Redshift cluster Identifier.
database (str, default=None) – The name of the Redshift database used in Redshift query execution.
db_user (str, default=None) – The database user name used in Redshift query execution.
query_string (str, default=None) – The SQL query statements to be executed.
cluster_role_arn (str, default=None) – The IAM role attached to your Redshift cluster that Amazon SageMaker uses to generate datasets.
output_s3_uri (str, default=None) – The location in Amazon S3 where the Redshift query results are stored.
kms_key_id (str, default=None) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt data from a Redshift execution.
output_format (str, default=None) – The data storage format for Redshift query results. Valid options are “PARQUET”, “CSV”
output_compression (str, default=None) – The compression used for Redshift query results. Valid options are “None”, “GZIP”, “SNAPPY”, “ZSTD”, “BZIP2”

class sagemaker.dataset_definition.inputs.AthenaDatasetDefinition(catalog=None, database=None, query_string=None, output_s3_uri=None, work_group=None, kms_key_id=None, output_format=None, output_compression=None)

Bases: ApiObject

DatasetDefinition for Athena.

With this input, SQL queries will be executed using Athena to generate datasets to S3.

Initialize AthenaDatasetDefinition.

Parameters:

catalog (str, default=None) – The name of the data catalog used in Athena query execution.
database (str, default=None) – The name of the database used in the Athena query execution.
query_string (str, default=None) – The SQL query statements to be executed.
output_s3_uri (str, default=None) – The location in Amazon S3 where Athena query results are stored.
work_group (str, default=None) – The name of the workgroup in which the Athena query is being started.
kms_key_id (str, default=None) – The AWS Key Management Service (AWS KMS) key that Amazon SageMaker uses to encrypt data generated from an Athena query execution.
output_format (str, default=None) – The data storage format for Athena query results. Valid options are “PARQUET”, “ORC”, “AVRO”, “JSON”, “TEXTFILE”
output_compression (str, default=None) – The compression used for Athena query results. Valid options are “GZIP”, “SNAPPY”, “ZLIB”

class sagemaker.dataset_definition.inputs.DatasetDefinition(data_distribution_type='ShardedByS3Key', input_mode='File', local_path=None, redshift_dataset_definition=None, athena_dataset_definition=None)

Bases: ApiObject

DatasetDefinition input.

Initialize DatasetDefinition.

Parameters:

data_distribution_type (str, default="ShardedByS3Key") – Whether the generated dataset is FullyReplicated or ShardedByS3Key (default).
input_mode (str, default="File") – Whether to use File or Pipe input mode. In File (default) mode, Amazon SageMaker copies the data from the input source onto the local Amazon Elastic Block Store (Amazon EBS) volumes before starting your training algorithm. This is the most commonly used input mode. In Pipe mode, Amazon SageMaker streams input data from the source directly to your algorithm without using the EBS volume.
local_path (str, default=None) – The local path where you want Amazon SageMaker to download the Dataset Definition inputs to run a processing job. LocalPath is an absolute path to the input data. This is a required parameter when AppManaged is False (default).
redshift_dataset_definition (RedshiftDatasetDefinition, default=None) – Configuration for Redshift Dataset Definition input.
athena_dataset_definition (AthenaDatasetDefinition, default=None) – Configuration for Athena Dataset Definition input.
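
To show how an Athena definition nests inside a DatasetDefinition, here is a plain-Python sketch that does not import the SDK. The helper names, query, bucket, and local path are hypothetical; the allowed formats and compressions are the ones listed for AthenaDatasetDefinition above, and the field names follow the DatasetDefinition type in the CreateProcessingJob API.

```python
from typing import Any, Dict, Optional

ATHENA_FORMATS = ("PARQUET", "ORC", "AVRO", "JSON", "TEXTFILE")
ATHENA_COMPRESSIONS = ("GZIP", "SNAPPY", "ZLIB")


def athena_definition(
    catalog: str,
    database: str,
    query_string: str,
    output_s3_uri: str,
    output_format: str,
    work_group: Optional[str] = None,
    output_compression: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate and build an Athena dataset definition."""
    if output_format not in ATHENA_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    if output_compression is not None and output_compression not in ATHENA_COMPRESSIONS:
        raise ValueError(f"unsupported output_compression: {output_compression!r}")
    definition: Dict[str, Any] = {
        "Catalog": catalog,
        "Database": database,
        "QueryString": query_string,
        "OutputS3Uri": output_s3_uri,
        "OutputFormat": output_format,
    }
    if work_group is not None:
        definition["WorkGroup"] = work_group
    if output_compression is not None:
        definition["OutputCompression"] = output_compression
    return definition


def dataset_definition(
    local_path: str,
    athena: Optional[Dict[str, Any]] = None,
    data_distribution_type: str = "ShardedByS3Key",
    input_mode: str = "File",
) -> Dict[str, Any]:
    """Hypothetical helper: wrap a source definition with the shared settings."""
    request: Dict[str, Any] = {
        "DataDistributionType": data_distribution_type,
        "InputMode": input_mode,
        "LocalPath": local_path,
    }
    if athena is not None:
        request["AthenaDatasetDefinition"] = athena
    return request


definition = dataset_definition(
    "/opt/ml/processing/input/athena",  # placeholder absolute path
    athena=athena_definition(
        catalog="AwsDataCatalog",
        database="example_db",             # made-up database
        query_string="SELECT * FROM events",  # made-up query
        output_s3_uri="s3://example-bucket/athena-results/",
        output_format="PARQUET",
        output_compression="SNAPPY",
    ),
)
```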

class sagemaker.dataset_definition.inputs.S3Input(s3_uri=None, local_path=None, s3_data_type='S3Prefix', s3_input_mode='File', s3_data_distribution_type='FullyReplicated', s3_compression_type=None)

Bases: ApiObject

Metadata of data objects stored in S3.

Two options are provided: specifying an S3 prefix, or explicitly listing the files in a manifest file and referencing the manifest file’s S3 path. Note: Strong consistency is not guaranteed if S3Prefix is provided here. S3 list operations are not strongly consistent. Use ManifestFile if strong consistency is required.

Initialize S3Input.

Parameters:

s3_uri (str, default=None) – the path to a specific S3 object or an S3 prefix
local_path (str, default=None) – the path to a local directory. If not provided, skips data download by SageMaker platform.
s3_data_type (str, default="S3Prefix") – Valid options are “ManifestFile” or “S3Prefix”.
s3_input_mode (str, default="File") – Valid options are “Pipe”, “File” or “FastFile”.
s3_data_distribution_type (str, default="FullyReplicated") – Valid options are “FullyReplicated” or “ShardedByS3Key”.
s3_compression_type (str, default=None) – Valid options are “None” or “Gzip”.
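
For the ManifestFile option, the sketch below writes out a manifest as JSON: a prefix entry followed by object keys relative to that prefix. It assumes the manifest layout described for S3DataSource in the Amazon SageMaker API reference also applies to this input; the helper name build_manifest, the bucket, and the file names are hypothetical.

```python
import json
from typing import List


def build_manifest(prefix: str, relative_keys: List[str]) -> str:
    """Serialize a manifest: one prefix entry, then keys relative to it."""
    if not prefix.startswith("s3://"):
        raise ValueError("prefix must be an S3 URI")
    entries: list = [{"prefix": prefix}]
    entries.extend(relative_keys)
    return json.dumps(entries, indent=2)


manifest_text = build_manifest(
    "s3://example-bucket/processing/",  # placeholder bucket and prefix
    ["part-0000.csv", "part-0001.csv"],
)
parsed = json.loads(manifest_text)
```

The resulting text would be uploaded to S3 and its URI passed as s3_uri together with s3_data_type="ManifestFile", so the job reads exactly the listed objects instead of relying on a prefix listing.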