HiveCopyActivity - AWS Data Pipeline (original) (raw)
Runs a Hive query on an EMR cluster. HiveCopyActivity makes it easier to copy data between DynamoDB tables. HiveCopyActivity accepts a HiveQL statement to filter input data from DynamoDB at the column and row level.
Example
The following example shows how to use HiveCopyActivity andDynamoDBExportDataFormat to copy data from oneDynamoDBDataNode to another, while filtering data, based on a time stamp.
This object is invoked within the execution of a schedule interval. Users must specify a schedule reference to another object to set the dependency execution order for this object. Users can satisfy this requirement by explicitly setting a schedule on the object, for example, by specifying "schedule": {"ref": "DefaultSchedule"}. In most cases, it is better to put the schedule reference on the default pipeline object so that all objects inherit that schedule. Or, if the pipeline has a tree of schedules (schedules within the master schedule), users can create a parent object that has a schedule reference. For more information about example optional schedule configurations, see https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-schedule.html
Reference Object, e.g. "schedule":{"ref":"myScheduleId"}
Required Group (One of the following is required)
Description
Slot Type
runsOn
Specify cluster to run on.
Reference Object, e.g. "runsOn":{"ref":"myResourceId"}
workerGroup
The worker group. This is used for routing tasks. If you provide a runsOn value and workerGroup exists,workerGroup is ignored.
String
Optional Fields
Description
Slot Type
attemptStatus
The most recently reported status from the remote activity.
String
attemptTimeout
The timeout for remote work completion. If set, then a remote activity that does not complete within the set time of starting may be retried.
Period
dependsOn
Specifies the dependency on another runnable object.
Reference Object, e.g. "dependsOn":{"ref":"myActivityId"}
failureAndRerunMode
Describes consumer node behavior when dependencies fail or are rerun.
Enumeration
filterSql
A Hive SQL statement fragment that filters a subset of DynamoDB or Amazon S3 data to copy. The filter should only contain predicates and not begin with a WHERE clause, because AWS Data Pipeline adds it automatically.
String
input
The input data source. This must be a S3DataNode orDynamoDBDataNode. If you useDynamoDBNode, specify aDynamoDBExportDataFormat.
Reference Object, e.g. "input":{"ref":"myDataNodeId"}
lateAfterTimeout
The elapsed time after pipeline start within which the object must complete. It is triggered only when the schedule type is not set to ondemand.
Period
maxActiveInstances
The maximum number of concurrent active instances of a component. Re-runs do not count toward the number of active instances.
Integer
maximumRetries
The maximum number attempt retries on failure.
Integer
onFail
An action to run when current object fails.
Reference Object, e.g. "onFail":{"ref":"myActionId"}
onLateAction
Actions that should be triggered if an object has not yet been scheduled or still not completed.
Reference Object, e.g. "onLateAction":{"ref":"myActionId"}
onSuccess
An action to run when current object succeeds.
Reference Object, e.g. "onSuccess":{"ref":"myActionId"}
output
The output data source. If input is S3DataNode, this must beDynamoDBDataNode. Otherwise, this can be S3DataNode orDynamoDBDataNode. If you useDynamoDBNode, specify aDynamoDBExportDataFormat.
Reference Object, e.g. "output":{"ref":"myDataNodeId"}
parent
The parent of the current object from which slots will be inherited.
Reference Object, e.g. "parent":{"ref":"myBaseObjectId"}
pipelineLogUri
The Amazon S3 URI, such as 's3://BucketName/Key/', for uploading logs for the pipeline.
String
postActivityTaskConfig
The post-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments.
Reference Object, e.g. "postActivityTaskConfig":{"ref":"myShellScriptConfigId"}
preActivityTaskConfig
The pre-activity configuration script to be run. This consists of a URI of the shell script in Amazon S3 and a list of arguments.
Reference Object, e.g. "preActivityTaskConfig":{"ref":"myShellScriptConfigId"}
precondition
Optionally defines a precondition. A data node is not marked "READY" until all preconditions have been met.
Reference Object, e.g. "precondition":{"ref":"myPreconditionId"}
reportProgressTimeout
The timeout for remote work successive calls to reportProgress. If set, then remote activities that do not report progress for the specified period may be considered stalled and so retried.
Period
resizeClusterBeforeRunning
Resize the cluster before performing this activity to accommodate DynamoDB data nodes specified as inputs or outputs. NoteIf your activity uses aDynamoDBDataNode as either an input or output data node, and if you set theresizeClusterBeforeRunning toTRUE, AWS Data Pipeline starts usingm3.xlarge instance types. This overwrites your instance type choices withm3.xlarge, which could increase your monthly costs.
Boolean
resizeClusterMaxInstances
A limit on the maximum number of instances that can be requested by the resize algorithm
Integer
retryDelay
The timeout duration between two retry attempts.
Period
scheduleType
Schedule type allows you to specify whether the objects in your pipeline definition should be scheduled at the beginning of interval or end of the interval. Time Series Style Scheduling means instances are scheduled at the end of each interval and Cron Style Scheduling means instances are scheduled at the beginning of each interval. An on-demand schedule allows you to run a pipeline one time per activation. This means you do not have to clone or re-create the pipeline to run it again. If you use an on-demand schedule it must be specified in the default object and must be the only scheduleType specified for objects in the pipeline. To use on-demand pipelines, you simply call the ActivatePipeline operation for each subsequent run. Values are: cron, ondemand, and timeseries.
Enumeration
Runtime Fields
Description
Slot Type
@activeInstances
List of the currently scheduled active instance objects.
Reference Object, e.g. "activeInstances":{"ref":"myRunnableObjectId"}
@actualEndTime
Time when the execution of this object finished.
DateTime
@actualStartTime
Time when the execution of this object started.
DateTime
cancellationReason
The cancellationReason if this object was cancelled.
String
@cascadeFailedOn
Description of the dependency chain the object failed on.
Reference Object, e.g. "cascadeFailedOn":{"ref":"myRunnableObjectId"}
emrStepLog
Amazon EMR step logs available only on EMR activity attempts.
String
errorId
The errorId if this object failed.
String
errorMessage
The errorMessage if this object failed.
String
errorStackTrace
The error stack trace if this object failed.
String
@finishedTime
The time at which this object finished its execution.
DateTime
hadoopJobLog
Hadoop job logs available on attempts for EMR-based activities.
String
@healthStatus
The health status of the object which reflects success or failure of the last object instance that reached a terminated state.
String
@healthStatusFromInstanceId
Id of the last instance object that reached a terminated state.
String
@healthStatusUpdatedTime
Time at which the health status was updated last time.
DateTime
hostname
The host name of client that picked up the task attempt.
String
@lastDeactivatedTime
The time at which this object was last deactivated.
DateTime
@latestCompletedRunTime
Time the latest run for which the execution completed.
DateTime
@latestRunTime
Time the latest run for which the execution was scheduled.
DateTime
@nextRunTime
Time of run to be scheduled next.
DateTime
reportProgressTime
The most recent time that remote activity reported progress.
DateTime
@scheduledEndTime
Schedule end time for object.
DateTime
@scheduledStartTime
Schedule start time for object.
DateTime
@status
The status of this object.
String
@version
Pipeline version the object was created with.
String
@waitingOn
Description of list of dependencies this object is waiting on.
Reference Object, e.g. "waitingOn":{"ref":"myRunnableObjectId"}
System Fields
Description
Slot Type
@error
Error describing the ill-formed object.
String
@pipelineId
Id of the pipeline to which this object belongs to.
String
@sphere
The sphere of an object denotes its place in the lifecycle: Component Objects give rise to Instance Objects which execute Attempt Object.