Troubleshoot Dataflow autoscaling

This page shows you how to resolve issues with the Dataflow autoscaling features and provides information about how to manage autoscaling.

Job doesn't scale up or down

This section provides information about scenarios that might prevent workers from scaling up or down.

Streaming job doesn't scale up

When your streaming pipeline has a backlog, the workers don't scale up.

This issue occurs when the backlog lasts less than a few minutes or when parallelism is limited.

Sometimes, backlog is elevated but parallelism is low. In that case, Dataflow doesn't scale up, because the work can't be distributed across more workers, so adding more workers won't help with processing. For more information, see Streaming autoscaling.

Batch and streaming jobs don't scale up

Your batch or streaming job runs as expected, but when more workers are needed, the job doesn't scale up.

This issue might occur for one of the following reasons:

Streaming job doesn't scale down

When your streaming job has a low backlog and low CPU utilization, the workers don't scale down. This issue can occur for various reasons.

Scaling up stops

Your batch or streaming job starts scaling up, but the workers stop scaling up even though a backlog remains.

This issue occurs when quota limits are reached.

In addition, if the region you choose is out of a resource, you can't create new resources of that type, even if you have remaining quota in your region or project. For example, you might still have quota to create external IP addresses in us-central1, but that region might not have available IP addresses. For more information, see Quotas and resource availability.

To resolve this issue, request a quota increase or run the job in a different region.

The worker utilization hint has no effect

You set the worker utilization hint but the autoscaling behavior does not change.

To understand this issue, go to the Worker CPU utilization chart and check whether the worker utilization hint is actively used. If the hint is being used, the chart shows CPU utilization hint (actively used by autoscaler). Otherwise, it shows CPU utilization hint (not actively used by autoscaler).

The utilization hint is only one factor that affects autoscaling. The following table lists some reasons why the autoscaler might not actively use the hint:

Observed scaling behavior: No change
  Causes:
    You have reached the minimum or maximum number of workers.
    The number of workers is limited by the number of keys processed in parallel.
    Jobs are throttled by external RPCs.
    The downscaling adjustment is too small or Dataflow is dampening downscaling. For more information, see Streaming autoscaling heuristics.
  Metrics to check:
    job/aggregated_worker_utilization
    job/max_worker_instances_limit
    job/min_worker_instances_limit
    job/processing_parallelism_keys
    RPC errors reported in worker error logs
    Crashes reported in worker logs

Observed scaling behavior: Scale up
  Causes:
    High backlog or latency goal is overriding the hint.
    The minimum number of workers was updated to a higher value than the current number of workers.
  Metrics to check:
    job/aggregated_worker_utilization
    job/backlog_bytes
    job/backlog_elements
    job/estimated_timer_backlog_processing_time
    job/min_worker_instances_limit

Observed scaling behavior: Scale down
  Causes:
    The maximum number of workers was updated to a lower value than the current number of workers.
  Metrics to check:
    job/aggregated_worker_utilization
    job/max_worker_instances_limit

For more information, see Streaming autoscaling heuristics.
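The "No change" causes above can be read as a checklist. The following Python sketch is one illustrative way to walk through it. The function name and its parameters are hypothetical, the values are assumed to be read by hand from the metrics named above (job/min_worker_instances_limit, job/max_worker_instances_limit, job/processing_parallelism_keys), and the comparison against the key count is a simplification. It does not call any Dataflow API and is not how the autoscaler itself decides.

```python
def explain_no_change(current_workers, min_limit, max_limit, parallelism_keys):
    """Return possible reasons the utilization hint is not changing the worker count.

    Hypothetical helper: every argument is a plain integer that you copy
    from the job's metrics. Nothing here queries Dataflow.
    """
    reasons = []
    if current_workers <= min_limit:
        reasons.append("at minimum number of workers")
    if current_workers >= max_limit:
        reasons.append("at maximum number of workers")
    # Assumption: adding workers beyond the number of parallel keys cannot help.
    if current_workers >= parallelism_keys:
        reasons.append("limited by number of keys processed in parallel")
    if not reasons:
        # The remaining causes cannot be derived from these three numbers.
        reasons.append("check worker logs for RPC throttling or dampened downscaling")
    return reasons


print(explain_no_change(current_workers=10, min_limit=1, max_limit=10, parallelism_keys=500))
```

If none of the numeric checks match, the remaining causes (external RPC throttling, dampened downscaling) have to be investigated in the worker logs.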

Gaps in autoscaling metrics

There are short, temporary gaps in autoscaling metrics.

This issue can occur if backend tasks are restarted. These gaps in the metrics don't indicate a problem with autoscaling or the health of your streaming job.

CPU is unevenly distributed

When the job is autoscaling, CPU utilization is unevenly distributed among workers. Some workers have higher CPU utilization, system latency, or data freshness than others.

This issue can occur if your data contains a hot key. A hot key is a key with enough elements to negatively affect pipeline performance. Each key must be processed by a single worker, so the work can't be split between workers.

For more information, see the hot key error guidance.
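As a rough way to check a sample of your data for a hot key, you can count elements per key and look for a key that holds a large share of the total. The sketch below assumes you have already extracted a sample of keys into a Python list. The function name and the 50% threshold are hypothetical choices for illustration, not values that Dataflow uses.

```python
from collections import Counter


def find_hot_keys(keys, share_threshold=0.5):
    """Return {key: share} for keys whose share of all elements meets the threshold.

    `keys` is any iterable of hashable keys sampled from the pipeline input.
    The threshold is an illustrative assumption; tune it for your workload.
    """
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        key: count / total
        for key, count in counts.items()
        if count / total >= share_threshold
    }


sample = ["user-a"] * 8 + ["user-b", "user-c"]
print(find_hot_keys(sample))  # prints {'user-a': 0.8}
```

A key that dominates the sample this way is a candidate for the single-worker bottleneck described above, because all of its elements land on one worker.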

The work item requesting state read is no longer valid on the backend

During communication between worker VM instances and Streaming Engine tasks in a streaming pipeline, the following error occurs:

The work item requesting state read is no longer valid on the backend.
The work has already completed or will be retried.
This is expected during autoscaling events.

During autoscaling, worker VM instances communicate with multiple Streaming Engine tasks, and each task serves multiple worker VM instances. Item keys are used to distribute the work. Each task and worker VM instance have a collection of key ranges, and the distribution of these ranges can change dynamically. For example, during autoscaling, job resizing can cause the key range distribution to change. When a key range changes, this error can occur. The error is expected, and unless you see a correlation between these messages and an underperforming pipeline, you can ignore it.

Insufficient Streaming Engine resources

If Streaming Engine cannot allocate the minimum number of workers that you request, the following error is returned:

Streaming Engine does not currently have enough resources available to fulfill
the request.

To resolve this issue, try setting a smaller minimum number of workers. See Set the autoscaling range.

Scaling range for streaming autoscaling pipelines

This section provides details about the scaling range for streaming autoscaling pipelines.

Java

For streaming autoscaling jobs that don't use Streaming Engine, the Dataflow service allocates between 1 and 15 Persistent Disks to each worker. This allocation means that the minimum number of workers used for a streaming autoscaling pipeline is N/15, where N is the value of --maxNumWorkers.

For streaming autoscaling jobs that use Streaming Engine, the minimum number of workers is 1.

Dataflow balances the number of Persistent Disks between the workers. For example, if your pipeline needs three or four workers in steady state, you could set --maxNumWorkers=15. The pipeline automatically scales between 1 and 15 workers, using 1, 2, 3, 4, 5, 8, or 15 workers, which correspond to 15, 8, 5, 4, 3, 2, or 1 Persistent Disks per worker, respectively.

--maxNumWorkers can be 1000 at most.

Python

For streaming autoscaling jobs that don't use Streaming Engine, the Dataflow service allocates between 1 and 15 Persistent Disks to each worker. This allocation means that the minimum number of workers used for a streaming autoscaling pipeline is N/15, where N is the value of --max_num_workers.

For streaming autoscaling jobs that use Streaming Engine, the minimum number of workers is 1.

Dataflow balances the number of Persistent Disks between the workers. For example, if your pipeline needs three or four workers in steady state, you could set --max_num_workers=15. The pipeline automatically scales between 1 and 15 workers, using 1, 2, 3, 4, 5, 8, or 15 workers, which correspond to 15, 8, 5, 4, 3, 2, or 1 Persistent Disks per worker, respectively.

--max_num_workers can be 1000 at most.
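The worker counts in the example above can be reproduced with a little arithmetic. The following sketch assumes that the job holds as many Persistent Disks as the value of --max_num_workers, that N/15 is rounded up to a whole worker, and that each worker carries at most ceil(total disks / workers) disks. These assumptions are only a model that matches the numbers in the example, not a description of the service's actual balancing logic, and the function names are hypothetical.

```python
import math

MAX_DISKS_PER_WORKER = 15  # upper bound stated in the text


def min_workers(max_num_workers):
    """Minimum worker count, N/15 rounded up (rounding is an assumption)."""
    return math.ceil(max_num_workers / MAX_DISKS_PER_WORKER)


def scaling_steps(max_num_workers):
    """Map each distinct worker count to its Persistent Disks per worker.

    Assumption: total disks == max_num_workers, spread as evenly as possible.
    Only the smallest worker count for each disks-per-worker value is kept.
    """
    steps = {}
    seen_disks = set()
    for workers in range(min_workers(max_num_workers), max_num_workers + 1):
        disks = math.ceil(max_num_workers / workers)
        if disks not in seen_disks:
            seen_disks.add(disks)
            steps[workers] = disks
    return steps


print(scaling_steps(15))  # prints {1: 15, 2: 8, 3: 5, 4: 4, 5: 3, 8: 2, 15: 1}
print(min_workers(100))   # prints 7
```

Under this model, a job with --max_num_workers=100 that doesn't use Streaming Engine would not drop below 7 workers.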

Go

For streaming autoscaling jobs that don't use Streaming Engine, the Dataflow service allocates between 1 and 15 Persistent Disks to each worker. This allocation means that the minimum number of workers used for a streaming autoscaling pipeline is N/15, where N is the value of --max_num_workers.

For streaming autoscaling jobs that use Streaming Engine, the minimum number of workers is 1.

Dataflow balances the number of Persistent Disks between the workers. For example, if your pipeline needs three or four workers in steady state, you could set --max_num_workers=15. The pipeline automatically scales between 1 and 15 workers, using 1, 2, 3, 4, 5, 8, or 15 workers, which correspond to 15, 8, 5, 4, 3, 2, or 1 Persistent Disks per worker, respectively.

--max_num_workers can be 1000 at most.

Maximum number of workers streaming autoscaling might use

Java

Dataflow operates within the limits of the Compute Engine instance count quota of your project or maxNumWorkers, whichever is lower.

Python

Dataflow operates within the limits of the Compute Engine instance count quota of your project or max_num_workers, whichever is lower.

Go

Dataflow operates within the limits of the Compute Engine instance count quota of your project or max_num_workers, whichever is lower.

Limit autoscaling to reduce the impact on billing

If you don't want autoscaling to increase your bill, you can limit the maximum number of workers that your streaming job can use.

Java

By specifying --maxNumWorkers, you limit the scaling range used to process your job.

Python

By specifying --max_num_workers, you limit the scaling range used to process your job.
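As a minimal sketch of how the cap might be applied in a Python launcher script, the helper below builds the flag and rejects values outside the 1 to 1000 range mentioned on this page. The helper name is hypothetical, and the script only assembles a list of command-line arguments. Passing that list to your pipeline (for example, through Apache Beam's PipelineOptions) is left to your own launch code.

```python
MAX_ALLOWED_WORKERS = 1000  # upper bound for --max_num_workers stated on this page


def worker_cap_args(max_num_workers):
    """Return the flag that caps the scaling range (hypothetical helper)."""
    if not 1 <= max_num_workers <= MAX_ALLOWED_WORKERS:
        raise ValueError(
            f"max_num_workers must be between 1 and {MAX_ALLOWED_WORKERS}, "
            f"got {max_num_workers}"
        )
    return [f"--max_num_workers={max_num_workers}"]


# Example argument list for a streaming job capped at 10 workers.
pipeline_args = ["--runner=DataflowRunner", "--streaming"] + worker_cap_args(10)
print(pipeline_args)
```

Validating the value before launch surfaces an out-of-range cap early instead of at job submission.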

Go

By specifying --max_num_workers, you limit the scaling range used to process your job.

Change the scaling range

For information about changing the scaling range on a streaming pipeline, see Set the autoscaling range.

Turn off autoscaling on streaming pipelines

To turn off autoscaling on a streaming pipeline, follow these steps.

Use a fixed number of workers

For streaming jobs that don't use Streaming Engine, the default behavior is to use a fixed number of workers. To use streaming autoscaling with these pipelines, you must explicitly opt in.
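The two modes can be sketched as argument lists for a Python pipeline. The flag names used here (--autoscaling_algorithm with the values NONE and THROUGHPUT_BASED, and --num_workers) are not stated on this page. They are assumed from the Apache Beam Python SDK pipeline options, so confirm them against the pipeline options reference for your SDK version. The helper names are hypothetical.

```python
def fixed_workers_args(num_workers):
    """Arguments for a fixed worker count with autoscaling off.

    Assumption: --autoscaling_algorithm=NONE disables autoscaling and
    --num_workers sets the fixed worker count.
    """
    return ["--autoscaling_algorithm=NONE", f"--num_workers={num_workers}"]


def opt_in_autoscaling_args(max_num_workers):
    """Arguments that opt a non-Streaming Engine job in to autoscaling.

    Assumption: --autoscaling_algorithm=THROUGHPUT_BASED enables autoscaling,
    bounded above by --max_num_workers.
    """
    return [
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        f"--max_num_workers={max_num_workers}",
    ]


print(fixed_workers_args(5))
print(opt_in_autoscaling_args(15))
```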