Suggested metrics to autoscale Temporal workers on

What are the general metrics you recommend autoscaling workers on?

I would use a combination of SDK and server metrics.

SDK:
A good starting point is the worker tuning guide in the docs, plus the following metrics:

  1. worker_task_slots_available: Gauge metric, reports how many task slots are currently available for the worker to pick up tasks. It should stay > 0; if it drops to 0, your workers cannot keep up with the incoming task load (see the alert-rule sketch after this list).
    Sample Prometheus query:
    avg_over_time(temporal_worker_task_slots_available{namespace="default",worker_type="WorkflowWorker"}[10m])
    (or for current value)
    temporal_worker_task_slots_available{namespace="default", worker_type="WorkflowWorker", task_queue="<your_tq_name>"}
    Note that worker_type can be WorkflowWorker, ActivityWorker, or LocalActivityWorker

  2. workflow_task_schedule_to_start_latency: Histogram metric, the latency from when the server places a workflow task on the task queue to when your worker picks it up for processing.
    Sample Prometheus query:
    histogram_quantile(0.95, sum by (namespace, task_queue, le) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m])))
    You will have to define your own alert threshold here based on your performance requirements; you want this latency to be as small as possible (see the latency alert sketch after this list).

  3. activity_schedule_to_start_latency: Histogram metric, the latency from when the server places an activity task on the task queue to when your activity worker picks it up (note this can be the same worker that processes your workflows if they poll the same task queue).
    Sample Prometheus query:
    histogram_quantile(0.95, sum by (namespace, task_queue, le) (rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m])))
    Same here: you want this to be as low as possible, with the alert threshold set per your performance requirements (the same latency alert sketch after this list applies).

  4. sticky_cache_size: Gauge metric, reports the size of your worker's in-memory workflow execution cache. If an execution is in the cache, your worker does not have to replay the whole workflow history to continue executing workflow code when the server hands it the next workflow task.
    You don't want this value to go over the WorkflowCacheSize you have configured.
    Sample Prometheus query:
    max_over_time(temporal_sticky_cache_size{namespace="default"}[10m])
    Along with this you could look at the temporal_sticky_cache_total_forced_eviction_total counter over time; it's OK if this is > 0, but you might want to alert if it jumps above a predefined threshold over a period of time (see the sketch after this list).

  5. workflow_active_thread_count (note this is relevant only in the Java SDK): Gauge metric, the number of cached workflow threads. You could alert if this number gets close to the maxWorkflowThreadCount set in your worker factory options (see the sketch after this list).
    Sample Prometheus query:
    max_over_time(temporal_workflow_active_thread_count{namespace="default"}[10m])
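
A minimal Prometheus alerting-rule sketch for the slots metric from item 1; the rule name, the 10m window, and the severity label are assumptions to adapt to your setup:

    groups:
      - name: temporal-worker-sdk
        rules:
          - alert: TemporalWorkerOutOfTaskSlots
            # fires when the worker has had (almost) no free workflow task slots for 10 minutes
            expr: avg_over_time(temporal_worker_task_slots_available{worker_type="WorkflowWorker",task_queue="<your_tq_name>"}[10m]) < 1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Worker for task queue {{ $labels.task_queue }} is out of workflow task slots"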
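
For the two schedule-to-start latency metrics (items 2 and 3), you can alert on a percentile computed from the histogram buckets. The p95 and the 1-second threshold below are placeholders, pick values that match your own latency requirements; these entries go under a rule group's rules: list:

    - alert: TemporalWorkflowTaskScheduleToStartHigh
      expr: histogram_quantile(0.95, sum by (namespace, task_queue, le) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m]))) > 1
      for: 5m
    - alert: TemporalActivityScheduleToStartHigh
      expr: histogram_quantile(0.95, sum by (namespace, task_queue, le) (rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m]))) > 1
      for: 5m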
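
For the cache and thread metrics (items 4 and 5) the pattern is the same: alert when the eviction counter or the thread gauge approaches a limit you have configured. The 500 and 600 thresholds below are placeholders and only make sense relative to your own WorkflowCacheSize and maxWorkflowThreadCount settings:

    - alert: TemporalStickyCacheForcedEvictionsHigh
      expr: increase(temporal_sticky_cache_total_forced_eviction_total[10m]) > 500
      for: 10m
    - alert: TemporalWorkflowThreadsNearLimit
      # Java SDK only; compare against your configured maxWorkflowThreadCount
      expr: max_over_time(temporal_workflow_active_thread_count[10m]) > 600
      for: 10m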

If you had to pick two SDK metrics to definitely include in your autoscaling logic, it should be the two schedule-to-start latency metrics for workflow and activity tasks.
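
One concrete way to wire those latency metrics into an autoscaler is KEDA's Prometheus scaler. Below is a sketch of a ScaledObject; the Deployment name, the Prometheus address, and the 0.5s target are assumptions for illustration, and you would typically add a second trigger for the activity latency as well:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: temporal-worker-scaler
    spec:
      scaleTargetRef:
        name: temporal-worker          # your worker Deployment (assumed name)
      minReplicaCount: 2
      maxReplicaCount: 20
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.monitoring:9090
            threshold: "0.5"           # target p95 schedule-to-start latency in seconds
            query: >-
              histogram_quantile(0.95, sum by (le) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket{task_queue="<your_tq_name>"}[5m])))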

Will include server metrics info in next reply.