Suggested metrics to autoscale Temporal workers on

Hi, what are the general metrics you recommend autoscaling workers on? I currently have 3 workers, each working on a separate task queue. Two of them only execute activities, while one only executes workflows. I have seen posts that seem to recommend autoscaling on schedule_to_start latency, but I see multiple schedule_to_start latency metrics, namely:

temporal.task_schedule_to_start_latency
temporal_activity_schedule_to_start_latency
temporal_workflow_task_schedule_to_start_latency

Do you have any recommendations on which metrics we should be autoscaling on, and for which workers (given that some are activities-only and some workflows-only)? Can you also explain in layman’s terms what the task_schedule_to_start latency measures?

Also curious, if there are multiple retries on an activity, does schedule_to_start indicate the time until the successful retry?

Let’s say I have an activity that runs 5 times (because of failures, but is retried each time) and is first scheduled at 12:00:00. The first attempt is at 12:00:02, indicating a schedule-to-start latency of 2s. Does this metric get bumped on each subsequent retry, or does it stay at 2s?

what are the general metrics you recommend autoscaling workers on?

Would use a combination of SDK and server metrics.

SDK:
A good starting point is the worker tuning guide in the docs, along with these metrics:

  1. worker_task_slots_available: Gauge metric, reports how many task slots are available for workers to process tasks. It should be > 0, otherwise your workers are not able to keep up with processing tasks.
    Sample Prometheus query:
    avg_over_time(temporal_worker_task_slots_available{namespace="default",worker_type="WorkflowWorker"}[10m])
    (or for the current value)
    temporal_worker_task_slots_available{namespace="default", worker_type="WorkflowWorker", task_queue="<your_tq_name>"}
    Note worker_type can be WorkflowWorker, ActivityWorker, or LocalActivityWorker.

  2. workflow_task_schedule_to_start_latency: Histogram metric, the latency from when a workflow task is placed on the task queue by the server to the time your worker picks it up for processing.
    Sample Prometheus query:
    sum by (namespace, task_queue) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m]))
    You would have to define your own alert latency number here per your perf requirements (see the percentile sketch after this list). You want this latency to be as small as possible.

  3. activity_schedule_to_start_latency: Histogram metric, the latency from when an activity task is placed on the task queue by the server to the time your activity worker (note it can be the same worker that processes your workflows if they are on the same task queue) picks it up for processing.
    Sample Prometheus query:
    sum by (namespace, task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m]))
    Same here, you want this to be as low as possible, within your performance requirements.

  4. sticky_cache_size: Gauge metric, reports the size of your worker’s in-memory cache. Your workers have a workflow execution cache; if an execution is in the cache your workers do not have to replay the whole workflow history to continue workflow code execution when they receive a “go” from the server to do so.
    You don’t want this value to go over the configured WorkflowCacheSize for the specific task queue.
    Sample Prometheus query:
    max_over_time(temporal_sticky_cache_size{namespace="default"}[10m])
    Along with this you could look at the temporal_sticky_cache_total_forced_eviction_total counter over time; it’s ok if this is > 0, but you might want to alert if this number jumps over a predefined threshold over a period of time.

  5. workflow_active_thread_count (note this is relevant only in the Java SDK): Gauge metric, the number of cached workflow threads. You could alert if this number gets close to the configured maxWorkflowThreadCount in the worker factory options.
    Sample Prometheus query:
    max_over_time(temporal_workflow_active_thread_count{namespace="default"}[10m])
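
Putting these together, here is a rough sketch of queries you could feed an autoscaler. Assumptions on my part: your SDK exports the histograms in seconds with the _seconds_bucket suffix shown above and attaches the namespace/task_queue/worker_type labels, and the slot threshold below is just a placeholder, so adjust names and thresholds to whatever your setup actually emits.

# p95 schedule-to-start latency per task queue (items 2 and 3)
histogram_quantile(0.95, sum by (namespace, task_queue, le) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (namespace, task_queue, le) (rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m])))

# scale-up signal when workflow task slots are (nearly) exhausted (item 1)
min_over_time(temporal_worker_task_slots_available{namespace="default", worker_type="WorkflowWorker"}[5m]) < 2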

If you had to pick two SDK metrics to definitely include in your autoscaling logic, it should be the two schedule-to-start latency metrics for activity and workflow tasks.

Will include server metrics info in next reply.

Server metrics:

  1. Sync match rate (Matching service metrics). Measures the rate of tasks that can be delivered to workers without having to be persisted (workers are up and available to pick them up) relative to the rate of all delivered tasks. You want this to be > 95%, ideally 99%. If the sync match rate is low you should consider increasing worker capacity (see the example after this list).
    Metrics:
    poll_success_sync - tasks delivered to workers via sync match (without being persisted)
    poll_success - total tasks delivered to workers
    Sample query:
    sum(rate(poll_success_sync{namespace="<namespace_name>"}[1m])) / sum(rate(poll_success{namespace="<namespace_name>"}[1m]))

  2. task_schedule_to_start_latency - timer metric, the latency between when a task is scheduled and when it is delivered to your workers. If this latency increases, it’s a good indication to add / scale up your workers.
    You can use the mentioned SDK metrics workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency instead of this if you want.
    Sample query:
    histogram_quantile(0.95, sum(rate(task_schedule_to_start_latency_bucket{namespace="default", taskqueue="metricsqueue"}[10m])) by (task_type, le))
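
If you want to turn the sync match rate into an alert / scale-up rule, something along these lines should work (assuming the server metrics are scraped under these names with a namespace label; the 0.95 threshold is just the rule of thumb from item 1):

# fire / scale up when less than 95% of tasks are sync matched over the last 5m
sum(rate(poll_success_sync{namespace="default"}[5m]))
  / sum(rate(poll_success{namespace="default"}[5m])) < 0.95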

Worker container/pod CPU utilization:
In addition to Temporal metrics, you should also measure your worker containers’/pods’ CPU utilization. This, in addition to the schedule-to-start latencies (SDK metrics), can give you a pretty good indication of when to scale workers up (for example when CPU utilization is above a defined threshold or the schedule-to-start latencies are high) or down (when CPU utilization is low).
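
If your workers run on Kubernetes and Prometheus scrapes cAdvisor and kube-state-metrics, CPU utilization relative to the pod’s CPU limit could look roughly like this (the temporal-worker-.* pod name pattern is a made-up example, substitute your own):

# fraction of the CPU limit used by each worker pod over the last 5m
sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"temporal-worker-.*"}[5m]))
  / sum by (pod) (kube_pod_container_resource_limits{resource="cpu", pod=~"temporal-worker-.*"})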

Hope this helps you get started.

Gotcha, appreciate the reply. I still have the question: if there are multiple retries on an activity, does schedule_to_start indicate the time until the successful/last retry, or until the first attempt?

Let’s say I have an activity that is retried 5 times, but is first scheduled at 12:00:00. The first attempt is at 12:00:02, indicating a schedule-to-start latency of 2s. Does this metric get bumped on each subsequent retry (in theory resulting in a metric of 10s), or does it stay at 2s?

Hi!

I came across this thread and I was curious about the queries presented here:

Particularly for items 2 and 3, I wanted to ask what the output of this particular query is:

sum by (namespace, task_queue) (rate(temporal_[workflow_task|activity]_schedule_to_start_latency_seconds_bucket[5m]))
  1. Is the output the number of jobs that remain in the queue for the last 5 minutes?
  2. Or is this a time-related value (seconds, milliseconds)?
  3. Why do we need this value to be as low as possible?
  4. And what’s the difference between using sum by (namespace, task_queue) vs sum by (namespace, task_queue, le)?

I apologize for the basic questions. I am new to Prometheus metrics and I wanted to understand what the particular query means and how we can use it to scale our services.

Thank you so much and have a nice day!

The Prometheus rate function calculates the per-second average rate of increase over time.
The query you mentioned would show the rate of increase of the workflow and activity schedule-to-start latencies, meaning how long workflow and activity tasks are sitting on the server (matching service task queues) before your worker pollers are able to poll and process them.
Seeing an increase in these latencies can indicate that your workers are unable to handle the load (number of tasks) / are not able to poll them, for example they don’t have capacity to process them, or the workers are down.

  1. It’s the amount of time (latency) that workflow/activity tasks spend waiting in task queues for worker pollers to pick them up for processing.
  2. Yes, histogram buckets in Prometheus are by default in seconds and fractions of seconds (see the example after this list).
  3. I think you don’t want these latencies to be too high as that can indicate issues; at the same time, if you see them very low it can indicate overprovisioned workers, which might not be a bad thing. I think what’s important is to check these latencies during normal operating workload as well as at your peak load.
  4. I think if you wanted to additionally sum by histogram bucket it should be ok, did you give it a try?
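
If you want an actual time value out of these histograms rather than an observation rate, one option is to divide the _sum series by the _count series that Prometheus histograms expose (assuming the same metric naming as earlier in this thread):

# average activity schedule-to-start latency (seconds) over the last 5m
sum by (namespace, task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_sum[5m]))
  / sum by (namespace, task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_count[5m]))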

Thanks for the wonderful explanation!

So for the last item (#4), you’re right. Adding le just segregates the summations by bucket instead of aggregating the values across all buckets.

If we sum rates for all buckets, won’t that create duplicate values though, since, for example, the bucket for 10,000ms can have the same rate as the bucket for +Inf? Does it actually matter?

Thanks again for answering my questions!

I believe you are correct; for your use case you could use histogram_quantile.

For example:

histogram_quantile(0.95, sum(rate(temporal_activity_schedule_to_start_latency_bucket[5m])) by (namespace, activity_type, le))

So to clarify, this quantile metric would give us a value (in milliseconds) of how long the longest schedule-to-start latencies are (the top 5%) versus the latencies for the rest of the activities (the other 95%)?

I tried this metric and I got a value of 77,000 in one instance and 62,000 in another (different evaluation times). I would assume these values are in milliseconds? Also, I tried to look at the execution durations (by getting all start and close times via the Temporal CLI and computing the difference) to see if there were any durations right around the two aforementioned lengths, but all the workflows finished in between 19 and 38 seconds. May I know how we should interpret the two values returned by the query?

For context, part of this workflow spawns 25 activities of the same kind and there’s only one worker to process all 25 activities (we’re still trying to figure out scaling at this point).

Thanks!

Hi,

Thanks for such a detailed explanation of worker tuning.

I read several articles such as introduction-to-worker-tuning and worker-performance, and I understand how scaling up works, but I have a question about scaling down.

Say I am deploying Temporal workers as K8s pods in a K8s Deployment, and using the Horizontal Pod Autoscaler to scale the deployment up/down. When worker_task_slots_available is low or the Schedule-To-Start latency is high, the K8s deployment will scale up. I am not quite sure what happens when the deployment needs to be scaled down. Is it possible that the K8s HPA kills some of the pods while they are executing activities? And if so, is that okay, since the activity will not complete and Temporal will re-run it on another worker later?

Thanks
