What are the general metrics you recommend autoscaling workers on?
I would use a combination of SDK and server metrics.
SDK:
A good starting point is the worker tuning guide in the docs, along with these metrics:
- worker_task_slots_available: Gauge metric, reports how many task slots are available for workers to process tasks. It should stay > 0, otherwise your workers cannot keep up with processing tasks. A sample alert expression is sketched after this list.
  Sample Prometheus query:
  avg_over_time(temporal_worker_task_slots_available{namespace="default",worker_type="WorkflowWorker"}[10m])
  (or for the current value)
  temporal_worker_task_slots_available{namespace="default", worker_type="WorkflowWorker", task_queue="<your_tq_name>"}
  Note worker_type can be WorkflowWorker, ActivityWorker, or LocalActivityWorker.
- workflow_task_schedule_to_start_latency: Histogram metric, the latency from when a workflow task is placed on the task queue by the server to the time your worker picks it up for processing.
  Sample Prometheus query:
  sum by (namespace, task_queue) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m]))
  You have to define your own alert threshold here based on your performance requirements; you want this latency to be as small as possible. A percentile version of this query is sketched at the end of this reply.
- activity_schedule_to_start_latency: Histogram metric, the latency from when an activity task is placed on the task queue by the server to the time your activity workers pick it up for processing (note this can be the same worker that processes your workflows if they are on the same task queue).
  Sample Prometheus query:
  sum by (namespace, task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m]))
  Same here, you want this to be as low as possible; set the alert threshold based on your performance requirements.
- sticky_cache_size: Gauge metric, reports the size of your worker's in-memory workflow execution cache. If an execution is in the cache, your workers do not have to replay the whole workflow history to continue workflow code execution when they receive a "go" from the server to do so.
  You don't want this value to go over the WorkflowCacheSize set for the specific task queue.
  Sample Prometheus query:
  max_over_time(temporal_sticky_cache_size{namespace="default"}[10m])
  Along with this you could look at the temporal_sticky_cache_total_forced_eviction_total counter over time. It's OK for it to be > 0, but you might want to alert if it jumps over a predefined threshold within a period of time (see the sketch after this list).
- workflow_active_thread_count (note this is relevant only in the Java SDK): Gauge metric, the number of cached workflow threads. You could alert if this number gets close to the maxWorkflowThreadCount set in the worker factory options.
  Sample Prometheus query:
  max_over_time(temporal_workflow_active_thread_count{namespace="default"}[10m])
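To make the gauge metrics above actionable, here is a minimal sketch of alert-style PromQL expressions. The label values, time windows, and thresholds (the eviction rate of 10/s and the 600-thread limit) are placeholders I made up, so adjust them to your own worker configuration:

```promql
# Worker out of workflow task slots: returns a series when a worker type has had,
# on average, zero free task slots over the last 5 minutes.
avg_over_time(temporal_worker_task_slots_available{namespace="default", worker_type="WorkflowWorker"}[5m]) == 0

# Sticky cache churn: per-second rate of forced evictions. A sustained rate above
# your chosen threshold suggests WorkflowCacheSize is too small for the current load.
sum by (namespace) (rate(temporal_sticky_cache_total_forced_eviction_total[10m])) > 10

# Java SDK only: cached workflow threads approaching the configured
# maxWorkflowThreadCount (600 here stands in for your WorkerFactoryOptions value).
max_over_time(temporal_workflow_active_thread_count{namespace="default"}[10m]) > 0.8 * 600
```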
If you had to pick two SDK metrics to definitely include in your autoscaling logic, it would be the two schedule-to-start latency metrics for workflow and activity tasks.
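To turn those two histograms into a single latency number you can alert or scale on, here is a p95 sketch (assuming the standard Prometheus _bucket series with an le label; the 0.95 quantile and 5m window are arbitrary choices):

```promql
# p95 workflow task schedule-to-start latency, per task queue
histogram_quantile(0.95,
  sum by (namespace, task_queue, le) (
    rate(temporal_workflow_task_schedule_to_start_latency_seconds_bucket[5m])))

# p95 activity schedule-to-start latency, per task queue
histogram_quantile(0.95,
  sum by (namespace, task_queue, le) (
    rate(temporal_activity_schedule_to_start_latency_seconds_bucket[5m])))
```

If these stay above your target while worker_task_slots_available trends toward zero, that is usually the signal to scale workers out.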
I will include server metrics info in a follow-up reply.