I’ve used the above guides to tune worker performance.
Grafana Panels to monitor the metrics that matter
avg by(task_queue) (temporal_sticky_cache_size{})
Gives you the sticky cache size, which may or may not be shared across your workers, depending on the SDK
avg by(task_queue) (temporal_worker_task_slots_available{worker_type="WorkflowWorker"})
Gives you the average workflow task slots available per worker
avg by(task_queue) (temporal_worker_task_slots_available{worker_type="ActivityWorker"})
Gives you the average activity task slots available per worker
avg(temporal_workflow_task_schedule_to_start_latency_seconds_count{})
Gives you schedule-to-start latency for workflow tasks
avg(temporal_activity_schedule_to_start_latency_seconds_count{})
Gives you schedule-to-start latency for activity tasks
(note that the _count series tracks the number of recorded samples; for the latency value itself, apply histogram_quantile to the corresponding _bucket series)
100 * avg((poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts))
Gives you the poll success rate as a percentage
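These panels assume the SDK metrics are already exported somewhere Prometheus can scrape. As a hedged reference only, here is a minimal sketch of wiring that up with the Go SDK and the tally Prometheus reporter; the listen address and the default connection options are assumptions, and other SDKs expose an equivalent metrics option on their client or runtime.

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope exposes SDK metrics (sticky cache size, task slots,
// schedule-to-start latencies, poll results) on an HTTP endpoint that
// Prometheus can scrape.
func newPrometheusScope() tally.Scope {
	reporter, err := prometheus.Configuration{
		ListenAddress: "0.0.0.0:9090", // assumed scrape target; pick your own port
		TimerType:     "histogram",    // emits _bucket/_count/_sum series for latencies
	}.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("error creating prometheus reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	// Apply Prometheus naming conventions (e.g. the _seconds suffix on timers)
	// so the metric names line up with the queries above.
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	// Assumes a locally reachable Temporal frontend with default connection options.
	c, err := client.Dial(client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope()),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()
	// Register and run workers with this client as usual.
}
```

With a worker process running against this client, the temporal_* series used in the panels above become scrapeable at the configured listen address.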
Actions to take based on the metrics observed above
If worker resource consumption (CPU, RAM) is low and…
- if sticky_cache_size hits workerCacheSize —> Increase the worker cache size (see the sketch after this list)
- if available_slots regularly drops to (or near) zero —> Increase the slots per worker
- if the poll success rate AND schedule_to_start_latency are both low —> You have too many workers
- if available_slots stays high AND schedule_to_start_latency is abnormally high (beyond normal long-polling delays) —> Increase the poller count per worker. This is rarely needed and should be your last resort
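All three knobs above map to worker configuration in whichever SDK you use. A minimal sketch, assuming the Go SDK; the task queue name and the numbers are placeholders to show where each option lives, not recommended values.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Assumes a locally reachable Temporal frontend with default connection options.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Sticky workflow cache: in the Go SDK this is a process-wide setting shared by
	// all workers. Raise it if temporal_sticky_cache_size keeps hitting the limit.
	// Must be called before any worker starts.
	worker.SetStickyWorkflowCacheSize(4096)

	w := worker.New(c, "example-task-queue", worker.Options{
		// Task slots: raise these if temporal_worker_task_slots_available keeps
		// dropping to zero while CPU/RAM on the worker host stay low.
		MaxConcurrentWorkflowTaskExecutionSize: 512,
		MaxConcurrentActivityExecutionSize:     1024,

		// Pollers per worker: the last-resort knob, for when slots stay free
		// but schedule-to-start latency remains abnormally high.
		MaxConcurrentWorkflowTaskPollers: 4,
		MaxConcurrentActivityTaskPollers: 4,
	})

	// w.RegisterWorkflow(...) and w.RegisterActivity(...) as usual.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

Equivalent options exist in the other SDKs (for example, cache size, execution slots, and poller counts in the Java SDK's WorkerFactoryOptions and WorkerOptions), so the same dashboard-driven reasoning applies regardless of language.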