How To Identify And Tune Worker Bottlenecks

I’ve used the above guides to tune worker performance.

Grafana Panels to monitor the metrics that matter

avg by(task_queue) (temporal_sticky_cache_size{}) Gives you the sticky cache size, which may or may not be shared across your workers depending on the SDK

avg by(task_queue) (temporal_worker_task_slots_available{worker_type="WorkflowWorker"}) Gives you avg workflow tasks slots available per worker

avg by(task_queue) (temporal_worker_task_slots_available{worker_type="ActivityWorker"}) Gives you avg activity tasks slots available per worker

sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_sum[5m])) / sum(rate(temporal_workflow_task_schedule_to_start_latency_seconds_count[5m])) Gives you the average schedule-to-start latency for workflow tasks (the _count series on its own only counts observations, not latency)

sum(rate(temporal_activity_schedule_to_start_latency_seconds_sum[5m])) / sum(rate(temporal_activity_schedule_to_start_latency_seconds_count[5m])) Gives you the average schedule-to-start latency for activity tasks

100 * avg((poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts)) Gives you the poll success rate as a percentage
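
These panels assume the SDK metrics are already being scraped by Prometheus. If they are not, here is a minimal sketch, assuming the Go SDK and its Tally contrib package, of exposing the worker metrics on a Prometheus endpoint; the listen address, port, and timer type are assumptions to adjust for your own setup:

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope builds a Tally scope that serves Prometheus metrics
// on the configured listen address.
func newPrometheusScope(cfg prometheus.Configuration) tally.Scope {
	reporter, err := cfg.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("error creating prometheus reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	// Adjusts metric naming for Prometheus (e.g. the _seconds suffix on latency timers).
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	// The metrics handler is attached to the client; every worker created
	// from this client reports the temporal_* metrics queried in the panels above.
	c, err := client.Dial(client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope(prometheus.Configuration{
			ListenAddress: "0.0.0.0:9090", // assumption: scrape target for Prometheus
			TimerType:     "histogram",    // emit latencies as histograms (_bucket/_sum/_count)
		})),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()
	// ... create and run workers with this client as usual.
}
```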

Actions to take based on the metrics observed above

If worker resource consumption (CPU, RAM) is low and…

  • if sticky_cache_size hits workerCacheSize -> increase the worker cache size (see the worker options sketch after this list)
  • if available_slots regularly falls toward zero -> increase the slots per worker
  • if the poll success rate AND schedule_to_start_latency are both low -> you have too many workers competing for the same task queue
  • if available_slots is high AND schedule_to_start_latency is abnormally high even with long polling -> increase the poller count per worker; this is rarely needed and should be your last resort
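
Where exactly these knobs live varies by SDK. As a rough illustration only, here is a minimal Go SDK sketch showing where the sticky cache size, execution slots, and poller counts are set; the task queue name and every numeric value are placeholders, not recommendations:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Sticky cache: in the Go SDK this is a single cache shared by all workers
	// in the process. Raise it when temporal_sticky_cache_size keeps hitting the limit.
	worker.SetStickyWorkflowCacheSize(4096)

	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	w := worker.New(c, "your-task-queue", worker.Options{
		// Slots: raise these when temporal_worker_task_slots_available keeps
		// falling toward zero while CPU/RAM are still low.
		MaxConcurrentWorkflowTaskExecutionSize: 200,
		MaxConcurrentActivityExecutionSize:     400,

		// Pollers: raise these only as a last resort, when slots stay free but
		// schedule-to-start latency remains abnormally high.
		MaxConcurrentWorkflowTaskPollers: 4,
		MaxConcurrentActivityTaskPollers: 4,
	})

	// w.RegisterWorkflow(...) and w.RegisterActivity(...) omitted for brevity.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```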