How To Identify And Tune Worker Bottlenecks

Hi,

I am building my first Temporal workflow using the java-sdk. The workflow consists of 4 activities spread across 2 services (Service A and Service B).
Service A kicks off the workflow and executes Activity 1, Activity 2, and Activity 3 (all blocking) through its worker listening on task queue A. Activity 4 is executed by the worker in Service B, which listens on its own task queue B.
Originally, Activity 4 was going to be an RPC call from Service A to Service B, but since that is an anti-pattern, I decided to make it an activity in Service B.
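Roughly, the wiring looks like the sketch below; the interface names, task queue name, and timeouts are placeholders rather than my real code:

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import java.time.Duration;

// MyWorkflow, ServiceAActivities and ServiceBActivities stand for assumed
// @WorkflowInterface / @ActivityInterface definitions.
public class MyWorkflowImpl implements MyWorkflow {

    // Activities 1-3 run on Service A's own worker (task queue A, the workflow's default queue).
    private final ServiceAActivities serviceA =
        Workflow.newActivityStub(
            ServiceAActivities.class,
            ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(30))
                .build());

    // Activity 4 is routed to Service B's worker by overriding the task queue.
    private final ServiceBActivities serviceB =
        Workflow.newActivityStub(
            ServiceBActivities.class,
            ActivityOptions.newBuilder()
                .setTaskQueue("TASK_QUEUE_B")
                .setStartToCloseTimeout(Duration.ofSeconds(10))
                .build());

    @Override
    public void run(String input) {
        serviceA.activity1(input);
        serviceA.activity2(input);
        serviceA.activity3(input);
        serviceB.activity4(input); // picked up by Service B's worker on task queue B
    }
}
```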
My question is: how do I monitor the performance of the worker and its task queue B in Service B? I want Activity 4 to execute almost like a real-time synchronous RPC call, so I'd like to spot any latency in Service B's worker picking up Activity 4 tasks and tune it accordingly. It would be really helpful if you could tell me exactly which metrics to watch.
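For context, I plan to export the SDK metrics to Prometheus through the Micrometer reporter, roughly like the sketch below; the registry setup, scrape port, and report interval are assumptions on my side, following the pattern from the Java samples:

```java
import com.sun.net.httpserver.HttpServer;
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class WorkerMetrics {

    public static WorkflowServiceStubs createServiceStubsWithMetrics() throws IOException {
        // Micrometer registry that renders metrics in the Prometheus text format.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Tally scope the SDK reports worker metrics into
        // (task slots, schedule-to-start latency, sticky cache size, ...).
        Scope scope = new RootScopeBuilder()
            .reporter(new MicrometerClientStatsReporter(registry))
            .reportEvery(com.uber.m3.util.Duration.ofSeconds(10));

        // Expose a /metrics endpoint for Prometheus to scrape (port 8077 is an arbitrary choice).
        HttpServer scrapeEndpoint = HttpServer.create(new InetSocketAddress(8077), 0);
        scrapeEndpoint.createContext("/metrics", exchange -> {
            byte[] body = registry.scrape().getBytes();
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        scrapeEndpoint.start();

        // Attach the scope to the service stubs; every worker created from these stubs reports metrics.
        return WorkflowServiceStubs.newServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                .setMetricsScope(scope)
                .build());
    }
}
```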

I initially plan to have just one worker in each of the two services, as I don't expect to start more than 10 workflow executions per second.
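The Service B side would then be a single worker polling task queue B, something like the sketch below (ServiceBActivitiesImpl is a placeholder for my actual Activity 4 implementation):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class ServiceBWorker {
    public static void main(String[] args) throws Exception {
        // Service stubs with the metrics scope attached (see the metrics sketch above).
        WorkflowServiceStubs stubs = WorkerMetrics.createServiceStubsWithMetrics();
        WorkflowClient client = WorkflowClient.newInstance(stubs);

        WorkerFactory factory = WorkerFactory.newInstance(client);

        // A single worker polling Service B's task queue for Activity 4 tasks.
        Worker worker = factory.newWorker("TASK_QUEUE_B");
        worker.registerActivitiesImplementations(new ServiceBActivitiesImpl());

        factory.start();
    }
}
```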

I’ve used the above guides to tune worker performance.

Grafana Panels to monitor the metrics that matter

avg by(task_queue) (temporal_sticky_cache_size{}) Gives you the sticky cache size, which may or may not be shared across your workers, depending on the SDK

avg by(task_queue) (temporal_worker_task_slots_available{worker_type="WorkflowWorker"}) Gives you avg workflow tasks slots available per worker

avg by(task_queue) (temporal_worker_task_slots_available{worker_type="ActivityWorker"}) Gives you avg activity tasks slots available per worker

sum by(task_queue) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_sum[5m])) / sum by(task_queue) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_count[5m])) Gives you the average schedule-to-start latency for workflow tasks (the _count series on its own only counts observations; dividing _sum by _count gives the actual latency)

sum by(task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_sum[5m])) / sum by(task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_count[5m])) Gives you the average schedule-to-start latency for activity tasks

100 * avg((poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts)) Gives you the poll success rate

Actions to take based on the metrics observed above

If worker resource consumption (CPU, RAM) is low and…

  • if sticky_cache_size hits the configured workerCacheSize —> increase the worker cache size
  • if available_slots keeps dropping toward zero —> increase the task slots per worker
  • if the poll success rate AND schedule_to_start_latency are both low —> you have too many workers
  • if available_slots is high AND schedule_to_start_latency is abnormally high (keeping long polling in mind) —> increase the poller count per worker. This is rarely needed and should be your last resort. See the sketch after this list for where these knobs live in the Java SDK.
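If you do need to turn one of these knobs, here is a minimal sketch of where they live in the Java SDK's WorkerFactoryOptions and WorkerOptions; the numbers are placeholders, not recommendations:

```java
import io.temporal.client.WorkflowClient;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class TunedWorkerFactory {

    public static WorkerFactory createTunedFactory(WorkflowClient client) {
        // The worker (sticky) cache is configured at the factory level and shared
        // by all workers created from this factory.
        WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
            .setWorkflowCacheSize(600)                       // ceiling for temporal_sticky_cache_size
            .build();
        WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

        // Task slots and poller counts are configured per worker.
        WorkerOptions workerOptions = WorkerOptions.newBuilder()
            .setMaxConcurrentWorkflowTaskExecutionSize(200)  // workflow task slots
            .setMaxConcurrentActivityExecutionSize(200)      // activity task slots
            .setMaxConcurrentWorkflowTaskPollers(5)          // workflow task pollers
            .setMaxConcurrentActivityTaskPollers(5)          // activity task pollers
            .build();

        Worker worker = factory.newWorker("TASK_QUEUE_B", workerOptions);
        // Register workflow/activity implementations on `worker` here, then call factory.start().
        return factory;
    }
}
```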

Thank you, I really appreciate this.