How To Identify And Tune Worker Bottlenecks

Hi,

I am building my first Temporal workflow using the java-sdk. The workflow consists of 4 activities spread across 2 services (Service A and Service B).
Service A kicks off the workflow and executes Activity 1, Activity 2, and Activity 3 (all blocking) through its worker listening on task queue A. Activity 4 is executed by the worker in Service B, which listens on its own task queue B.
Originally, Activity 4 was going to be an RPC call from Service A to Service B, but since that is an anti-pattern, I decided to make it an activity in Service B.
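Roughly, the wiring looks like the sketch below; the interface names, task queue name, and timeouts are placeholders rather than my real code:

```java
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import java.time.Duration;

// MyWorkflow, ServiceAActivities and ServiceBActivities stand for assumed
// @WorkflowInterface / @ActivityInterface definitions.
public class MyWorkflowImpl implements MyWorkflow {

    // Activities 1-3 run on Service A's own worker (task queue A, the workflow's default queue).
    private final ServiceAActivities serviceA =
        Workflow.newActivityStub(
            ServiceAActivities.class,
            ActivityOptions.newBuilder()
                .setStartToCloseTimeout(Duration.ofSeconds(30))
                .build());

    // Activity 4 is routed to Service B's worker by overriding the task queue.
    private final ServiceBActivities serviceB =
        Workflow.newActivityStub(
            ServiceBActivities.class,
            ActivityOptions.newBuilder()
                .setTaskQueue("TASK_QUEUE_B")
                .setStartToCloseTimeout(Duration.ofSeconds(10))
                .build());

    @Override
    public void run(String input) {
        serviceA.activity1(input);
        serviceA.activity2(input);
        serviceA.activity3(input);
        serviceB.activity4(input); // picked up by Service B's worker on task queue B
    }
}
```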
My question is: how do I monitor the performance of the worker and its task queue B in Service B? I want Activity 4 to execute almost like a real-time synchronous RPC call, so I'd like to spot any latency in Service B's worker picking up Activity 4 tasks and tune it accordingly. It would be really helpful if you could tell me exactly which metrics to watch.
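For context, I plan to export the SDK metrics to Prometheus through the Micrometer reporter, roughly like the sketch below; the registry setup, scrape port, and report interval are assumptions on my side, following the pattern from the Java samples:

```java
import com.sun.net.httpserver.HttpServer;
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class WorkerMetrics {

    public static WorkflowServiceStubs createServiceStubsWithMetrics() throws IOException {
        // Micrometer registry that renders metrics in the Prometheus text format.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Tally scope the SDK reports worker metrics into
        // (task slots, schedule-to-start latency, sticky cache size, ...).
        Scope scope = new RootScopeBuilder()
            .reporter(new MicrometerClientStatsReporter(registry))
            .reportEvery(com.uber.m3.util.Duration.ofSeconds(10));

        // Expose a /metrics endpoint for Prometheus to scrape (port 8077 is an arbitrary choice).
        HttpServer scrapeEndpoint = HttpServer.create(new InetSocketAddress(8077), 0);
        scrapeEndpoint.createContext("/metrics", exchange -> {
            byte[] body = registry.scrape().getBytes();
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        scrapeEndpoint.start();

        // Attach the scope to the service stubs; every worker created from these stubs reports metrics.
        return WorkflowServiceStubs.newServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                .setMetricsScope(scope)
                .build());
    }
}
```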

I initially plan to have just one worker in each of the two services, as I don't expect to start more than 10 workflow executions per second.
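The Service B side would then be a single worker polling task queue B, something like the sketch below (ServiceBActivitiesImpl is a placeholder for my actual Activity 4 implementation):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class ServiceBWorker {
    public static void main(String[] args) throws Exception {
        // Service stubs with the metrics scope attached (see the metrics sketch above).
        WorkflowServiceStubs stubs = WorkerMetrics.createServiceStubsWithMetrics();
        WorkflowClient client = WorkflowClient.newInstance(stubs);

        WorkerFactory factory = WorkerFactory.newInstance(client);

        // A single worker polling Service B's task queue for Activity 4 tasks.
        Worker worker = factory.newWorker("TASK_QUEUE_B");
        worker.registerActivitiesImplementations(new ServiceBActivitiesImpl());

        factory.start();
    }
}
```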

I’ve used the above guides to tune worker performance.

Grafana Panels to monitor the metrics that matter

avg by(task_queue) (temporal_sticky_cache_size{}) Gives you the sticky cache size, which may or may not be shared across your workers, depending on the SDK

avg by(task_queue) (temporal_worker_task_slots_available{worker_type="WorkflowWorker"}) Gives you avg workflow tasks slots available per worker

avg by(task_queue) (temporal_worker_task_slots_available{worker_type="ActivityWorker"}) Gives you avg activity tasks slots available per worker

sum by(task_queue) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_sum[5m])) / sum by(task_queue) (rate(temporal_workflow_task_schedule_to_start_latency_seconds_count[5m])) Gives you the average schedule-to-start latency for workflow tasks (the _count series on its own only counts observations; dividing _sum by _count gives the actual latency)

sum by(task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_sum[5m])) / sum by(task_queue) (rate(temporal_activity_schedule_to_start_latency_seconds_count[5m])) Gives you the average schedule-to-start latency for activity tasks

100 * avg((poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts)) Gives you the poll success rate

Actions to take based on the metrics observed above

If worker resource consumption (CPU, RAM) is low and…

  • if sticky_cache_size hits the configured workerCacheSize —> increase the worker cache size
  • if available_slots keeps dropping toward zero —> increase the task slots per worker
  • if the poll success rate AND schedule_to_start_latency are both low —> you have too many workers
  • if available_slots is high AND schedule_to_start_latency is abnormally high (keeping long polling in mind) —> increase the poller count per worker. This is rarely needed and should be your last resort. See the sketch after this list for where these knobs live in the Java SDK.
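If you do need to turn one of these knobs, here is a minimal sketch of where they live in the Java SDK's WorkerFactoryOptions and WorkerOptions; the numbers are placeholders, not recommendations:

```java
import io.temporal.client.WorkflowClient;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class TunedWorkerFactory {

    public static WorkerFactory createTunedFactory(WorkflowClient client) {
        // The worker (sticky) cache is configured at the factory level and shared
        // by all workers created from this factory.
        WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
            .setWorkflowCacheSize(600)                       // ceiling for temporal_sticky_cache_size
            .build();
        WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

        // Task slots and poller counts are configured per worker.
        WorkerOptions workerOptions = WorkerOptions.newBuilder()
            .setMaxConcurrentWorkflowTaskExecutionSize(200)  // workflow task slots
            .setMaxConcurrentActivityExecutionSize(200)      // activity task slots
            .setMaxConcurrentWorkflowTaskPollers(5)          // workflow task pollers
            .setMaxConcurrentActivityTaskPollers(5)          // activity task pollers
            .build();

        Worker worker = factory.newWorker("TASK_QUEUE_B", workerOptions);
        // Register workflow/activity implementations on `worker` here, then call factory.start().
        return factory;
    }
}
```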

Thank you, I really appreciate this.