Troubleshooting workflow execution latency

Hi all,

I implemented two simple workflows: each contains one simple activity that completes in tens of milliseconds. They run within the same worker and on the same task queue.

In my performance test, I trigger them through my REST API using JMeter with 50 threads.
In this case, I observe that one of the two workflows has a latency of more than 1.5s from submission to the execution of the workflow method, while the other one seems to work fine with a much smaller latency.
My questions are:

  1. Any possible explanation of the workflow execution latency?
  2. Are there any metrics that can help analyze this problem?

I see there are many metrics exposed in Prometheus. I tried to monitor temporal_workflow_task_schedule_to_start_latency_sum, but only a series with the namespace temporal_system is found, while my workflows are submitted to the namespace default. Now I am stuck on how to dig further.

I see there are many metrics exposed in Prometheus. I tried to monitor temporal_workflow_task_schedule_to_start_latency_sum, but only a series with the namespace temporal_system is found

I assume here you are looking at server metrics, not SDK metrics. The Temporal server emits SDK metrics for its own internally running workflows (on the internal temporal_system namespace). In order to monitor schedule-to-start latencies for your workflows, you need to set up and query SDK metrics from your worker/client process.
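As a minimal sketch of what that setup can look like, assuming the Java SDK with Micrometer's Prometheus registry (the port 8077 and the /metrics path are arbitrary choices for illustration):

```java
import com.sun.net.httpserver.HttpServer;
import com.uber.m3.tally.RootScopeBuilder;
import com.uber.m3.tally.Scope;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import io.temporal.client.WorkflowClient;
import io.temporal.common.reporter.MicrometerClientStatsReporter;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

import java.io.OutputStream;
import java.net.InetSocketAddress;

public class MetricsWorkerStarter {
    public static void main(String[] args) throws Exception {
        // Micrometer registry that renders metrics in the Prometheus text format.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Tally scope backed by the registry; the SDK reports its metrics here,
        // including temporal_workflow_task_schedule_to_start_latency.
        Scope scope = new RootScopeBuilder()
                .reporter(new MicrometerClientStatsReporter(registry))
                .reportEvery(com.uber.m3.util.Duration.ofSeconds(1));

        // Attach the scope to the service stubs used by your worker and client.
        WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
                WorkflowServiceStubsOptions.newBuilder()
                        .setMetricsScope(scope)
                        .build());
        WorkflowClient client = WorkflowClient.newInstance(service);

        // Expose a scrape endpoint so Prometheus can collect the SDK metrics.
        HttpServer scrapeEndpoint = HttpServer.create(new InetSocketAddress(8077), 0);
        scrapeEndpoint.createContext("/metrics", exchange -> {
            byte[] body = registry.scrape().getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        scrapeEndpoint.start();

        // ... create a WorkerFactory from this client and register your workers as usual ...
    }
}
```

With this in place, the schedule-to-start series should appear under your default namespace rather than only under temporal_system.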

  1. Any possible explanation of the workflow execution latency?

On the worker side, see the worker tuning guide and definitely measure the schedule-to-start latencies (temporal_workflow_task_schedule_to_start_latency) as you mentioned.
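If the schedule-to-start latency turns out to be the bottleneck, the usual knobs are the worker's poller counts and executor slot sizes. A minimal sketch, assuming the Java SDK (the concrete values are illustrative and "my-task-queue" is a placeholder for your shared task queue):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerOptions;

public class TunedWorker {
    // Build a worker whose poller and execution-slot counts are raised so that
    // 50 concurrent JMeter threads do not leave workflow tasks waiting in the queue.
    static Worker newTunedWorker(WorkflowClient client) {
        WorkerOptions options = WorkerOptions.newBuilder()
                // Concurrent pollers fetching workflow tasks from the task queue.
                .setMaxConcurrentWorkflowTaskPollers(10)
                // Workflow tasks the worker may execute in parallel.
                .setMaxConcurrentWorkflowTaskExecutionSize(200)
                // Activities the worker may execute in parallel.
                .setMaxConcurrentActivityExecutionSize(200)
                .build();

        WorkerFactory factory = WorkerFactory.newInstance(client);
        return factory.newWorker("my-task-queue", options);
    }
}
```

Raise these gradually while watching the schedule-to-start metric; if it stays high even with generous settings, the limit is likely elsewhere (CPU, server, or task-queue partitioning).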
On the server side it would be good to check the asyncmatch_latency metric; it measures async-matched tasks from the time they are created until they are delivered. The larger this latency, the longer tasks are sitting in the queue waiting for your workers to pick them up.
Hope this gets you started in the right direction.
