Workflow Task Schedule To Start Latency High

Hi,

Following is my temporal setup.

  1. historyShards: 12000
  2. numTaskqueueWrite(/read)Partitions: 16
  3. worker options:
  • maxConcurrentWorkflowTaskExecutionSize: 200,
  • maxConcurrentActivityExecutionSize: 200,
  • maxConcurrentWorkflowTaskPollers: 5,
  • maxConcurrentActivityTaskPollers: 5,
  • maxConcurrentLocalActivityExecutionSize: 100

  4. worker factory option:

  • maxWorkflowThreadCount: 500

  5. temporal version:

  • temporalServerVersion: v1.15.2
  • temporalWebVersion: v1.14.0

  6. replicas:

  • web: 3
  • frontend: 22
  • history: 45
  • matching: 22
  • worker: 8
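
For reference, here is roughly how these options are wired up in our worker code. This is only a sketch assuming the Java SDK (the option names above match its WorkerOptions / WorkerFactoryOptions builders); the task queue name is just a placeholder.

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class WorkerSetup {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Factory option from item 4 above
    WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
        .setMaxWorkflowThreadCount(500)
        .build();

    // Worker options from item 3 above
    WorkerOptions workerOptions = WorkerOptions.newBuilder()
        .setMaxConcurrentWorkflowTaskExecutionSize(200)
        .setMaxConcurrentActivityExecutionSize(200)
        .setMaxConcurrentWorkflowTaskPollers(5)
        .setMaxConcurrentActivityTaskPollers(5)
        .setMaxConcurrentLocalActivityExecutionSize(100)
        .build();

    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);
    // "my-task-queue" is a placeholder name
    Worker worker = factory.newWorker("my-task-queue", workerOptions);
    // workflow and activity implementations are registered on the worker here
    factory.start();
  }
}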

My test workflow consists of 5-10 activities, which include making async calls to other services, interacting with a DB, and some timers. I followed the instructions here to tune worker performance, but I still see a very high Workflow Task Schedule To Start latency. The metric I used is “temporal_workflow_task_schedule_to_start_latency.95percentile”.

Following are the metric graphs captured for our load:

From the graphs above, we can see that even during the no-traffic period (between roughly 16:05 and 16:15), workflow_task_schedule_to_start_latency still does not go down. It also causes timeout exceptions during testing from time to time.

What could be a probable reason behind this, and what tuning/adjustments could I try?

Thanks!

Hi, can you look at the async match rate:

sum(rate(poll_success{}[1m])) - sum(rate(poll_success_sync{}[1m]))

It should stay at zero. Also try increasing your poller count per worker pod (maybe try 20, then 40) and see if this makes a difference.
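
As a sketch (assuming the Java SDK, since your option names match its WorkerOptions builder), that poller change would look something like this, keeping the other values you listed:

// Sketch only: same worker options as before, with pollers raised from 5 to 20 (try 40 as well)
WorkerOptions options = WorkerOptions.newBuilder()
    .setMaxConcurrentWorkflowTaskExecutionSize(200)
    .setMaxConcurrentActivityExecutionSize(200)
    .setMaxConcurrentWorkflowTaskPollers(20)
    .setMaxConcurrentActivityTaskPollers(20)
    .setMaxConcurrentLocalActivityExecutionSize(100)
    .build();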

  • maxWorkflowThreadCount: 500

Is there a reason to set this to 500? The default is 600.

Thanks Tihomir for the reply!

Poll success metric:

I pulled the metric as you mentioned (please help check whether this is the correct one), and it seems the delta is not close to 0. What could this indicate?

By the way, I saw another post mention that (poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts) should be close to 1, and that is true for my use case.
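
In query form (same style as the query above; metric names and labels may need adjusting for your setup), the ratio I checked was roughly:

(sum(rate(poll_success{}[1m])) + sum(rate(poll_success_sync{}[1m]))) / (sum(rate(poll_success{}[1m])) + sum(rate(poll_success_sync{}[1m])) + sum(rate(poll_timeouts{}[1m])))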

  • maxWorkflowThreadCount: 500

We were searching around for similar issues and found a thread mentioning that reducing the thread count may help when there are too many workers, so I just gave it a try. I also tried increasing it to 2000, but no luck there either.

Also, an update: since posting this question I changed the activity startToClose timeout from a large number to 5 minutes, but it made no difference.

Regarding the metrics:
poll_success_sync is recorded when a workflow task is dispatched without going through the matching service's DB (i.e., all successful polls that could bypass DB I/O because workers were available to pick the task up).
poll_success is recorded when a workflow task is dispatched to a worker (i.e., all successful polls for workflow tasks).

If no workers are available, the matching service has to persist the task and later, when a worker comes back up (or has capacity to pick the task up), read it back to deliver it to the worker; this adds latency.

So
sum(rate(poll_success{}[1m])) - sum(rate(poll_success_sync{}[1m]))

should be as close to 0 as possible.

That said, I think it's better to rely on percentages, so you can also use:

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

which in this case should be 95%+, ideally 99%.

have very high Workflow Task Schedule To Start Latency

Is it the same for temporal_activity_task_schedule_to_start_latency? (These are both SDK metrics.)
Together with your sync match rate, this is an indication that you need to add more workers and increase worker capacity, I think.

Yes, it is the same for temporal_activity_task_schedule_to_start_latency. Before adding more workers, do you think I should keep tuning the worker options? Or do you see any problem with the params I set up for my use case?

I think you can try increasing the poller count as suggested earlier to see if that helps (and maybe leave the thread count at the default of 600 as well).
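
As a sketch (again assuming the Java SDK): simply not calling setMaxWorkflowThreadCount on WorkerFactoryOptions leaves it at the default.

// Sketch: no explicit setMaxWorkflowThreadCount, so the default (600) applies
WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder().build();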

got it. by “poller count”, do you mean maxConcurrentWorkflowTaskPollers and maxConcurrentActivityTaskPollers?

yes