High schedule-to-start latency even though other metrics look fine

Hi everyone, I’ve deployed a Temporal cluster to Kubernetes, with PostgreSQL hosted on AWS as the persistence store.

All the metrics look fine except schedule_to_start. What am I missing?

I have low persistence latency, low request latency, and low history latency, and I’m running only one workflow, yet schedule_to_start for both workflows and activities is too high: over 800 ms on average and 1.5 s at peak. So the issue isn’t a large number of workflows (as I said, I run just one), and there’s no problem with numHistoryShards either. My PostgreSQL looks fine according to the AWS metrics.

Please help me find the bottleneck. I saw a similar question already, but it has no answer.

I thought the problem was PostgreSQL, because I configured 512 history shards (the default value) on a database with only 2 GB of RAM and 2 cores, but according to AWS it looks fine.

Thank you! Looking forward to your answers or hints!

| schedule_to_start for both workflows and activities is too high: over 800 ms on average and 1.5 s at peak

From the server metrics, take a look at

sum(rate(persistence_requests{operation="CreateTask"}[1m]))

and share it, please. Also show your worker_task_slots_available from the SDK metrics, filtered by worker_type.
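
In case the Go SDK metrics aren’t exposed yet, here is a minimal sketch of wiring a Prometheus-backed tally scope into the client so that worker_task_slots_available gets reported (loosely following the Temporal Go samples; the listen address and frontend address below are placeholders):

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope builds a tally scope whose metrics (including the worker
// task slot gauges) are served on listenAddress for Prometheus to scrape.
func newPrometheusScope(listenAddress string) tally.Scope {
	reporter, err := prometheus.Configuration{
		ListenAddress: listenAddress,
		TimerType:     "histogram",
	}.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("unable to create Prometheus reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	// Placeholder scrape endpoint and frontend address.
	scope := newPrometheusScope("0.0.0.0:9090")

	c, err := client.Dial(client.Options{
		HostPort:       "temporal-frontend:7233",
		MetricsHandler: sdktally.NewMetricsHandler(scope),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Create and run workers with this client as usual; their SDK metrics
	// will then show up at the scrape endpoint above.
}
```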

If the persistence requests for CreateTask are low and your available task slots for WorkflowWorker/ActivityWorker are not depleted (they don’t drop to 0 when your schedule-to-start latencies are high),
then consider upping the workflow and activity task pollers from the Go SDK default of 2/2 to possibly 10/10 and see if that makes a difference; see the sketch below.
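
For reference, a minimal sketch of what that poller bump looks like in the Go SDK (the task queue name and client address are placeholders):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the frontend; address and task queue name are placeholders.
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Raise the poller counts from the Go SDK default of 2/2 to 10/10.
	w := worker.New(c, "my-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers: 10,
		MaxConcurrentActivityTaskPollers: 10,
	})

	// Register your workflows and activities here, then run the worker.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```

More pollers simply means more concurrent long-poll requests against the task queue, which tends to help when schedule-to-start latency is high even though the workers themselves are mostly idle.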

Unfortunately, this didn’t help. I assume the problem is high latency between the Temporal cluster and the persistence store, since they were located in different clouds. I tried moving them into the same cloud and the problem seems to have disappeared; I also moved them into the same VPC and subnet. But is this really the cause? Does Temporal make so many transactions against persistence that the latencies add up to something this high? Which metrics can I expose to prove it? I need something like the network latency between the Temporal cluster and the persistence store, or the number of persistence requests made to complete the workflow.

You can look at service metrics:

  1. persistence latencies

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

  2. service latencies

histogram_quantile(0.95, sum(rate(service_latency_bucket{service_type="frontend"}[5m])) by (operation, le))

(switch service_type to check the other services)

  3. for your use case, I would look at

sum(rate(persistence_requests{operation="CreateTask"}[1m]))

which can give you an indication of backlog (the number of tasks that had to be persisted to the DB because they could not be dispatched to workers right away)

I would also look at your resource-exhausted graphs:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

to see if maybe you are hitting any QPS limits.

I have tried histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le)), but it’s strange: it always displays 1s (or 1000 as a raw number), or maybe I just don’t understand how this metric works. The same happens with the service metrics, and the last metric shows no errors at all.

So could the problem simply be that they are in different clusters and subnets? Also, does Temporal make a huge number of persistence requests to complete a workflow? I mean, not just 1 or 2, but more than 10 or even 100?