High schedule-to-start latency even though other metrics look fine

Hi everyone, I’ve deployed a Temporal cluster to Kubernetes, with PostgreSQL hosted on AWS as the persistence store.

All the metrics look fine except schedule_to_start. What am I missing?

I have low persistence latency, low request latency, and low history latency, and I’m running only one workflow, yet schedule_to_start for both workflows and activities is too high: over 800 ms on average and 1.5 s at peak. So the issue isn’t a large number of workflows (as I said, I run just one), and there’s no problem with numHistoryShards either. My PostgreSQL looks fine according to the AWS metrics.

Please help me find the bottleneck. I saw a similar question already, but it has no answer.

I thought the problem was PostgreSQL, because I configured 512 history shards (the default value) on a database with only 2 GB of RAM and 2 cores, but according to AWS it looks fine.

Thank you! Looking forward to your answers or hints!

| schedule_to_start for both workflows and activities is too high: over 800 ms on average and 1.5 s at peak

From the server metrics, take a look at

sum(rate(persistence_requests{operation="CreateTask"}[1m]))

and share it, please. Also show your worker_task_slots_available from the SDK metrics, filtered by worker_type.
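
In case the Go SDK metrics aren’t exposed yet, here is a minimal sketch of wiring a Prometheus-backed tally scope into the client so that worker_task_slots_available gets reported (loosely following the Temporal Go samples; the listen address and frontend address below are placeholders):

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope builds a tally scope whose metrics (including the worker
// task slot gauges) are served on listenAddress for Prometheus to scrape.
func newPrometheusScope(listenAddress string) tally.Scope {
	reporter, err := prometheus.Configuration{
		ListenAddress: listenAddress,
		TimerType:     "histogram",
	}.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("unable to create Prometheus reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	// Placeholder scrape endpoint and frontend address.
	scope := newPrometheusScope("0.0.0.0:9090")

	c, err := client.Dial(client.Options{
		HostPort:       "temporal-frontend:7233",
		MetricsHandler: sdktally.NewMetricsHandler(scope),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Create and run workers with this client as usual; their SDK metrics
	// will then show up at the scrape endpoint above.
}
```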

If the persistence requests for CreateTask are low and your available task slots for WorkflowWorker/ActivityWorker are not depleted (they don’t drop to 0 when your schedule-to-start latencies are high),
then consider upping the workflow and activity task pollers from the Go SDK default of 2/2 to possibly 10/10 and see if that makes a difference; see the sketch below.
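
For reference, a minimal sketch of what that poller bump looks like in the Go SDK (the task queue name and client address are placeholders):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the frontend; address and task queue name are placeholders.
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Raise the poller counts from the Go SDK default of 2/2 to 10/10.
	w := worker.New(c, "my-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers: 10,
		MaxConcurrentActivityTaskPollers: 10,
	})

	// Register your workflows and activities here, then run the worker.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```

More pollers simply means more concurrent long-poll requests against the task queue, which tends to help when schedule-to-start latency is high even though the workers themselves are mostly idle.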

Unfortunately, this didn’t help. I assume the problem is high latency between the Temporal cluster and the persistence store, since they were located in different clouds. I tried moving them into the same cloud and the problem seems to have disappeared; I also moved them into the same VPC and subnet. But is this really the cause? Does Temporal make so many transactions against persistence that the latencies add up to something this high? Which metrics can I expose to prove it? I need something like the network latency between the Temporal cluster and the persistence store, or the number of persistence requests made to complete the workflow.

You can look at service metrics:

  1. persistence latencies

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

  2. service latencies

histogram_quantile(0.95, sum(rate(service_latency_bucket{service_type="frontend"}[5m])) by (operation, le))

(switch service_type to check the other services)

  3. for your use case, I would look at

sum(rate(persistence_requests{operation="CreateTask"}[1m]))

which can give you an indication of backlog (the number of tasks that had to be persisted to the DB because they could not be dispatched to workers right away)

I would also look at your resource-exhausted graphs:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

to see if maybe you are hitting any QPS limits.

I have tried histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le)), but it’s strange: it always displays 1s (or 1000 as a raw number), or maybe I just don’t understand how this metric works. The same happens with the service metrics, and the last metric shows no errors at all.

So could the problem simply be that they are in different clusters and subnets? Also, does Temporal make a huge number of persistence requests to complete a workflow? I mean, not just 1 or 2, but more than 10 or even 100?