High schedule-to-start latency

Deployment Architecture:
Worker processes: 14 pods with 13 GB memory/7 CPU cores
Temporal frontend: 8 pods with 1 GB memory/1 CPU core
Temporal history: 25 pods with 2 GB memory/3 CPU cores
Temporal matching: 17 pods with 2 GB memory/3 CPU cores
Temporal workers: 20 pods with 2 GB memory/3 CPU cores
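For reference, this is roughly how the worker pods are sized; a minimal Kubernetes Deployment sketch, assuming the workers run on Kubernetes (the Deployment name, labels, and image are placeholders, not from our actual manifests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker          # placeholder name
spec:
  replicas: 14                   # 14 worker pods
  selector:
    matchLabels:
      app: temporal-worker
  template:
    metadata:
      labels:
        app: temporal-worker
    spec:
      containers:
        - name: worker
          image: my-registry/temporal-worker:latest   # placeholder image
          resources:
            requests:
              cpu: "7"           # 7 CPU cores per pod
              memory: 13Gi       # 13 GB memory per pod
            limits:
              cpu: "7"
              memory: 13Gi
```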

We have configured workers with the following values:
MAX_CONCURRENT_ACTIVITY_POLLERS = 20
MAX_CONCURRENT_WORKFLOW_TASK_POLLERS = 20
MAX_WORKFLOW_THREAD_COUNT = 5000
MAX_CONCURRENT_ACTIVITY_EXECUTION_SIZE = 600
WORKFLOW_CACHE_SIZE = 5000
MAX_CONCURRENT_LOCAL_ACTIVITY_EXECUTION_SIZE = 600
MAX_CONCURRENT_WORKFLOW_TASK_EXECUTION_SIZE = 600
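
These are the Java SDK worker knobs; for clarity, here is a minimal sketch of how they map onto WorkerOptions and WorkerFactoryOptions, assuming the Java SDK (the client setup and task queue name are placeholders):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class WorkerSetup {
  public static void main(String[] args) {
    // Placeholder client setup; the real deployment points at the Temporal frontend service.
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Factory-level settings: sticky workflow cache and workflow thread pool.
    WorkerFactoryOptions factoryOptions =
        WorkerFactoryOptions.newBuilder()
            .setWorkflowCacheSize(5000)        // WORKFLOW_CACHE_SIZE
            .setMaxWorkflowThreadCount(5000)   // MAX_WORKFLOW_THREAD_COUNT
            .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // Per-worker settings: poller counts and executor slot sizes.
    WorkerOptions workerOptions =
        WorkerOptions.newBuilder()
            .setMaxConcurrentActivityTaskPollers(20)          // MAX_CONCURRENT_ACTIVITY_POLLERS
            .setMaxConcurrentWorkflowTaskPollers(20)          // MAX_CONCURRENT_WORKFLOW_TASK_POLLERS
            .setMaxConcurrentActivityExecutionSize(600)       // MAX_CONCURRENT_ACTIVITY_EXECUTION_SIZE
            .setMaxConcurrentLocalActivityExecutionSize(600)  // MAX_CONCURRENT_LOCAL_ACTIVITY_EXECUTION_SIZE
            .setMaxConcurrentWorkflowTaskExecutionSize(600)   // MAX_CONCURRENT_WORKFLOW_TASK_EXECUTION_SIZE
            .build();

    Worker worker = factory.newWorker("my-task-queue", workerOptions); // placeholder task queue
    // worker.registerWorkflowImplementationTypes(MyWorkflowImpl.class);
    // worker.registerActivitiesImplementations(new MyActivitiesImpl());

    factory.start();
  }
}
```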

Problem:
We see high schedule_to_start_latency when around 200 workflows are created per second.
From the metrics, we also observe that a large number of worker task_slots remain available, i.e., the workers are not running out of execution slots.

What could be a probable reason behind this?

For everyone’s reference, here are some useful pointers on the same topic:

The number of task queue partitions can be set via “matching.numTaskqueueWritePartitions” and “matching.numTaskqueueReadPartitions” in the dynamic config.
Both are server-side dynamic config knobs (one for reads, one for writes).
The default is 4, which should usually do the job; if not, try setting it to 8.
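
If it helps, this is roughly what that looks like in the server’s dynamic config file; a minimal sketch assuming the standard YAML dynamic config format (the task queue name constraint is a placeholder for your own queue):

```yaml
# Server-side dynamic config; values can be scoped per task queue via constraints.
matching.numTaskqueueReadPartitions:
  - value: 8
    constraints:
      taskQueueName: "my-task-queue"   # placeholder task queue name
matching.numTaskqueueWritePartitions:
  - value: 8
    constraints:
      taskQueueName: "my-task-queue"
```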
