We are using the TypeScript SDK and are able to scrape Prometheus metrics.
I am running one workflow with 5 activities. A single run completes in 3.335 seconds.
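For context, the metrics come from the SDK's built-in Prometheus endpoint; our setup looks roughly like this (a minimal sketch — the bind address, port, and task queue name are placeholders, not our actual values):

```typescript
import { Runtime, Worker } from '@temporalio/worker';
import * as activities from './activities';

async function main() {
  // Install the Runtime once per process, before creating any Worker,
  // so SDK Core exposes Prometheus metrics on the given address.
  Runtime.install({
    telemetryOptions: {
      metrics: { prometheus: { bindAddress: '0.0.0.0:9464' } },
    },
  });

  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'load-test-queue', // placeholder name
  });
  await worker.run();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```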
Now I am generating load with Grafana k6 and load-testing this workflow.
In total, 2105 executions were spawned. Of those, 1744 succeeded and the rest failed.
Time taken:
- 820 executions completed in under 1 minute
- 333 executions took 1 to 2 minutes
- 304 executions took 2 to 3 minutes
- 249 executions took 3 to 4 minutes
- 38 executions took more than 4 minutes
Reasons for failure (I have set workflowTaskTimeout and the activity startToCloseTimeout to 1 minute; a sketch of where these are configured follows this list):
WorkflowTaskTimedOut
ActivityTaskTimedOut
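For context, these timeouts are configured roughly as follows (a minimal sketch; names like `step1`, `myWorkflow`, and `load-test-queue` are placeholders for our actual code). The activity timeout is set in the workflow:

```typescript
// workflows.ts: the activity startToCloseTimeout is set via proxyActivities
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { step1 } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

export async function myWorkflow(): Promise<void> {
  await step1();
}
```

The workflow task timeout is a start option on the client side (in our case, the service that k6 drives):

```typescript
// starter.ts: workflowTaskTimeout is a WorkflowOptions field at start time
import { Client } from '@temporalio/client';

export async function startOne(client: Client): Promise<void> {
  await client.workflow.start('myWorkflow', {
    taskQueue: 'load-test-queue', // placeholder name
    workflowId: `wf-${Date.now()}`,
    workflowTaskTimeout: '1 minute',
  });
}
```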
We have only one task queue, from which the Worker polls.
I am using the default Worker options; I have not changed anything.
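If it helps, these are the WorkerOptions knobs I understand are relevant here. I have not set any of them yet, so the defaults apply; the option names are real TypeScript SDK fields, but the values below are only illustrative:

```typescript
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

export async function runWorker() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'load-test-queue', // placeholder name
    // Omitted today, so SDK defaults apply. Raising these lets one worker
    // execute more tasks in parallel, at the cost of CPU and memory.
    maxConcurrentActivityTaskExecutions: 200,
    maxConcurrentWorkflowTaskExecutions: 80,
  });
  await worker.run();
}
```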
Observations from the Temporal UI suggest that in the workflows that take a long time, there are large delays between transitions such as ActivityTaskScheduled → ActivityTaskStarted and WorkflowTaskScheduled → WorkflowTaskStarted.
My Questions:
Why is it taking so long? All executions should have completed successfully in under 1 minute.
How do I fix the workflow task and activity task timeouts?
What changes should I make to the Worker options?
Should I make any changes to the task queue? Could the task queue getting flooded be the reason for the slow executions?
Can you describe your use case for the load test? Basically, what do the workflows you are running do? Is it indicative of, or close to, the real use case you intend to run on your cluster?
Persistence latency seems pretty high; it should typically be in the 100s of ms. You might need to look at the size of your DB.
Sync match rate: it looks like we will need to look at your workers next, after we are done with the server side.
Workflow lock contention seems pretty high, as in your single executions seem to have a lot of updates (signals? async activities/child workflows completing at the same time? timers firing at the same time? a combination of all of these?).
Shard lock contention: what numHistoryShards have you set in the static config? It seems we need to try to increase that. How many history hosts are you running? I would first look into this and the persistence latencies and try to reduce those, then move on to the next steps.
(Note: if you change numHistoryShards you need to stand up a new cluster, including the persistence store, since the data needs to be re-indexed.)
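For reference, assuming a Helm-based deployment, the setting lives in the server's static config, roughly like this (the value shown is only an example):

```yaml
# values.yaml (Helm chart) — numHistoryShards is fixed at cluster creation time
server:
  config:
    numHistoryShards: 512
```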
Resource exhausted on BusyWorkflow: this is related to your high workflow lock contention, meaning you are probably starting too many activities/child workflows from a single workflow execution, or your activities might all be heartbeating at a very high rate. The typical recommendation is to start fewer activities/child workflows concurrently, or, if the issue is heartbeats, to raise the heartbeat timeout by some small value and test again.
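To illustrate, here is a minimal sketch of the heartbeat knobs in the TypeScript SDK. `heartbeatTimeout` and `heartbeat()` are real SDK APIs; `longStep`, `processItem`, and the 15-second value are only placeholders:

```typescript
// workflows.ts: heartbeatTimeout is set alongside startToCloseTimeout
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { longStep } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
  heartbeatTimeout: '15 seconds',
});
```

```typescript
// activities.ts: heartbeat from inside the activity implementation
import { heartbeat } from '@temporalio/activity';

export async function longStep(items: string[]): Promise<void> {
  for (const item of items) {
    await processItem(item); // hypothetical per-item work
    heartbeat(); // the SDK throttles these before they reach the server
  }
}

// hypothetical stand-in for the real per-item work
async function processItem(_item: string): Promise<void> {}
```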
Our use case for the load test is to see whether similar dynamic workflows (with different and multiple activities) created by end users will execute efficiently at that load. We won't have dedicated workers for a single workflow; I expect multiple workflows to be executed by a single worker. We will have M workflows, each of which can have N executions. I want to generate maximum traffic with the load test so I can check how Temporal holds up when multiple users are executing multiple workflows.
Here's an observation that might help:
When I run the load test, the initial executions complete within a few seconds (3-4 seconds), but subsequent executions take much longer (maybe the workers are busy).
If I am not wrong, are you talking about the persistence Postgres DB instance where the temporal and temporal_visibility databases are created? If so, please check the following screenshot:
We don't have child workflows or signals as of now, but we do have a number of async activities inside a workflow.
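To make that concrete, our "number of async activities" is essentially an unbounded Promise.all over activity calls. Here is a sketch of the bounded-batch variant suggested above (all names and the batch size are placeholders):

```typescript
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

const { processItem } = proxyActivities<typeof activities>({
  startToCloseTimeout: '1 minute',
});

// Instead of `await Promise.all(items.map(processItem))`, which schedules
// every activity at once, run them in fixed-size batches so a single
// execution never has too many activities in flight.
export async function batchedWorkflow(items: string[]): Promise<void> {
  const batchSize = 10; // illustrative cap
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    await Promise.all(batch.map((item) => processItem(item)));
  }
}
```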
We have numHistoryShards set to 512.
Here is our services configuration:
- History: 5 replicas, CPU: 2000m, Memory: 8Gi
- Worker: 3 replicas, CPU: 500m, Memory: 1Gi
- Frontend: 3 replicas, CPU: 500m, Memory: 1Gi
- Matching: 3 replicas, CPU: 500m, Memory: 1Gi
We are not using child workflows, and we are not yet using activity heartbeats, as I expect the activities to complete in 2-3 seconds.
I have checked the minimal-values.yaml file, in which I found the configurations below for my persistence Postgres databases, temporal and temporal_visibility. Could any of this be responsible for the slow performance? Should I increase anything?