Why is there high latency when starting a workflow?

Hi everyone!

I am currently doing stress testing on our application, which uses Temporal for micro-service orchestration.

I have an application with one workflow that contains one activity and one async child workflow (which itself contains one activity).
My worker is configured with these options:

worker.SetStickyWorkflowCacheSize(50000 * 1000 * 10)
workerOptions := worker.Options{
		MaxConcurrentWorkflowTaskPollers:       100,
		MaxConcurrentActivityExecutionSize:     1000,
		MaxConcurrentActivityTaskPollers:       100,
		MaxConcurrentWorkflowTaskExecutionSize: 1000,
}

The Temporal server is started as described there, with PostgreSQL.
I start 50 threads that send StartWorkflowExecution requests for 10 minutes, and I see high StartWorkflowExecution latency. As a result, request throughput is low.

I read the best practices and they didn't help.
Why is RPC latency high?

Can you check your server sync/match rate?

sum(rate(poll_success_sync{namespace="namespace_name"}[1m])) / sum(rate(poll_success{namespace="namespace_name"}[1m]))

and see if it's lower than 95%. If so, this could indicate you need more workers for the stress load.

Also check the number of persistence requests:
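For example, a query along these lines should work (persistence_requests is the server's persistence counter metric; verify the exact metric name and labels against your server version):

```promql
# Persistence request rate, broken down per operation
sum(rate(persistence_requests[1m])) by (operation)
```

A sustained spike here under load can point to the database being the bottleneck rather than the workers.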

@tihomir thanks for your reply!
Metrics look good except RPC latency.
(Screenshot 2022-09-18 at 18.15.26: metrics dashboard)

What can I do to reduce latency and find the bottleneck?

Seems the latencies revolve around starting your workflow executions and responding to task completions. In your load test how many workflows/activities are you starting simultaneously? How many worker processes do you have?

Couple of things to check:

Check service_errors_resource_exhausted to see if you are hitting any rps limits (also look at your frontend logs).
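A query like the following is one way to do that (the label set varies by server version — newer versions also expose a resource_exhausted_cause label, so adjust as needed):

```promql
# Rate of resource-exhausted errors, broken down per operation
sum(rate(service_errors_resource_exhausted[1m])) by (operation)
```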

Check the task_latency_userlatency metric, which captures latencies introduced by your app code; if high, it can cause workflow lock contention (this can happen if you start a large number of activities or workflows / child workflows simultaneously).
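Following the same pattern as the other histogram queries in this thread, something like this should give you a percentile view (the namespace label value is a placeholder):

```promql
# p95 of user-introduced task latency, per operation
histogram_quantile(0.95, sum(rate(task_latency_userlatency_bucket{namespace="namespace_name"}[1m])) by (le, operation))
```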

Along with userlatency, check cache latency, which if high can also indicate you might be starting too many workflows/activities simultaneously:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

Finally could also check if you may need to increase the number of shards:
histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))
You would need to rebuild your cluster (and wipe its data) each time you change the number of shards.
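For reference, the shard count is set via numHistoryShards in the static server configuration, under the persistence section (the value below is purely illustrative — size it for your expected load, since it cannot be changed later without rebuilding the cluster):

```yaml
# Static Temporal server config (fragment)
persistence:
  numHistoryShards: 512
```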