We have a large system that we have been building on Temporal over the last 5 months. Initial scale tests were fine, but now, as we scale out, we seem to be hitting bottlenecks, and I feel like I have tried everything. I have been through the "How to tune Workers" guide with a fine-tooth comb.

Our setup:

- We have three workflows, all sharing one task queue. Was that a mistake? Are there limits to task queue throughput?
- Most of the time a workflow executes quickly, with about 5-6 activities. Maybe 20% of the time there is a lot of work to do, so there can be up to 6K activities; if it is really big, it spawns child workflows with up to 6K activities each. All of these workflows end up sending signals to our actor-style workflow, which processes each signal with one extra activity and only loops through 100 activities before doing ContinueAsNew (a rough sketch of this actor workflow is at the end of the post).
- I know the DB is normally the bottleneck. We have a large AWS RDS Aurora DB (db.r6g.8xlarge) and could only push it to 50-60% utilization; on smaller instances we could spike to 80% when needed. We upgraded it to db.r6g.12xlarge and now it won't break 35-40%. That helped, but not as much as we expected. The DB does not seem to be the bottleneck, although write IOPS are high.
- We have 512 `numHistoryShards`. We are now trying more to see if this helps, but it is harder to test.
- We have the default `MaxConcurrent*` values. We tried increasing them, but it didn't help. We didn't have Temporal metrics wired up at first, but now we can see that most of the time we have 95%+ `worker_task_slots_available` (the relevant worker options are shown in the sketch at the end of the post).
- `MaxConcurrentWorkflowTaskPollers`/`MaxConcurrentActivityTaskPollers` helped up to a point, but setting them too high had an adverse effect.
- Our `sticky_cache_size` is the default. We have tried higher values, but the cache hit ratio is not that high and it didn't help throughput.
- Our `workflow_task_schedule_to_start_latency` and `activity_schedule_to_start_latency` just keep increasing. The new database made them stable, but they are still very high.
- Request latency is also high.
- Poll success rate = (`poll_success` + `poll_success_sync`) / (`poll_success` + `poll_success_sync` + `poll_timeouts`) is 99%+.
- We have tried bigger and more instances of the temporal-history, temporal-matching, and temporal-frontend pods. They are not hurting for resources.
- We tried setting `matching.numTaskqueueReadPartitions`/`matching.numTaskqueueWritePartitions` in dynamic config, going from the default 4 to 10 and then to 20. This didn't seem to help (dynamic config snippet at the end of the post).
- Overall we are seeing maybe 300 activities per second. Our workers are not at all saturated (around 20% CPU), but Temporal cannot seem to feed them work any faster.
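
To make the workload shape concrete, here is a minimal sketch of the actor-style workflow described above, assuming the Go SDK. Every name in it (`ActorWorkflow`, `WorkSignal`, `ProcessSignalActivity`, the `work-signal` channel) is a placeholder, not our real code:

```go
package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// Placeholder types standing in for our real ones.
type ActorState struct{}
type WorkSignal struct{ Payload string }

// ProcessSignalActivity stands in for the single activity we run per signal.
func ProcessSignalActivity(ctx context.Context, sig WorkSignal) error {
	return nil
}

// ActorWorkflow sketches the actor-style workflow: it drains a signal channel,
// runs one activity per signal, and continues-as-new after 100 activities so
// the event history stays bounded.
func ActorWorkflow(ctx workflow.Context, state ActorState) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	sigCh := workflow.GetSignalChannel(ctx, "work-signal")
	for processed := 0; processed < 100; processed++ {
		var sig WorkSignal
		sigCh.Receive(ctx, &sig)
		if err := workflow.ExecuteActivity(ctx, ProcessSignalActivity, sig).Get(ctx, nil); err != nil {
			return err
		}
	}

	// NOTE: production code should drain any buffered signals before rolling over.
	return workflow.NewContinueAsNewError(ctx, ActorWorkflow, state)
}
```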
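
And this is roughly where the worker-side knobs mentioned above live, again assuming the Go SDK (the option names match `worker.Options` there). The values and the endpoint are illustrative, not our production settings; all three workflows and their activities are registered on the single task queue:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Endpoint is illustrative.
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Process-wide sticky workflow cache (the sticky_cache_size knob);
	// must be set before any worker is started.
	worker.SetStickyWorkflowCacheSize(4096)

	w := worker.New(c, "main-task-queue", worker.Options{
		// Execution slots -- worker_task_slots_available reports how many
		// of these are free.
		MaxConcurrentWorkflowTaskExecutionSize: 512,
		MaxConcurrentActivityExecutionSize:     1024,
		// Poller goroutines -- raising these helped us only up to a point.
		MaxConcurrentWorkflowTaskPollers: 8,
		MaxConcurrentActivityTaskPollers: 16,
	})

	// All three workflows and their activities are registered on this one
	// task queue, e.g. w.RegisterWorkflow(ActorWorkflow).

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited", err)
	}
}
```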
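
Finally, for completeness, this is roughly how we set the partition counts in the server's dynamic config file (illustrative values; we changed read and write partitions together):

```yaml
# dynamicconfig for the matching service (illustrative values)
matching.numTaskqueueReadPartitions:
  - value: 20
    constraints: {}
matching.numTaskqueueWritePartitions:
  - value: 20
    constraints: {}
```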