Temporal seems to hit scale wall

We have a large system we have been building in temporal over the last 5 months. Initial scale tests were ok but now as we scale out we seem to be hitting some bottlenecks… and I feel like I have tried everything. I have been through the How to tune workers with a fine tooth comb.

  • Our setup:

  • We have three workflows, with one task queue. Was that a mistake? Are the limits to taskqueue throughtput?

  • Most of the time one workflow executes quickly, with about 5-6 activities. Maybe 20% of the time there might be a lot of work to do so there could be up to 6K activities. If it is really big is spawns Child workflows with up to 6k activities each. All of these workflows endup sending signals to our actor style workflow, which processes them with 1 extra activity per signal. This only loops though 100 activities before doing ContinueAsNew.

  • I know DB is normally a bottleneck. We have a large AWS RDS Aurora DB (db.r6g.8xlarge), and can only seem to push it to 50-60%. On smaller instances we could spike to 80% when needed. We increased it to db.r6g.12xlarge. Now it wont break 35-40%, this has helped but not as much as we thought. The DB doesnt seem to be the bottleneck but there is a high amount of write IOPS.

  • We have 512 numHistoryShards. We are now trying more to see if this helps but this is harder to test.

  • We have the default MaxConcurrent* values. We tried increasing them but it didn’t help. We didn’t have temporal metrics wired up at first but now we can see that most of the time we have 95%+ worker_task_slots_available .

  • MaxConcurrentWorkflowTaskPollers / MaxConcurrentActivityTaskPollers helped to a point but if it went too high it had an adverse effect.

  • Our sticky_cache_size is the default, we have tried higher values but the cache hit ratio is not that high and didn’t help throughput.

  • Our workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency just keep increasing. The new database made it stable but it is still very high.

  • Request latency also is high.

  • Poll Success Rate = ( poll_success + poll_success_sync ) / ( poll_success + poll_success_sync + poll_timeouts ) is 99%+.

  • We have tried bigger and more instances of temporal-history, temporal-matching, temporal-frontend pods. They are not hurting for resources.

  • We tried setting matching.numTaskqueueReadPartitions / matching.numTaskqueueWritePartitions in dynamic config from the default 4 to 10 and then we tried 20. This didn’t seem to help.

  • Overall we are seeing maybe 300 activities per second. Our workes are not at all saturated (20% CPU) but Temporal cannot seem to feed them work any faster.


We faced a similar issue and we switched to Cassandra instead of RDS and we got like a 30% increase in throughput, also make sure that Cassandra nodes have enough CPUs, another tunning was giving a lot of memory and CPU to each instance of history service, in our case, each instance is using 3 CPU/12GB, frontend and matching service are using 2 CPU/2GB. We set up Kubernetes HA, and your bottleneck should become C*, depending on your flow implementations. Our current production deployment can process 1500 flows/sec (Testing flow), we are using 6 C* nodes, and with 12 C* nodes, we can process 4000 flows/sec.

Your poller configuration is key, while load testing you can try different numbers until finding the right balance.

Thanks, I was following this post as we are facing similar issues.

Couple of questions please

  1. How many activities per workflow in above case? We got 6 activities (all remote no local as of now) and not able to push a lot. But we are on RDS now. We could get an idea on activity throughput to compare against ours.

  2. And the number of workflows per second above represents start of the workflow (or) able to start and finish these many workflows per second?

  1. 6 - 12 activities
  2. Yes
  • We have 512 numHistoryShards . We are now trying more to see if this helps but this is harder to test.

You most likely have to increase history shard count, depending on your expected load.
Since numHistoryShards cannot be updated after a cluster is provisioned would recommend starting off with at least 4000.

True, we notice that any number above numHistoryShards: 8192, does not help much, which is the one we are using.