Temporal seems to be hitting a scaling wall

We have a large system that we have been building on Temporal over the last 5 months. Initial scale tests were fine, but now, as we scale out, we seem to be hitting bottlenecks, and I feel like I have tried everything. I have been through the How to tune Workers guide with a fine-tooth comb.

Our setup:

  • We have three workflows sharing one task queue. Was that a mistake? Are there limits to task queue throughput?

  • Most of the time a workflow executes quickly, with about 5-6 activities. Maybe 20% of the time there is a lot more work to do, so there could be up to 6K activities. If it is really big, it spawns child workflows with up to 6K activities each. All of these workflows end up sending signals to our actor-style workflow, which processes them with one extra activity per signal; it only loops through 100 activities before doing ContinueAsNew (a minimal sketch of this loop follows after this list).

  • I know the DB is normally the bottleneck. We have a large AWS RDS Aurora instance (db.r6g.8xlarge) and can only seem to push it to 50-60% CPU; on smaller instances we could spike it to 80% when needed. We upgraded to db.r6g.12xlarge and now it won't break 35-40%. This helped, but not as much as we expected. The DB doesn't seem to be the bottleneck, although there is a high amount of write IOPS.

  • We have 512 numHistoryShards. We are now trying higher values to see if this helps, but this is harder to test.

  • We have the default MaxConcurrent* values. We tried increasing them, but it didn't help. We didn't have Temporal metrics wired up at first, but now we can see that most of the time we have 95%+ worker_task_slots_available (the worker options these map to are sketched after this list).

  • Raising MaxConcurrentWorkflowTaskPollers / MaxConcurrentActivityTaskPollers helped to a point, but setting them too high had an adverse effect.

  • Our sticky_cache_size is the default; we have tried higher values, but the cache hit ratio is not that high and it didn't help throughput.

  • Our workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency just kept increasing. The new database made them stable, but they are still very high.

  • Request latency is also high.

  • Poll Success Rate = (poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts) is 99%+.

  • We have tried both bigger and more instances of the temporal-history, temporal-matching, and temporal-frontend pods. They are not starved for resources.

  • We tried setting matching.numTaskqueueReadPartitions / matching.numTaskqueueWritePartitions in dynamic config from the default of 4 to 10, and then to 20. This didn't seem to help (see the dynamic config sketch after this list).

  • Overall we are seeing maybe 300 activities per second. Our workers are not at all saturated (20% CPU), but Temporal cannot seem to feed them work any faster.
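For anyone less familiar with the pattern in the workflow bullet above, here is a minimal, hypothetical Go sketch of an actor-style workflow that runs one activity per received signal and calls ContinueAsNew after 100 iterations so event history stays bounded. The names (ActorWorkflow, ProcessSignal, the "work-signal" channel, batchLimit) are illustrative assumptions, not taken from our actual system.

```go
package app

import (
	"context"
	"time"

	"go.temporal.io/sdk/workflow"
)

// batchLimit is how many signals we process before rolling over with ContinueAsNew.
const batchLimit = 100

// ActorWorkflow drains signals and runs one activity per signal.
func ActorWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{StartToCloseTimeout: time.Minute}
	ctx = workflow.WithActivityOptions(ctx, ao)

	sigCh := workflow.GetSignalChannel(ctx, "work-signal")

	for processed := 0; processed < batchLimit; processed++ {
		var payload string
		sigCh.Receive(ctx, &payload) // blocks until a signal arrives

		// One extra activity per signal, as described above.
		if err := workflow.ExecuteActivity(ctx, ProcessSignal, payload).Get(ctx, nil); err != nil {
			return err
		}
	}

	// Start a fresh run so event history never grows past ~100 iterations.
	return workflow.NewContinueAsNewError(ctx, ActorWorkflow)
}

// ProcessSignal is a placeholder activity standing in for the real per-signal work.
func ProcessSignal(ctx context.Context, payload string) error {
	return nil
}
```

A production version would also drain any signals still buffered on the channel before returning the ContinueAsNew error, so none are lost across the run boundary.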
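On the worker tuning side, and assuming the Go SDK, the knobs mentioned above are fields on worker.Options (other SDKs expose equivalents under similar names). This is a hedged sketch of the kind of configuration we were experimenting with; "my-task-queue" and all of the numbers are placeholders, not recommendations.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Assumes a locally reachable Temporal frontend; adjust client.Options as needed.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Process-wide sticky workflow cache; only pays off if the cache hit ratio is high.
	worker.SetStickyWorkflowCacheSize(4096)

	w := worker.New(c, "my-task-queue", worker.Options{
		// Execution slots: how many activities / workflow tasks run concurrently in this process.
		MaxConcurrentActivityExecutionSize:     1000,
		MaxConcurrentWorkflowTaskExecutionSize: 500,

		// Pollers: how many long-poll requests this worker keeps open against matching.
		// Raising these helped us up to a point; going too high was counterproductive.
		MaxConcurrentActivityTaskPollers: 16,
		MaxConcurrentWorkflowTaskPollers: 8,
	})

	// RegisterWorkflow / RegisterActivity calls for the real workflows and activities go here.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```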
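For completeness, matching.numTaskqueueReadPartitions / matching.numTaskqueueWritePartitions are server-side dynamic config keys, not worker settings. A minimal sketch of the entries we changed is below; read and write partition counts should be kept in sync, and an empty constraints block applies the value to all task queues.

```yaml
# Temporal dynamic config file (location depends on your deployment).
matching.numTaskqueueReadPartitions:
  - value: 20
    constraints: {}
matching.numTaskqueueWritePartitions:
  - value: 20
    constraints: {}
```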


We faced a similar issue and switched from RDS to Cassandra, which gave us roughly a 30% increase in throughput. Make sure the Cassandra nodes have enough CPUs. Another tuning step was giving a lot of memory and CPU to each instance of the history service: in our case each history instance uses 3 CPU / 12 GB, while the frontend and matching services use 2 CPU / 2 GB each. We set this up on Kubernetes with HA, and at that point your bottleneck should become Cassandra itself, depending on your workflow implementations. Our current production deployment can process 1500 workflows/sec (a test workflow) with 6 Cassandra nodes, and with 12 nodes we can process 4000 workflows/sec.

Your poller configuration is key; while load testing, you can try different numbers until you find the right balance.

Thanks, I was following this post as we are facing similar issues.

A couple of questions, please:

  1. How many activities per workflow in the above case? We have 6 activities (all remote, no local activities as of now) and are not able to push a lot, but we are still on RDS. It would give us an idea of activity throughput to compare against ours.

  2. And does the number of workflows per second above represent just starting that many workflows, or being able to start and finish that many workflows per second?

  1. 6 - 12 activities
  2. Yes
  • We have 512 numHistoryShards. We are now trying more to see if this helps but this is harder to test.

You most likely have to increase the history shard count, depending on your expected load.
Since numHistoryShards cannot be updated after a cluster is provisioned, I would recommend starting off with at least 4000.
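For reference, numHistoryShards lives in the server's static persistence configuration (the exact location depends on how you deploy, e.g. Helm values versus a raw config file) and is fixed when the cluster is first provisioned. A minimal sketch using the suggested starting point:

```yaml
# Temporal server static config (other persistence settings omitted).
# numHistoryShards cannot be changed after the cluster is provisioned.
persistence:
  numHistoryShards: 4096
```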

True. We noticed that any number above numHistoryShards: 8192 does not help much; 8192 is what we are using.