Bottleneck when scaling Temporal server

Hello Community,

We recently built a cluster running temporal/server-1.24.3. We have been working on improving its performance, but we seem to have hit a wall, so I am reaching out to see whether the community can offer some suggestions. Thank you!

Computing & Storage Spec
We self-host our Temporal server in a k8s cluster provided by Alibaba Cloud.
Temporal Frontend: 8-core CPU & 16 GB memory * 3 pods
Temporal History (numHistoryShards = 512): 8-core CPU & 16 GB memory * 6 pods
Temporal Matching: 8-core CPU & 16 GB memory * 6 pods
Temporal Worker: 8-core CPU & 16 GB memory * 3 pods

We are using MySQL as the persistence store and Elasticsearch as the visibility store.
MySQL: 16-core CPU & 32 GB memory, 1 instance
Elasticsearch: 2-core CPU & 4 GB memory * 6 instances

The bottleneck we encountered
We run 100k workflows almost simultaneously from a single application. Each workflow has 7 activities, and every activity is just an RPC call that takes less than 1 second. We have 3 application instances acting as pollers and workers. Currently it takes 11 to 15 minutes to complete all 100k workflows, and we cannot squeeze out any further improvement, so we would appreciate your suggestions.
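For reference, the load driver does roughly the following. This is only a minimal sketch assuming the Go SDK; the host, task queue name, workflow name, and concurrency limit are placeholders rather than our exact code.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"}) // placeholder host
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Fan out 100k workflow starts with bounded concurrency so the starter
	// itself does not become the limiting factor.
	sem := make(chan struct{}, 500)
	for i := 0; i < 100_000; i++ {
		sem <- struct{}{}
		go func(i int) {
			defer func() { <-sem }()
			_, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
				ID:        fmt.Sprintf("load-test-%d", i),
				TaskQueue: "load-test-task-queue", // placeholder task queue
			}, "SevenActivityWorkflow") // placeholder workflow with 7 short RPC activities
			if err != nil {
				log.Println("start failed:", err)
			}
		}(i)
	}

	// Drain the semaphore to wait for all in-flight starts.
	for i := 0; i < cap(sem); i++ {
		sem <- struct{}{}
	}
}
```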

Here are the Grafana charts from a 100k-workflow run:

[Grafana screenshots]

We have realized that our schedule-to-start latency and sync-match rate may not be ideal. What might the root causes be? Thank you.
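For context on the worker side, poller counts and concurrent execution slots are the options that most directly influence sync-match rate and schedule-to-start latency. Below is a minimal sketch assuming the Go SDK; the values are illustrative, not our production settings.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"}) // placeholder host
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// More pollers give matching a better chance to sync-match tasks instead
	// of backing them up; more execution slots let the short RPC activities
	// run in parallel on each worker pod.
	w := worker.New(c, "load-test-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers:       16, // illustrative values
		MaxConcurrentActivityTaskPollers:       16,
		MaxConcurrentWorkflowTaskExecutionSize: 512,
		MaxConcurrentActivityExecutionSize:     512,
	})

	// w.RegisterWorkflow(...) and activity registrations omitted.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

The other SDKs expose equivalent worker options.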


What we plan to do
We are wondering whether numHistoryShards = 512 is too small for our workload. Should we increase it to 4k, 8k, or even larger? Thanks!

Hi community, we have changed numHistoryShards from 512 to 8192, but the state transition rate does not seem to improve significantly. Does anyone have experience with this?
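Since numHistoryShards is fixed when the cluster's persistence is first initialized, one sanity check is to confirm that the running cluster really reports 8192 shards. Below is a rough Go sketch using the frontend's GetClusterInfo call (the host is a placeholder, and this assumes the Go SDK's raw WorkflowService client).

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"}) // placeholder host
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// The shard count stored in cluster metadata is what the cluster actually
	// runs with, regardless of what the config file currently says.
	resp, err := c.WorkflowService().GetClusterInfo(context.Background(),
		&workflowservice.GetClusterInfoRequest{})
	if err != nil {
		log.Fatalln("GetClusterInfo failed:", err)
	}
	log.Println("history shard count:", resp.GetHistoryShardCount())
}
```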