Bottleneck when scaling Temporal server

Hello Community,

We recently built a cluster running temporal/server-1.24.3. We have been working on improving its performance, but we seem to have hit a wall, so I am reaching out to see whether the community can offer some suggestions. Thank you!

Computing & Storage Spec
We self-host our Temporal server in a k8s cluster provided by Alibaba Cloud.
Temporal Frontend: 8-core CPU & 16 GB memory * 3 pods
Temporal History (numHistoryShards = 512): 8-core CPU & 16 GB memory * 6 pods
Temporal Matching: 8-core CPU & 16 GB memory * 6 pods
Temporal Worker: 8-core CPU & 16 GB memory * 3 pods

We are using MySQL as the persistence store and Elasticsearch as the visibility store.
MySQL: 16-core CPU & 32 GB memory, 1 instance
Elasticsearch: 2-core CPU & 4 GB memory * 6 instances

The bottleneck we encountered
We run 100k workflows almost simultaneously from a single application. Each workflow has 7 activities, and every activity is just an RPC call that takes less than 1 second. We have 3 application instances acting as pollers and workers. Currently it takes 11 to 15 minutes to complete all 100k workflows, and we cannot squeeze out any further improvement, so we would appreciate your suggestions.
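For reference, the load driver does roughly the following. This is only a minimal sketch assuming the Go SDK; the host, task queue name, workflow name, and concurrency limit are placeholders rather than our exact code.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"}) // placeholder host
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Fan out 100k workflow starts with bounded concurrency so the starter
	// itself does not become the limiting factor.
	sem := make(chan struct{}, 500)
	for i := 0; i < 100_000; i++ {
		sem <- struct{}{}
		go func(i int) {
			defer func() { <-sem }()
			_, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
				ID:        fmt.Sprintf("load-test-%d", i),
				TaskQueue: "load-test-task-queue", // placeholder task queue
			}, "SevenActivityWorkflow") // placeholder workflow with 7 short RPC activities
			if err != nil {
				log.Println("start failed:", err)
			}
		}(i)
	}

	// Drain the semaphore to wait for all in-flight starts.
	for i := 0; i < cap(sem); i++ {
		sem <- struct{}{}
	}
}
```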

Here are the Grafana charts from a 100k-workflow run:

[Grafana screenshots]

We have realized that our schedule-to-start latency and sync-match rate may not be ideal. What might the root causes be? Thank you.
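For context on the worker side, poller counts and concurrent execution slots are the options that most directly influence sync-match rate and schedule-to-start latency. Below is a minimal sketch assuming the Go SDK; the values are illustrative, not our production settings.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"}) // placeholder host
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// More pollers give matching a better chance to sync-match tasks instead
	// of backing them up; more execution slots let the short RPC activities
	// run in parallel on each worker pod.
	w := worker.New(c, "load-test-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers:       16, // illustrative values
		MaxConcurrentActivityTaskPollers:       16,
		MaxConcurrentWorkflowTaskExecutionSize: 512,
		MaxConcurrentActivityExecutionSize:     512,
	})

	// w.RegisterWorkflow(...) and activity registrations omitted.

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

The other SDKs expose equivalent worker options.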


What we plan to do
We are wondering whether numHistoryShards = 512 is too small for our workload. Should we increase it to 4k, 8k, or even larger? Thanks!

Hi community, we have changed numHistoryShards from 512 to 8192, but the state transition rate does not seem to improve significantly. Does anyone have experience with this?
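Since numHistoryShards is fixed when the cluster's persistence is first initialized, one sanity check is to confirm that the running cluster really reports 8192 shards. Below is a rough Go sketch using the frontend's GetClusterInfo call (the host is a placeholder, and this assumes the Go SDK's raw WorkflowService client).

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"}) // placeholder host
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// The shard count stored in cluster metadata is what the cluster actually
	// runs with, regardless of what the config file currently says.
	resp, err := c.WorkflowService().GetClusterInfo(context.Background(),
		&workflowservice.GetClusterInfoRequest{})
	if err != nil {
		log.Fatalln("GetClusterInfo failed:", err)
	}
	log.Println("history shard count:", resp.GetHistoryShardCount())
}
```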