Hello Community,
We recently built a cluster running temporal/server-1.24.3 and have been working on improving its performance, but we seem to have hit a wall, so I am reaching out to see whether the community has any suggestions. Thank you!
Computing & Storage Spec
We self-host our Temporal server in a Kubernetes cluster provided by Alibaba Cloud.
Temporal Frontend: 8-core CPU & 16 GB memory × 3 pods
Temporal History (numHistoryShards = 512): 8-core CPU & 16 GB memory × 6 pods
Temporal Matching: 8-core CPU & 16 GB memory × 6 pods
Temporal Worker: 8-core CPU & 16 GB memory × 3 pods
We use MySQL as the persistence store and Elasticsearch as the visibility store.
MySQL: 16-core CPU & 32 GB memory, 1 instance
Elasticsearch: 2-core CPU & 4 GB memory × 6 instances
Bottleneck we encounter
We start 100k workflows at almost the same time from 1 application. Each workflow has 7 activities, and every activity is just an RPC that takes less than 1 second. We have 3 application instances acting as pollers and workers. Currently it takes 11~15 minutes to complete all 100k workflows, and we cannot squeeze out any further improvement, so we would appreciate your suggestions.
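For context, the workflow shape is roughly what the sketch below shows (a minimal Go SDK sketch with placeholder names; our real activities just call downstream services):

```go
package main

import (
	"context"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// OrderWorkflow is a placeholder for our real workflow: 7 activities
// executed in sequence, each one a short RPC that finishes in < 1 second.
func OrderWorkflow(ctx workflow.Context, input string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Second,
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 3},
	})

	steps := []string{"Step1", "Step2", "Step3", "Step4", "Step5", "Step6", "Step7"}
	for _, step := range steps {
		if err := workflow.ExecuteActivity(ctx, step, input).Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}

// Step1 (and Step2 .. Step7, omitted here) is just a single downstream RPC.
func Step1(ctx context.Context, input string) error {
	// call the downstream service; typically completes in well under 1 second
	return nil
}
```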
Here is the Grafana dashboard from a 100k-workflow run.
We have noticed that our schedule-to-start latency and sync-match rate may not be ideal. What might the root causes be? Thank you.
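In case it helps frame suggestions, the kind of worker-side tuning we have been looking at is sketched below (Go SDK; the frontend address, task queue name, and numbers are illustrative only, not our exact configuration):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the cluster frontend (address is illustrative).
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Poller counts and execution-slot limits are the worker-side knobs that
	// most directly affect schedule-to-start latency and sync-match rate;
	// the values below are only an example of what we are experimenting with.
	w := worker.New(c, "order-task-queue", worker.Options{
		MaxConcurrentWorkflowTaskPollers:       16,
		MaxConcurrentActivityTaskPollers:       16,
		MaxConcurrentWorkflowTaskExecutionSize: 1000,
		MaxConcurrentActivityExecutionSize:     1000,
	})

	// OrderWorkflow and Step1 are the placeholders from the sketch above.
	w.RegisterWorkflow(OrderWorkflow)
	w.RegisterActivity(Step1)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```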
What we plan to do
We are wondering whether numHistoryShards = 512 is too small for our workload. Should we raise it to 4k, 8k, or even higher? Thanks!