Improving Temporal cluster performance

vishal · November 18, 2022, 6:50am

Hi team, we were trying to stress test our Temporal cluster with Maru. I followed the author’s advice over here. While running 40k workflows, we observed that our cluster experienced a blockage.

For context, this is our cluster setup. We’re running our Temporal cluster on AWS EKS on 2 node groups.
AWS EKS:
Node Group 1: 5 x c5n.xlarge(4 vCPUs, 10 GB ram) => Exclusively for history pods.
Node Group 2: 8 x t3a.medium(2 vCPUs, 4 GB ram) => For all other pods(Temporal + Prometheus + Grafana + Logging).
History Shards => 8192
History => 5 replicas
Matching => 4 replicas
Frontend => 4 replicas
Worker => 2 replicas
Scylla Cluster: 3 x i3en.xlarge => Yielded 20k write IOPS per node.

Throughout the tests, none of our EC2 instances running Temporal services saw CPU usage above 50%.

For the bench test, we were running 8 x t3.large instances (2 vCPUs, 8 GB RAM), and each instance had a worker running on it.

We are looking to understand the bottlenecks in our cluster and improve the throughput of completed workflows. Could someone share insights on which metrics we should look at to increase our completed workflow rate?

Thank you.

tihomir · November 21, 2022, 1:57pm

What does the memory utilization look like on your history pods? Shards are distributed across your history pods, and you typically don’t want more than 1K shards per history host, so I would increase history replicas if possible.
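To make the shard arithmetic concrete, here is a minimal sketch using the numbers from the original post (8192 shards, 5 history replicas). It assumes shards spread evenly across history hosts and reads "1K" as 1000. The function names are only illustrative and are not part of any Temporal API.

```python
import math


def shards_per_host(num_shards: int, history_replicas: int) -> float:
    """Average shards per history host, assuming an even distribution."""
    return num_shards / history_replicas


def min_history_replicas(num_shards: int, max_shards_per_host: int = 1000) -> int:
    """Smallest replica count that keeps each host at or under the limit.

    The default of 1000 is an assumption about what "1K" means; pass 1024
    if you read it that way.
    """
    return math.ceil(num_shards / max_shards_per_host)


if __name__ == "__main__":
    print(shards_per_host(8192, 5))            # 1638.4
    print(min_history_replicas(8192))          # 9
    print(min_history_replicas(8192, 1024))    # 8
```

With 5 history replicas, each host owns roughly 1638 shards, which is well above the 1K guideline. Getting under it would take about 8 or 9 history replicas, depending on how you read "1K".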

For server metrics to look into, I would start with the info in this forum post. Note that you also need to look at SDK metrics (see forum post here for more info), because performance tuning involves both your server and the workers that you deploy to run your code.

Hope this gets you started in the right direction.
