Improving Temporal cluster performance

Hi team, we have been stress testing our Temporal cluster with Maru, following the author’s advice here. We observed that while running 40k workflows, our cluster experienced blockage.

For context, this is our cluster setup. We’re running our Temporal cluster on AWS EKS on 2 node groups.
Node Group 1: 5 x c5n.xlarge (4 vCPUs, 10 GB RAM) => exclusively for history pods.
Node Group 2: 8 x t3a.medium (2 vCPUs, 4 GB RAM) => for all other pods (Temporal + Prometheus + Grafana + logging).
History shards => 8192
History => 5 replicas
Matching => 4 replicas
Frontend => 4 replicas
Worker => 2 replicas
Scylla cluster: 3 x i3en.xlarge => yielded 20k write IOPS per node.

Throughout the tests, none of the EC2 instances running Temporal services exceeded 50% CPU usage.

For the bench test, we ran 8 x t3.large instances (2 vCPUs, 8 GB RAM), each with one worker running on it.

We are looking to understand the bottlenecks in our cluster and improve completed-workflow throughput. Could someone share insights on which metrics we can use, and how to interpret them, to increase our workflow completion rate?

Thank you.

What does memory utilization look like on your history pods? Shards are distributed across your history hosts, and you typically don’t want more than 1K shards per host, so I would increase replicas if possible.
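The shard math above can be sketched quickly. This is a minimal stdlib-only sketch using the numbers from the original post (8192 shards, 5 history replicas) and the ~1K-shards-per-host rule of thumb; the exact distribution in a live cluster may vary slightly as shards rebalance.

```python
import math

# Figures from the post above.
shards = 8192
history_replicas = 5

# Shards are spread roughly evenly across history hosts.
shards_per_host = shards / history_replicas
print(shards_per_host)  # -> 1638.4, well above the ~1K guideline

# Replicas needed to stay at or under ~1,000 shards per host.
target_per_host = 1000
needed_replicas = math.ceil(shards / target_per_host)
print(needed_replicas)  # -> 9
```

So with 8192 shards, roughly 9 history replicas would bring the per-host shard count under the suggested ceiling.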

For server metrics to look into, I would start with the info in this forum post. Note that you also need to look at SDK metrics (see the forum post here for more info), because performance tuning involves both your server and the workers on which you deploy and run your code.

Hope this gets you started in the right direction.