Hi team, we've been stress testing our Temporal cluster with Maru, following the author’s advice over here. While running 40k workflows, we observed that our cluster's throughput stalled.
For context, here is our cluster setup. We’re running our Temporal cluster on AWS EKS across 2 node groups.
Node Group 1: 5 x c5n.xlarge (4 vCPUs, 10.5 GB RAM) => Exclusively for history pods.
Node Group 2: 8 x t3a.medium (2 vCPUs, 4 GB RAM) => For all other pods (Temporal + Prometheus + Grafana + Logging).
History Shards => 8192
History => 5 replicas
Matching => 4 replicas
Frontend => 4 replicas
Worker => 2 replicas
Scylla Cluster: 3 x i3en.xlarge => Yielded 20k write IOPS per node.
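For completeness, the shard count above is set in the Temporal server’s static config via the standard `numHistoryShards` key. A minimal sketch of the relevant part of our values (other persistence settings omitted):

```
persistence:
  numHistoryShards: 8192
```

Note this value cannot be changed after the cluster is first provisioned, so it’s fixed for all of our test runs.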
Throughout the tests, none of the EC2 instances running Temporal services exceeded 50% CPU usage.
For the bench test, we ran 8 x t3.large instances (2 vCPUs, 8 GB RAM), each with a bench worker running on it.
We’d like to understand the bottlenecks in our cluster and improve workflow completion throughput. Could someone share insights on which metrics we should watch (and how to interpret them) to increase our completed-workflow rate?
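So far these are the kinds of Prometheus queries we’ve started watching (metric names are our best understanding of Temporal’s default server metrics; please correct us if we’ve picked the wrong ones):

```
# p95 persistence latency by operation — history shards hitting Scylla
histogram_quantile(0.95, sum(rate(persistence_latency_bucket[1m])) by (operation, le))

# Sync match rate on matching — fraction of tasks dispatched directly to pollers
sum(rate(poll_success_sync[1m])) / sum(rate(poll_success[1m]))

# p95 gRPC request latency at the frontend, per operation
histogram_quantile(0.95, sum(rate(service_latency_bucket[1m])) by (operation, le))
```

Are these the right signals for spotting the bottleneck, or are there other server/SDK metrics we should be looking at first?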