Tuning Temporal setup for better performance

Our team is currently evaluating whether Temporal is suitable for our needs, and I've run some simple load tests in our test environment.
We're using Temporal version 0.28, deployed from the Helm chart to GKE on n1-standard hosts.

To simulate load, I use a slightly modified HelloActivity from the java-samples repository. I push workflow executions at different rates and monitor how much load Temporal can handle without an increase in latency.

With the default setup from the Helm chart, I got a pretty stable rate of 20 workflow executions per second. After adding extra memory to the Cassandra nodes and scaling up the Temporal services, I managed to reach 50 executions per second, but no more than that. Scaling Cassandra to 6 nodes had no visible effect.

Can you please give some advice on how to improve performance further? Which Cassandra setup should I use? How do I look for possible bottlenecks? Is it possible to tune the Temporal services somehow?

Thanks!

The first expected bottleneck is the throughput of a single task queue. Make sure that the task queues used by your example have enough partitions, as they are not autoscaled yet. Partition counts are configured through the dynamic config; a sketch follows below.
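For illustration, a dynamic-config entry might look roughly like this. Treat it as a hedged sketch: the exact key and constraint names vary across Temporal versions (older releases use "tasklist" naming, newer ones "taskqueue"), and the task list name and partition counts below are made up, so check the dynamicconfig files shipped with your release.

```yaml
# Hypothetical dynamic config entries; key/constraint names differ between
# Temporal versions, so verify against your release before using.
matching.numTasklistReadPartitions:
  - value: 8
    constraints:
      taskListName: "HelloActivityTaskList"
matching.numTasklistWritePartitions:
  - value: 8  # keep write partitions <= read partitions
    constraints:
      taskListName: "HelloActivityTaskList"
```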

Also, the Java client might need the number of polling threads adjusted to increase throughput. For the workflow task list, adjust WorkerFactoryOptions.workflowHostLocalPollThreadCount, and for the activity task list, adjust WorkerOptions.activityPollThreadCount.

@maxim
Q1: For Go clients, do we need to adjust the poller count?
I can see the following poller configurations: MaxConcurrentActivityTaskPollers and MaxConcurrentDecisionTaskPollers. The default value for both is 2.

Q2: What would you recommend for each of these at 50, 100, 200, 300, 400, 500 tps?

Q3: Any drawbacks of keeping this number very high?

Q4: What should the task list partition count be? Somewhere I read you recommending 15 partitions for 1k tps.

Q5: Any drawbacks of keeping task list partition count high?

Q6: Any other factor we may need to tweak for perf using Go client?

The desired poller count depends on the latency from the worker to the service: the higher the latency, the lower the throughput of a single poller thread. If the number of worker processes is small, you can try increasing the poller count to 5-10 pollers.
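For concreteness, setting these in the Go client looks roughly like the sketch below. This assumes the pre-1.0 Go SDK (go.temporal.io/temporal); the task list name and poller counts are made up, and newer SDK versions rename "decision task" to "workflow task".

```go
package main

import (
	"log"

	"go.temporal.io/temporal/client"
	"go.temporal.io/temporal/worker"
)

func main() {
	// Hypothetical connection options; point HostPort at your frontend.
	c, err := client.NewClient(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	w := worker.New(c, "HelloActivityTaskList", worker.Options{
		// Both default to 2. With few worker processes and noticeable
		// worker-to-service latency, 5-10 pollers can raise throughput.
		MaxConcurrentActivityTaskPollers: 8,
		MaxConcurrentDecisionTaskPollers: 8,
	})

	// Register your workflows and activities here, then start polling.
	if err := w.Run(); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```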

Q3: Any drawbacks of keeping this number very high?

A very high number will not increase performance, but it will put additional load on the service, especially if the number of worker processes is high.

Q4: What should the task list partition count be? Somewhere I read you recommending 15 partitions for 1k tps.

I would allocate one partition per 50-80 tasks per second, depending on the database. So 15-20 partitions for 1k tasks per second sounds reasonable.

Q5: Any drawbacks of keeping task list partition count high?

It can increase memory and CPU utilization of matching hosts.

Q6: Any other factor we may need to tweak for perf using Go client?

Make sure that the other worker options do not explicitly limit the worker's throughput; see the sketch below.
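As a hedged illustration of what to audit, these are the throughput-related fields on worker.Options in the pre-1.0 Go SDK. All values are placeholders, not recommendations, and the field names may have been renamed in your SDK version.

```go
package main

import "go.temporal.io/temporal/worker"

// optionsSketch makes the throughput-related knobs explicit. All values
// are illustrative; field names are from the pre-1.0 Go SDK and may have
// changed since (e.g. Decision -> Workflow).
func optionsSketch() worker.Options {
	return worker.Options{
		// Pollers (discussed above); both default to 2.
		MaxConcurrentActivityTaskPollers: 8,
		MaxConcurrentDecisionTaskPollers: 8,

		// Concurrency cap: if set too low, the worker throttles itself
		// even when the pollers deliver tasks fast enough.
		MaxConcurrentActivityExecutionSize: 1000,

		// Rate limits: the defaults are high enough to be effectively
		// unlimited, but any explicit value below your target tps will
		// cap throughput for this worker or for the whole task list.
		WorkerActivitiesPerSecond:   100000,
		TaskListActivitiesPerSecond: 100000,
	}
}
```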