Tuning Temporal setup for better performance

Our team is currently evaluating whether Temporal is suitable for our needs. I’ve performed some simple load testing on our test environment.
We’re using Temporal version 0.28, deployed from the Helm chart to GKE on n1-standard hosts.

To simulate load, I use a slightly modified HelloActivity from the java-samples repository. I push workflow executions at different rates and monitor how much Temporal can handle without an increase in latency.

With the default setup from the Helm chart, I got a fairly stable 20 workflow executions per second. After adding extra memory to the Cassandra nodes and scaling up the Temporal services, I managed to reach 50 executions per second, but no more than that. Scaling Cassandra to 6 nodes had no visible effect.

Can you please give some advice on how to improve performance further? Which Cassandra setup should I use? How can I look for possible bottlenecks? Is it possible to tune the Temporal services somehow?

Thanks!

The first expected bottleneck is single task queue throughput. Make sure the task queues used by your example have enough partitions, as they are not autoscaled yet. Partition counts are configured through the dynamic config.
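For reference, a dynamic config fragment along these lines raises the partition counts for one queue. The exact key names should be checked against the dynamic config reference for your Temporal version, and the namespace and task queue name here are made-up examples:

```yaml
# Hypothetical dynamic config fragment: 8 read and 8 write partitions
# for a single task queue. Read and write counts should normally match.
matching.numTaskqueueReadPartitions:
  - constraints:
      namespace: "default"
      taskQueueName: "HelloActivityTaskQueue"
    value: 8
matching.numTaskqueueWritePartitions:
  - constraints:
      namespace: "default"
      taskQueueName: "HelloActivityTaskQueue"
    value: 8
```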

Also, the Java client might need the number of polling threads adjusted to increase throughput. For the workflow task list, adjust WorkerFactoryOptions.workflowHostLocalPollThreadCount; for the activity task list, adjust WorkerOptions.activityPollThreadCount.
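As a sketch of where those two settings live (method names as of the SDK version discussed in this thread; later SDK releases renamed some of these, and the task queue name is a placeholder):

```java
import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class WorkerSetup {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newInstance();
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Workflow task polling is configured on the factory.
    WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
        .setWorkflowHostLocalPollThreadCount(10)
        .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // Activity task polling is configured per worker.
    WorkerOptions workerOptions = WorkerOptions.newBuilder()
        .setActivityPollThreadCount(10)
        .build();
    Worker worker = factory.newWorker("HelloActivityTaskQueue", workerOptions);
    // ... register workflow and activity implementations here, then:
    factory.start();
  }
}
```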


@maxim
Q1: For Go clients, do we need to adjust the poller count?
I can see the following poller configurations:

MaxConcurrentActivityTaskPollers
and
MaxConcurrentDecisionTaskPollers

The default value for each of these is 2.

Q2: What would you recommend for each of these at 50, 100, 200, 300, 400, 500 tps?

Q3: Any drawbacks of keeping this number very high?

Q4: What should the task list partition count be? Somewhere I read you recommending 15 partitions for 1k tps.

Q5: Any drawbacks of keeping task list partition count high?

Q6: Any other factor we may need to tweak for perf using Go client?

The desired poller count depends on the latency from the worker to the service. The higher the latency, the lower the throughput of a single poller thread. If the number of poller processes is small, you can try increasing the poller count to 5-10 pollers.
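In Go terms, that adjustment is a one-line change per option. This is a sketch against the SDK version this thread discusses (where both defaults are 2); field names differ in newer Go SDKs, and the task queue name is a placeholder:

```go
package main

import (
	"go.temporal.io/temporal/client"
	"go.temporal.io/temporal/worker"
)

func newWorker(c client.Client) worker.Worker {
	// Raise the poller counts from the default of 2 when worker-to-service
	// latency is high or schedule_to_start latencies start to grow.
	options := worker.Options{
		MaxConcurrentActivityTaskPollers: 10,
		MaxConcurrentDecisionTaskPollers: 10,
	}
	return worker.New(c, "HelloActivityTaskQueue", options)
}
```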

Q3: Any drawbacks of keeping this number very high?

A high number will not increase performance, but it will put additional load on the service, especially if the number of worker processes is high.

Q4: What should the task list partition count be? Somewhere I read you recommending 15 partitions for 1k tps.

I would allocate one partition per 50-80 tasks per second, depending on the DB. So 15-20 partitions for 1k tasks per second sounds reasonable.

Q5: Any drawbacks of keeping task list partition count high?

It can increase memory and CPU utilization of matching hosts.

Q6: Any other factor we may need to tweak for perf using Go client?

Make sure that other worker options do not limit the worker throughput explicitly.

Main parameters to tune

Defining polling of tasks from the server:

WorkerOptions#workflowPollThreadCount

WorkerOptions#activityPollThreadCount

WorkerOptions#maxConcurrentWorkflowTaskExecutionSize

WorkerOptions#maxConcurrentActivityExecutionSize

Defining the in-memory cached workflow state and threads:

WorkerFactoryOptions#maxWorkflowThreadCount

WorkerFactoryOptions#workflowCacheSize

Some reasonable constraints on these values

  1. WorkerFactoryOptions#workflowCacheSize ≤ WorkerFactoryOptions#maxWorkflowThreadCount. Having a cache larger than the size of the thread pool doesn’t make much sense.
  2. WorkerOptions#maxConcurrentWorkflowTaskExecutionSize ≤ WorkerFactoryOptions#maxWorkflowThreadCount. maxWorkflowThreadCount should ideally be at least 2x maxConcurrentWorkflowTaskExecutionSize to be safe, but it depends on how actively the app uses threads and how active the workflows are. Having maxConcurrentWorkflowTaskExecutionSize > maxWorkflowThreadCount doesn’t make sense at all and is a misconfiguration, because for each workflow only one Workflow Task can be processed at a single moment in time anyway.
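Putting the two constraints together, a configuration that respects them might look like the following. The specific numbers are arbitrary illustrations, not recommendations:

```java
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class TuningSketch {
  // Hypothetical sizes chosen so that both constraints hold:
  //   workflowCacheSize (600) <= maxWorkflowThreadCount (800)
  //   maxConcurrentWorkflowTaskExecutionSize (400) <= maxWorkflowThreadCount / 2
  static final WorkerFactoryOptions FACTORY_OPTIONS = WorkerFactoryOptions.newBuilder()
      .setMaxWorkflowThreadCount(800)
      .setWorkflowCacheSize(600)
      .build();

  static final WorkerOptions WORKER_OPTIONS = WorkerOptions.newBuilder()
      .setMaxConcurrentWorkflowTaskExecutionSize(400)
      .setMaxConcurrentActivityExecutionSize(200)
      .build();
}
```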

The desired poller count

depends on

  1. The latency from the worker to the service. The higher the latency, the lower the throughput of a single poller thread.
  2. Whether worker threads are getting enough work. If worker threads are not loaded with enough work while schedule_to_start latencies are high, you can try increasing the poller count to 5-10 pollers.
  3. How big the workflow tasks are (how long the average interval is between consecutive points where workflow execution blocks on activities or on waiting/sleeping). The smaller your application’s workflow tasks, the larger the pollers/executors ratio you need.

The desired executors count

depends on the resource utilization of your worker. If the worker is underutilized, increase maxConcurrent*ExecutionSize.

Workflow tasks

It doesn’t make much sense to set WorkerOptions#maxConcurrentWorkflowTaskExecutionSize too high. Because workflow code shouldn’t contain blocking operations or waits [other than the ones provided by the Workflow class], a Workflow Task should occupy a full core during execution. This means it doesn’t make much sense to set WorkerOptions#maxConcurrentWorkflowTaskExecutionSize much higher than the number of cores.

Activities

WorkerOptions#maxConcurrentActivityExecutionSize should be set by looking at the profile of your Activities. If the Activities are mostly computational, it doesn’t make much sense to set it larger than the number of available cores. But if the Activities mostly perform input-output, awaiting RPC calls, it makes sense to increase this number by a lot.
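The two rules of thumb above can be reduced to a small heuristic. This is plain illustrative Java, not SDK API, and the 50x multiplier for IO-bound activities is an assumption to make the idea concrete, not a recommendation:

```java
public class SlotSizing {
    // Workflow tasks are CPU-bound, so cap execution slots near the core count.
    static int workflowTaskSlots(int cores) {
        return cores;
    }

    // Activities that mostly await RPC calls can run far more slots than cores;
    // the 50x multiplier here is a hypothetical starting point.
    static int activitySlots(int cores, boolean ioBound) {
        return ioBound ? cores * 50 : cores;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("workflow task slots: " + workflowTaskSlots(cores));
        System.out.println("CPU-bound activity slots: " + activitySlots(cores, false));
        System.out.println("IO-bound activity slots: " + activitySlots(cores, true));
    }
}
```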

Drawbacks of putting “very large values”

As with any multithreaded system, specifying values that are too large leads to too many active threads doing work, with constant resource stealing and context switching, which decreases the total throughput and increases the latency jitter of the system.


@spikhalskiy

Thanks, the info is really helpful. Based on your guidance, the execution sizes seem to be based on the number of cores, with the poller count adjusted based on the schedule_to_start metrics.

Questions:

  1. Are any of the metrics useful in tuning these numbers? Is it strictly the workflow and activity execution latency? Do other metrics like task queue poll, sticky_cache_miss, sticky_cache_forced_eviction, etc. give any insights?

  2. In my use case, the activities mainly do waiting RPC calls. I guess if I see temporal_activity_schedule_to_start_latency high, then I should probably increase WorkerOptions#maxConcurrentActivityExecutionSize.

  3. For maxConcurrentWorkflowTaskExecutionSize, how would you profile that? It sounds like it should never be bigger than the number of cores. If you see high temporal_workflow_task_schedule_to_start_latency, would that imply that the resources need to be scaled up?

  4. For the activity and workflow pollers, I noticed that the default values are pretty low: 2 workflow / 5 activity pollers. Does this mean that for each task queue there will be 2 workflow / 5 activity polls, which altogether is 4 workflow polls and 10 activity polls? In this case there are almost 2x as many activity pollers as workflow pollers. Is the activity count reflective of how many activities a workflow can execute?

  5. Also, is there any guidance on #setWorkflowHostLocalPollThreadCount(int)? The code states that you should increment this before incrementing the worker poll count.

  6. When configuring autoscaling, would you recommend profiling with one worker and load testing it to determine the ideal numbers? Then effectively you would just scale up based on the schedule_to_start P90 for the workflow metrics?

Thanks,
Derek
