We are testing Temporal to understand how to tune the available options to improve the throughput of the system.
We are using three worker nodes to execute tasks, each with 4 cores and 8 GB of memory. The workflow has 5 local activities, each taking 5-10 ms.
We found that CPU usage is low (20%). We have been unable to increase either the throughput (combined throughput of approximately 210) or the CPU usage, even after trying different configuration options.
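As a rough sanity check on these numbers, here is a minimal sketch of the arithmetic. It assumes the throughput figure is in workflows per second and counts only local activity time (no server round trips or workflow task overhead); the class and method names are hypothetical, made up for this illustration.

```java
// Back-of-the-envelope capacity estimate using the figures from the description above.
// CapacityEstimate and its methods are hypothetical helpers, not part of any SDK.
final class CapacityEstimate {

    /** Upper bound on workflows per second if the local activities were purely CPU-bound. */
    static double cpuBoundWorkflowsPerSecond(int totalCores, double busyMillisPerWorkflow) {
        return totalCores * 1000.0 / busyMillisPerWorkflow;
    }

    /** Little's law: average number in flight = arrival rate * time spent in the system. */
    static double inFlightWorkflows(double workflowsPerSecond, double latencyMillis) {
        return workflowsPerSecond * latencyMillis / 1000.0;
    }

    public static void main(String[] args) {
        int totalCores = 3 * 4;       // three worker nodes, 4 cores each
        double minBusyMs = 5 * 5.0;   // 5 local activities at 5 ms each
        double maxBusyMs = 5 * 10.0;  // 5 local activities at 10 ms each
        double observed = 210.0;      // ASSUMPTION: the reported throughput is per second

        System.out.printf("CPU-bound ceiling: %.1f to %.1f workflows/s%n",
                cpuBoundWorkflowsPerSecond(totalCores, maxBusyMs),
                cpuBoundWorkflowsPerSecond(totalCores, minBusyMs));
        System.out.printf("Average in flight at observed rate: %.2f to %.2f%n",
                inFlightWorkflows(observed, minBusyMs),
                inFlightWorkflows(observed, maxBusyMs));
    }
}
```

Under those assumptions, 12 cores would cap out at 240 to 480 workflows per second if the activities were pure CPU work, and the observed rate corresponds to only about 5.25 to 10.5 workflows executing activity code at any moment.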
We have modified the following values:
- WorkerFactoryOptions.setMaxWorkflowThreadCount - up to 2400
- WorkerFactoryOptions.setWorkflowCacheSize - up to 2400
- WorkerOptions.setMaxConcurrentActivityExecutionSize - up to 800
- WorkerOptions.setMaxConcurrentWorkflowTaskExecutionSize - up to 800
- WorkerOptions.setMaxConcurrentLocalActivityExecutionSize - up to 800
- WorkerFactoryOptions.setWorkflowHostLocalPollThreadCount - up to 240
- WorkerOptions.setWorkflowPollThreadCount - up to 240
- WorkerOptions.setActivityPollThreadCount - up to 240
- numHistoryShards - 512 (the Helm chart default)
- matching.numTaskqueueReadPartitions - unchanged; the default value is 4
- matching.numTaskqueueWritePartitions - unchanged; we are not sure what the default value is
- Are there any other configuration options that can be adjusted to improve the throughput?
- Are there any recommended settings for the above configurations?
- What could be the bottleneck here?
- What is the difference between WorkerFactoryOptions.setWorkflowHostLocalPollThreadCount and WorkerOptions.setWorkflowPollThreadCount?