Throughput not improving

During stress testing I am unable to find any obvious bottleneck. CPU usage on the workers and the Temporal servers is less than 50%, and on the Cassandra servers it is less than 30%. I have tried changing the different configurations I am aware of, but the throughput stays the same. I tried the following configurations:

numHistoryShards: 512 and 1024
matching.numTaskqueueReadPartitions: 16

WorkerFactoryOptions.setMaxWorkflowThreadCount: 600, 1200
WorkerFactoryOptions.setWorkflowCacheSize: 600, 1200
WorkerFactoryOptions.setWorkflowHostLocalPollThreadCount: default, 20, 40

WorkerOptions.setMaxConcurrentActivityExecutionSize: 200, 400
WorkerOptions.setMaxConcurrentWorkflowTaskExecutionSize: 200, 400
WorkerOptions.setMaxConcurrentLocalActivityExecutionSize: 200, 400
WorkerOptions.setMaxConcurrentActivityTaskPollers: default, 20, 40
WorkerOptions.setMaxConcurrentWorkflowTaskPollers: default, 20, 40
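For context, this is roughly how the workers are set up with the Java SDK (the frontend address and task queue name below are placeholders, and the values shown are one of the combinations I tried):

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.worker.WorkerOptions;

public class WorkerSetup {
  public static void main(String[] args) {
    // Connect to the Temporal frontend (placeholder address).
    WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget("temporal-frontend:7233")
            .build());
    WorkflowClient client = WorkflowClient.newInstance(service);

    // Factory-level tuning: workflow thread count, cache size, host-local poll threads.
    WorkerFactoryOptions factoryOptions = WorkerFactoryOptions.newBuilder()
        .setMaxWorkflowThreadCount(1200)
        .setWorkflowCacheSize(1200)
        .setWorkflowHostLocalPollThreadCount(40)
        .build();
    WorkerFactory factory = WorkerFactory.newInstance(client, factoryOptions);

    // Worker-level tuning: execution slot sizes and poller counts.
    WorkerOptions workerOptions = WorkerOptions.newBuilder()
        .setMaxConcurrentActivityExecutionSize(400)
        .setMaxConcurrentWorkflowTaskExecutionSize(400)
        .setMaxConcurrentLocalActivityExecutionSize(400)
        .setMaxConcurrentActivityTaskPollers(40)
        .setMaxConcurrentWorkflowTaskPollers(40)
        .build();
    Worker worker = factory.newWorker("stress-test-queue", workerOptions); // placeholder queue

    // worker.registerWorkflowImplementationTypes(...);
    // worker.registerActivitiesImplementations(...);

    factory.start();
  }
}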

Please let me know if I am missing anything.

Also, is there any command I can execute to see the numHistoryShards in the cluster?

I saw an improvement in throughput when I switched to 2 Frontend instances, each with 2 cores and 4 GB RAM. Previously I was using 1 instance with 4 cores and 8 GB RAM. I made the same change for Matching as well. I kept the History servers unchanged: 2 instances, each with 4 cores and 8 GB RAM.

I am assuming the Frontend was the bottleneck, since none of the other changes I have tried so far made any difference.

  • Do we get better throughput with a larger number of smaller Frontend servers? For example, should I use 4 Frontend instances, each with 1 core and 2 GB RAM?
  • How do we detect such bottlenecks, given that CPU usage on the Frontend was less than 50%?
  • Should I try similar changes for the History and Matching servers to achieve better throughput with the same hardware?

@maxim @tihomir

Sorry for the late reply. Can you update this post with the latest info you have so we can take a look?

matching.numTaskqueueReadPartitions: 16

Do you see any improvement if you lower this to 8?

Also, is there any command I can execute to see the numHistoryShards in the cluster?

Yes, you can use tctl (the command below is shorthand for tctl admin cluster describe), for example:

tctl adm cl d | jq .historyShardCount

Also, can you check your Frontend service metrics and look for any service errors:

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)