Hi,
We’re hosting Temporal on our own k8s cluster. Since the number of history shards can’t be changed after initial startup, we’re running load tests to determine scalability and balance it against idle resource usage. Our load test covers various scenarios: an echo activity, a sleep activity, workflow sleeps, child workflows in a tree topology, etc.
We tested with various history shard counts and can see the number of similar workflows completed per second increase as the shard count grows, but it plateaus at some point (around 2k history shards, at roughly 200 workflows/s with ~20 echo activities each). We’ve tried increasing taskQueuePartitions, tuning the various RPS limits on each service, scaling up Temporal server instances, adding more workers, etc.
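For reference, we applied these knobs through the server’s dynamic config. A minimal sketch of the kind of overrides we experimented with (the values below are illustrative from our test runs, not recommendations):

```yaml
# Temporal dynamic config sketch; values are examples only.
matching.numTaskqueueReadPartitions:
  - value: 8
    constraints: {}
matching.numTaskqueueWritePartitions:
  - value: 8
    constraints: {}
frontend.rps:
  - value: 2400
    constraints: {}
history.rps:
  - value: 3000
    constraints: {}
matching.rps:
  - value: 1200
    constraints: {}
```

None of these changed the plateau meaningfully once we were past ~2k shards.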
Is there a metric we can use to determine whether the number of history shards is the bottleneck (e.g., delay in acquiring a shard, or workflows waiting on a shard to write to persistence), or whether the bottleneck comes from some other limit?
What is the general guideline for determining the number of history shards? Say we want to run 200 workflows/s, each with ~20 activities.
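To give a sense of the write load we’re targeting, here is a rough back-of-envelope estimate. It assumes ~3 history events per activity (the ActivityTaskScheduled/Started/Completed events we see in our workflow histories), and ignores workflow task events, so the real event rate would be somewhat higher:

```python
# Back-of-envelope estimate of history event throughput at our target load.
# The 3-events-per-activity figure is an assumption based on the
# Scheduled/Started/Completed events visible in activity histories.

workflows_per_sec = 200
activities_per_workflow = 20
events_per_activity = 3  # ActivityTaskScheduled, Started, Completed

activity_events_per_sec = (
    workflows_per_sec * activities_per_workflow * events_per_activity
)
print(activity_events_per_sec)  # 12000 history events/s from activities alone
```

Is there a rule of thumb for mapping that kind of event rate onto a shard count?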
Thanks,