Capacity Planning for a Higher Throughput Temporal Cluster

Hi folks. We are investigating the feasibility and resource requirements of a high-throughput, low-latency Temporal cluster. Here are the high-level requirements:

  • Throughput:
    • Short term: 100 workflow completions / second
    • Mid to long term: 300 - 500 workflow completions / second
  • Latency:
    • Workflow schedule to start: less than 10 seconds
    • Activity / task schedule to start: less than 10 seconds
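
For a rough sense of scale, the sketch below converts these completion rates into daily workflow volume. It assumes the load is sustained around the clock, which is a simplification on our part (real traffic will be burstier), and the helper name is only illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def completions_per_day(rate_per_second: int) -> int:
    """Daily workflow completions, assuming the rate is sustained all day."""
    return rate_per_second * SECONDS_PER_DAY


# Target rates taken from the requirements above (workflow completions / second).
targets = {
    "short term": 100,
    "mid to long term (low end)": 300,
    "mid to long term (high end)": 500,
}

for label, rate in targets.items():
    print(f"{label}: {rate}/s is about {completions_per_day(rate):,} completions/day")
```

These totals only count completed workflows; the persistence layer would see more writes than that, since each workflow produces multiple history events.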

For ease of maintenance, we’d like to use RDS for our persistence layer. Will MySQL / Aurora provide enough performance for our use case? What would be a good choice for the number of history shards? Is 2048 too much?

In addition, what are the recommended resource settings for the Temporal service pods? (We are hosting them in K8S.) And what other dynamic configs should we tune to achieve the performance goals?

I noticed that there isn’t much documentation about operating / tuning a self-managed Temporal cluster in production. Are there any good articles about it?

Cheers folks! Thank you so much for the help.