We are going to have tens of millions of workflows running at the same time for our use case. Does Temporal support this scale?
We have tested it up to hundreds of millions. The size of a Temporal cluster is determined not by the number of parallel workflows, but by the number of operations they have to run per second.
What counts as an operation here — event creation, searching, etc.? So the number of operations should be on the order of # of workflows * # of activities, right?
An operation is anything that adds events to a workflow execution's history: for example, an activity invocation, a durable timer, a child workflow, or a signal.
The number of operations per second depends heavily on the use case. For example, you can have a hundred million workflows waiting on evenly distributed 6-month timers. The update rate is then roughly 6 timer fires per second, which is very low and can be supported by a small cluster. At the same time, 10k workflows, each running an activity per second, will execute 10k activities per second, which would require a much larger cluster.
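The two scenarios above can be checked with quick arithmetic (a sketch, approximating a month as 30 days):

```python
# Cluster load depends on operations/sec, not on how many workflows exist.
SECONDS_PER_6_MONTHS = 6 * 30 * 24 * 3600  # ~15.55M seconds (30-day months)

# Case 1: 100M workflows, each waiting on an evenly distributed 6-month timer.
workflows = 100_000_000
timer_fires_per_sec = workflows / SECONDS_PER_6_MONTHS
print(f"{timer_fires_per_sec:.1f} timer fires/sec")  # ~6.4/sec: a small cluster suffices

# Case 2: only 10k workflows, but each runs one activity per second.
activities_per_sec = 10_000 * 1
print(f"{activities_per_sec} activities/sec")  # far higher load despite 10,000x fewer workflows
```

Note that the second case has 10,000x fewer workflows but generates over 1,500x more operations per second, which is why operation rate, not workflow count, sizes the cluster.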
Gotcha, so it is horizontally scalable anyway; we can add instances when we observe high average CPU utilization. Is there any documentation on how to set up a production Temporal cluster, and best practices for operating and maintaining it?
We are working on such documentation. For development, we recommend running on Kubernetes via the supported Helm charts.
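A development install with those charts typically looks like the sketch below. The repository URL and release name are taken from the `temporalio/helm-charts` project; flags and chart layout change over time, so verify against its current README before running:

```shell
# Fetch the Temporal Helm charts (development/testing setups, per the project README)
git clone https://github.com/temporalio/helm-charts.git
cd helm-charts

# Pull in dependent charts (e.g. Cassandra, Elasticsearch) declared by the chart
helm dependencies update

# Install a release named "temporaltest"; the bundled dependencies can take a
# while to come up, hence the generous timeout
helm install temporaltest . --timeout 15m
```

The default values file stands up all dependencies in-cluster, which is convenient for development but not what you want in production; there you would point the chart at externally managed persistence instead.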