We are looking to build a workflow platform on top of Temporal. There will be multiple workflows (on the order of hundreds), but each workflow will have millions of concurrent executions (tens to hundreds of millions). Some of the workflows will be listening or waiting for external signals, while others will be actively running.
Is this scale manageable with Temporal? Are there any customers running millions of concurrent executions in production?
TL;DR: Yes, there are production use cases that have hundreds of millions of open workflows at all times.
The scalability of a Temporal cluster is mostly defined by the number of state transitions (which roughly corresponds to the number of DB updates) per second. The number of open workflows doesn't affect scalability, assuming the DB has enough disk capacity.
So in your case, you should be concerned about the number of signals/activities/child workflows/etc. that you want to execute per second, not the number of open executions.
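Since capacity is governed by state transitions per second rather than by open workflow count, a useful first step is a back-of-envelope estimate of your workload's transition rate. The sketch below is hypothetical: the per-operation transition counts are illustrative assumptions for sizing intuition, not figures from Temporal's documentation.

```python
# Back-of-envelope estimate of state transitions (~DB updates) per second
# for a hypothetical workload mix. The per-operation transition counts are
# ASSUMPTIONS for illustration, not documented Temporal figures.

# Assumed number of state transitions each operation type generates.
TRANSITIONS_PER_OP = {
    "signal": 2,
    "activity": 4,
    "child_workflow": 6,
}

def estimated_transitions_per_sec(ops_per_sec: dict[str, float]) -> float:
    """Sum the assumed transition cost of each operation type in the mix."""
    return sum(rate * TRANSITIONS_PER_OP[op] for op, rate in ops_per_sec.items())

# Hypothetical workload: 1,000 signals/s and 500 activity executions/s.
workload = {"signal": 1_000, "activity": 500}
print(estimated_transitions_per_sec(workload))  # → 4000
```

An estimate like this tells you roughly how many DB updates per second your cluster's persistence layer must sustain, which is the number to compare against what your database hardware can deliver.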
Got it. So, reframing my question: what is the maximum number of signals/activities/child workflows/etc. that can be executed per second? Going through the documentation, it looks like this depends on DB capacity, but I wanted to get a general sense of the scale that is possible, assuming we pick the largest Aurora or Vitess instance.
It depends on the specific workflows being executed. The best way to measure the possible throughput on given hardware is either to execute real workflows or to rely on the Maru load simulator. Here is a blog post about Maru.