Synchronization of a large number of workflows - best practices?

Our application stores customer & service configuration data for telecom operators. When a change happens (e.g. a customer picks a new product or manually changes a setting such as “forwarding number”), these events have to be queued and sent out to a variety of systems (ranging from mobile network components such as the HSS to CRM/accounting systems) so the information there gets updated. Provisioning workflows for individual systems on our side are fairly simple - usually it is just sending an HTTP/REST API request to the external server and waiting for confirmation of the result. The number of events per second is not that high (up to 1000/sec), but since an external system may be inaccessible for days, we can accumulate a large backlog of queued events. High-level diagram
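To make the retry requirement concrete, here is a minimal stdlib sketch of the retry-until-confirmed loop a queued provisioning event needs when the target system may be down for a long time (this is roughly what Temporal activity retry policies give you out of the box). All names (`send_with_retry`, `flaky_api_call`) are hypothetical:

```python
import time

def send_with_retry(call, max_attempts: int = 5, base_delay: float = 0.01) -> str:
    """Keep retrying an external call with exponential backoff.

    Illustrative only: a real system would cap the delay and keep the
    event durably queued instead of giving up after max_attempts.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("external system still unreachable; keep event queued")

# Fake external system that fails twice, then confirms.
attempts = {"n": 0}

def flaky_api_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("HSS unreachable")
    return "OK"

result = send_with_retry(flaky_api_call)  # succeeds on the third attempt
```

With Temporal, this loop disappears from application code: you attach a `RetryPolicy` to the activity and the server drives the retries durably.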

We are now evaluating the replacement of the home-developed queue manager (using DB tables) and workflow workers with Temporal.

The biggest challenge so far seems to be concurrency and potential race conditions. For certain types of events (e.g. “customer exceeded his/her download quota”) we can have a high number of concurrent workflows and interactions with other systems. For instance, for the “data transfer quota exceeded” event: if customers A1, A2, … A100 used up their quotas one second after another, it does not really matter if we call the “send_sms_notification” API method on the external system in a different order (so A99 gets his SMS a bit earlier than A1).

For others there is a race condition on a certain resource/attribute. E.g. if an administrator re-assigns phone number 206-555-1234 from A1 to A2, it is required that A1 is first de-registered on the mobile network and only then is A2 given that number (otherwise the operation will fail). So the workflow “provision customer A2” has to wait until “provision customer A1” completes (and the phone number becomes available).
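Independent of Temporal, the serialization requirement can be sketched as a lock keyed by the contended resource (here, the phone number), so that all operations on the same number run strictly one after another. Function and customer names are illustrative only:

```python
import asyncio
from collections import defaultdict

# One asyncio.Lock per contended resource (here, a phone number).
resource_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
log: list[str] = []

async def reassign_number(number: str, old_owner: str, new_owner: str) -> None:
    # Everything touching the same number is serialized by its lock, so
    # the old owner is always de-registered before the new one is
    # provisioned, even when reassignments run concurrently.
    async with resource_locks[number]:
        log.append(f"deprovision {number} from {old_owner}")
        await asyncio.sleep(0)  # stand-in for the external API calls
        log.append(f"provision {number} to {new_owner}")

async def main() -> None:
    # Two concurrent reassignments of the same number: A1 -> A2 -> A3.
    await asyncio.gather(
        reassign_number("206-555-1234", "A1", "A2"),
        reassign_number("206-555-1234", "A2", "A3"),
    )

asyncio.run(main())
```

In Temporal, the same effect is usually achieved by routing all operations for one resource through a single workflow execution keyed by that resource, rather than by an in-process lock.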

Has anybody dealt with a similar challenge? I’ve seen the post about limiting the number of concurrent workflows to 1 and the recommendation there to use the mutex workflow, but has anyone used it in a situation where there is a large number of concurrently active workflows? Or have you used something else for a similar task?

Thanks!

TLDR: Your use case fits the Temporal programming model pretty well.

Temporal scales out with the number of workflow executions (instances). We have tested up to a couple hundred million open workflows, but it can go much higher, as system capacity is usually limited by the number of state transitions per second, not the overall number of open workflows.

Temporal doesn’t scale up the size of a single workflow execution. For example, it is not really possible to have a single execution handle 100 signals per second or execute 1 million activities directly.

Since, in your case, each execution is not expected to be very large or to handle a high rate of events, and you scale out with the number of executions, Temporal is a very good fit.

I believe that having a workflow execution per customer and using it to ensure consistency of operations that apply to a given customer would simplify your programming model. You might also decide to have workflows for entities like a telephone number, to coordinate the transfer. Such entities can be long-lived or exist only for the duration of the transfer operation. This approach relies on the strongly consistent nature of Temporal workflows and the guarantee that only one workflow with a given business id can exist at a time.
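The entity-workflow idea can be modeled with plain stdlib code (this is not Temporal SDK code; in Temporal, the at-most-one-instance guarantee comes from workflow id uniqueness, and the `signal` method below plays the role of signal-with-start):

```python
import asyncio

class EntityRegistry:
    """Toy model of "one workflow execution per business id"."""

    def __init__(self) -> None:
        self._queues: dict[str, asyncio.Queue] = {}
        self._tasks: dict[str, asyncio.Task] = {}
        self.processed: list[tuple[str, str]] = []

    async def _run_entity(self, business_id: str) -> None:
        # Per-entity loop: events for one customer are handled strictly
        # in order, which gives per-customer consistency for free.
        queue = self._queues[business_id]
        while True:
            event = await queue.get()
            if event is None:  # shutdown sentinel
                return
            self.processed.append((business_id, event))

    def signal(self, business_id: str, event: str) -> None:
        # "Signal with start": create the entity handler on first use,
        # otherwise deliver the event to the existing one.
        if business_id not in self._tasks:
            self._queues[business_id] = asyncio.Queue()
            self._tasks[business_id] = asyncio.create_task(
                self._run_entity(business_id))
        self._queues[business_id].put_nowait(event)

    async def shutdown(self) -> None:
        for queue in self._queues.values():
            queue.put_nowait(None)
        await asyncio.gather(*self._tasks.values())

async def main() -> EntityRegistry:
    reg = EntityRegistry()
    reg.signal("customer-A1", "quota_exceeded")
    reg.signal("customer-A2", "quota_exceeded")
    reg.signal("customer-A1", "plan_changed")  # queued behind the first A1 event
    await reg.shutdown()
    return reg

registry = asyncio.run(main())
```

Different customers' handlers run independently (matching the "order doesn't matter across customers" case), while events for the same customer are serialized.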