Our application stores customer and service configuration data for telecom operators. When a change happens (e.g. a customer picks a new product or manually changes a setting such as “forwarding number”), the resulting events have to be queued and sent out to a variety of systems (from mobile network components such as the HSS to CRM/accounting systems) so the information there gets updated. Provisioning workflows for individual systems are fairly simple on our side: usually it is just sending an HTTP/REST API request to the external server and waiting for confirmation of the result. The event rate is not that high (up to 1000/sec), but since an external system may be inaccessible for days, we can accumulate quite a lot of queued events.
[Figure: High-level diagram]
We are now evaluating replacing our home-grown queue manager (built on DB tables) and workflow workers with Temporal.
The biggest challenge so far seems to be concurrency and potential race conditions. For certain types of events (e.g. “Customer exceeded his/her download quota”) we can have a large number of concurrent workflows and interactions with other systems, and their relative order is irrelevant. For instance, for the “data transfer quota exceeded” event: if customers A1, A2, … A100 each used up their quota one second after another, it does not really matter in which order we call the “send_sms_notification” API method on the external system, so it is fine if A99 gets the SMS a bit earlier than A1.
For other events there is a race condition tied to a specific resource/attribute. E.g. if an administrator re-assigns phone number 206-555-1234 from A1 to A2, A1 must first be de-registered on the mobile network, and only then can A2 be given that number (otherwise the operation will fail). So the workflow “provision customer A2” has to wait until “provision customer A1” completes and the phone number becomes available.
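To make the requirement concrete, here is a minimal in-process sketch of the behaviour I am after, written with plain Python `asyncio` rather than Temporal. It is only an analogy: `ResourceLocks`, `provision` and the second phone number are hypothetical names made up for illustration, and a real solution would need the lock to be durable (which is presumably what the mutex workflow provides). Events that share a resource key are serialized in arrival order, while events on other keys proceed concurrently.

```python
import asyncio
from collections import defaultdict


class ResourceLocks:
    """Hypothetical helper: one asyncio.Lock per resource key (e.g. a phone number)."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)

    def for_key(self, key):
        return self._locks[key]


async def provision(customer, action, resource_key, locks, log):
    # Only events that touch the same resource wait for each other.
    async with locks.for_key(resource_key):
        log.append(f"start {action} {customer}")
        await asyncio.sleep(0.01)  # stands in for the HTTP call to the external system
        log.append(f"done {action} {customer}")


async def main():
    locks = ResourceLocks()
    log = []
    number = "206-555-1234"
    await asyncio.gather(
        provision("A1", "deregister", number, locks, log),
        provision("A2", "register", number, locks, log),    # must wait for A1
        provision("A3", "register", "206-555-9999", locks, log),  # unrelated number
    )
    return log


if __name__ == "__main__":
    for entry in asyncio.run(main()):
        print(entry)
```

Here A2 does not start until A1 is done, because both hold the same key and `asyncio.Lock` hands the lock to waiters in the order they started waiting. A3 uses a different key and starts while A1 is still in progress. The open question is how to get the same per-key behaviour in Temporal when the number of keys and active workflows is large.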
Has anybody dealt with a similar challenge? I’ve seen the post about limiting the number of workflows to 1 and the recommendation there to use the mutex workflow, but has anyone used it in a situation where there is a large number of concurrently active workflows? Or have you used something else for a similar task?