However, with the replication delay, it sounds like I could send a signal to a workflow and the active region could go down before that event was replicated to the standby region?
By “won’t get lost”, does it mean that if the signal didn’t get replicated, it’ll be processed once the previously active region recovers?
Yes. In the extreme scenario where the cluster goes down right after acknowledging the signals, the signals will be recovered once the source region recovers. However, replication lag is at the millisecond level, so this scenario is very unlikely to happen.
OK, that makes sense. Is this correct… my understanding is that signals of the same signal method name are normally delivered to a workflow in order. It sounds like this would no longer be guaranteed with multi-region namespaces?
At the level of an individual workflow, of course. Running a lot of simultaneous workflows would raise the probability that a rare event happens to at least one of them.
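To put rough numbers on that, here is a small sketch of the arithmetic. The per-workflow probability `p` and the workflow count `n` are made-up values for illustration, and the formula assumes workflows are affected independently, which may not hold in a real region outage.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Probability that a rare event with per-workflow probability p
    hits at least one of n workflows, assuming independence."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical values: a one-in-a-million event across a million workflows.
p = 1e-6
n = 1_000_000
print(f"single workflow: {p:.6f}")
print(f"at least one of {n}: {prob_at_least_one(p, n):.3f}")
```

With these illustrative inputs the result comes out at roughly 0.63, so an event that is negligible for one workflow can become likely somewhere across a large fleet.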
On the other hand, a region usually doesn’t just crash; instead, performance degrades, which would give time for events to be replicated.
The lack of an ordering guarantee for signal delivery would be important to highlight, as a workflow that worked correctly when it processed signals in order might not if signals are reordered.
Signal reordering can only happen when there is a failover and conflict resolution takes place. If the workflow execution logic depends on signal order, then ordering is not guaranteed in that case, because signals are ordered by when the server receives and records them. Users have to include some ordering information in the signal payload to avoid race conditions.
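As a rough illustration of "ordering information in the signal payload", here is a minimal, SDK-independent sketch. It assumes a single sender that stamps each signal with a sequence number starting at 0. The names `SequencedSignal`, `ReorderBuffer`, and `accept` are hypothetical and not part of any Temporal API. In a real workflow, the buffer would be kept as workflow state and `accept` would be called from the signal handler.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SequencedSignal:
    seq: int      # hypothetical sender-assigned sequence number, starting at 0
    payload: Any  # the actual signal data


@dataclass
class ReorderBuffer:
    next_seq: int = 0                                    # next sequence number to release
    pending: Dict[int, Any] = field(default_factory=dict)  # early arrivals, keyed by seq

    def accept(self, signal: SequencedSignal) -> List[Any]:
        """Record a signal and return the payloads that are now ready,
        in sender order. Duplicates and already-released signals are ignored."""
        if signal.seq < self.next_seq or signal.seq in self.pending:
            return []
        self.pending[signal.seq] = signal.payload
        ready: List[Any] = []
        # Release the longest run of consecutive sequence numbers available.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


# Signals arrive out of order (1 before 0) but are processed in sender order.
buf = ReorderBuffer()
out: List[Any] = []
for s in [SequencedSignal(1, "b"), SequencedSignal(0, "a"), SequencedSignal(2, "c")]:
    out.extend(buf.accept(s))
print(out)  # ['a', 'b', 'c']
```

The buffer holds back a signal until all earlier sequence numbers have arrived, so it relies on the delayed signal eventually being delivered. With multiple independent senders, each would need its own sequence space or some other ordering key.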
Yes, of course, if I’m aware of the new requirement then I can design my workflows to continue to work correctly in the presence of rare signal reordering.
My point is simply that it would be good to mention this in the documentation, so that people would know what changes they might need to make to their workflow implementations to continue to get reliable execution when using multi-region.