Business continuity in the case of a regional outage

  • Temporal Multi-cluster

This will potentially be the best option, but it is currently an experimental feature with a lot of sharp edges.

  • Cassandra datastore replication

This is not an option. Temporal requires a fully consistent DB, and Cassandra's wide-area replication does not meet that requirement.

  • started and replicated:
    What happens when the region comes back online? Does it do an initial sanity check with the DR site to see if it missed a lot of events and reconcile the current state of workflows, so it does not try to resume everything from the last known state?

Not yet. It will be possible in the future once clean failover is implemented. At this point, it is the operator’s responsibility to let the replication backlog drain before initiating failover back to the original region. If the backlog is not drained, the workflows are going to start from the last known state and all events from the remote region will be treated as conflicts (which will reapply signals as new signals).
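Because an unclean failback can reapply signals as new signals, it helps if signal handlers tolerate duplicates. Below is a minimal Go SDK sketch of one way to do that; the `PaymentSignal` type, the `payment` signal name, and the deduplication-by-ID scheme are assumptions for illustration, not anything Temporal prescribes.

```go
package sample

import (
	"go.temporal.io/sdk/workflow"
)

// PaymentSignal is a hypothetical signal payload carrying a caller-generated ID.
type PaymentSignal struct {
	ID     string
	Amount int
}

// PaymentWorkflow applies each signal at most once, so a signal re-applied as
// a "new" signal during conflict resolution (or re-sent by a client) is a no-op.
func PaymentWorkflow(ctx workflow.Context) error {
	seen := map[string]bool{} // IDs of signals already applied
	total := 0

	ch := workflow.GetSignalChannel(ctx, "payment")
	for i := 0; i < 10; i++ {
		var sig PaymentSignal
		ch.Receive(ctx, &sig)

		// Skip duplicates instead of double-applying the business effect.
		if seen[sig.ID] {
			continue
		}
		seen[sig.ID] = true
		total += sig.Amount
	}

	workflow.GetLogger(ctx).Info("received payments", "total", total)
	return nil
}
```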

  • started and not replicated:
    I guess this depends on business criticality, so the likely decision would be to wait the outage out or re-create the workflows on DR?

Correct.

  • received and not replicated:
    What happens when the primary site comes back up and DR was running the workflows for a while? Does the primary check back with DR to see if it should send signals, etc.?
    I guess we also need to re-send any signals “lost in transit” to DR because of the primary site outage?

The events from the now-passive cluster are going to be applied to the currently active cluster. When a conflict is detected, the signals will be applied as new signals.
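Given that, re-sending signals that may have been lost in transit is safe as long as the workflow deduplicates them. A minimal Go client sketch of such a re-send; the frontend address, namespace, workflow ID, signal name, and payload are hypothetical and pair with the handler sketch above:

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

// PaymentSignal mirrors the payload the workflow expects (hypothetical).
type PaymentSignal struct {
	ID     string
	Amount int
}

func main() {
	// Talk to the active cluster's frontend (or one configured to forward to it).
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend.dr-region.internal:7233",
		Namespace: "my-global-namespace",
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	// Re-send the signal with the same logical ID it had before the outage;
	// if the original delivery did go through, the workflow treats this one
	// as a duplicate and ignores it.
	err = c.SignalWorkflow(context.Background(), "payment-workflow-42", "",
		"payment", PaymentSignal{ID: "order-1234", Amount: 100})
	if err != nil {
		log.Fatalln("unable to signal workflow", err)
	}
}
```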

Does Temporal multi-cluster take care of the client failover to the alternative endpoint without a configuration change, OR do we need to provide a floating IP/DNS/etc. to manage the multi-cluster routing?

Each cluster should have its own workers. The workers on the passive cluster are not going to get any tasks; when failover happens, they start receiving tasks.
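A minimal Go sketch of that worker setup, assuming a recent Go SDK; the frontend addresses, namespace, task queue, and `PaymentWorkflow` are hypothetical. The same binary runs in both regions, each pointed at its local cluster's frontend:

```go
package main

import (
	"log"
	"os"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// PaymentWorkflow is a placeholder workflow registered on the workers in both regions.
func PaymentWorkflow(ctx workflow.Context) error {
	workflow.GetLogger(ctx).Info("running in whichever cluster is currently active")
	return nil
}

func main() {
	// TEMPORAL_FRONTEND_ADDR points at the *local* cluster's frontend in each
	// region, e.g. "temporal-frontend.us-east-1.internal:7233" in one region
	// and "temporal-frontend.eu-west-1.internal:7233" in the other.
	c, err := client.Dial(client.Options{
		HostPort:  os.Getenv("TEMPORAL_FRONTEND_ADDR"),
		Namespace: "my-global-namespace", // a global (replicated) namespace
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	w := worker.New(c, "payments-task-queue", worker.Options{})
	w.RegisterWorkflow(PaymentWorkflow)

	// Workers in the passive region keep polling but receive no tasks; after a
	// failover they start receiving tasks with no configuration change here.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped", err)
	}
}
```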

The frontend can be configured to forward workflow starts, signals, and queries to the active cluster. Clusters should be able to see each other, but this is needed for replication to work anyway.
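Assuming that forwarding is enabled, a starter (or signal/query client) can keep talking to its local frontend regardless of which cluster is currently active. A minimal Go sketch with hypothetical addresses and IDs:

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/workflow"
)

// PaymentWorkflow mirrors the worker-side registration in the sketch above.
func PaymentWorkflow(ctx workflow.Context) error { return nil }

func main() {
	// The starter talks to its local frontend; with forwarding enabled the
	// start request is routed to whichever cluster is active for the namespace.
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend.local-region.internal:7233",
		Namespace: "my-global-namespace",
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client", err)
	}
	defer c.Close()

	run, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
		ID:        "payment-workflow-42",
		TaskQueue: "payments-task-queue",
	}, PaymentWorkflow)
	if err != nil {
		log.Fatalln("unable to start workflow", err)
	}
	log.Println("started workflow", run.GetID(), run.GetRunID())
}
```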
