Business continuity in the case of a regional outage

Continuation of a Slack discussion with @samar.

Hello, I have a question about setting up Temporal to support business continuity in the scenario of a regional failure.

Let’s say we run critical business applications across 2 regions and the active region goes down. What would be the best approach for setting up Temporal to handle the regional outage, pick up incomplete workflows from the failed region, and avoid having to change the worker configurations?

Potential options the team is considering:

  • Temporal Multi-cluster
  • Cassandra datastore replication

So far, the following is known:

Workflows:

  • started and replicated: if your activities are idempotent, this should not impact any running workflow executions. During failover your workflows can resume from a point earlier than the last progress made in the other region and reschedule already-executed activities.
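The “resume from an earlier point and reschedule activities” behaviour is exactly why idempotency matters here. A minimal sketch in plain Python (no Temporal SDK involved; `charge`, the key names, and the in-memory dedup store are all hypothetical) of an activity made safe to re-execute via an idempotency key:

```python
# Sketch: an activity keyed by an idempotency key, so re-executing it
# after a failover-and-replay does not double-apply the side effect.
# "applied" stands in for the downstream business system's dedup record.

applied = {}             # idempotency_key -> result of the first application
ledger_balance = 0       # downstream business state


def charge(idempotency_key: str, amount: int) -> int:
    """Idempotent activity: applies the charge at most once per key."""
    global ledger_balance
    if idempotency_key in applied:           # already applied: return prior result
        return applied[idempotency_key]
    ledger_balance += amount                 # the real side effect
    applied[idempotency_key] = ledger_balance
    return ledger_balance


def run_workflow(start_step: int = 0) -> None:
    """Replays the workflow from `start_step`, as a failover would."""
    for step in range(start_step, 3):
        charge(f"wf-42-step-{step}", amount=10)


run_workflow()               # original region runs all 3 steps
run_workflow(start_step=1)   # DR resumes from an earlier checkpoint
print(ledger_balance)        # still 30: steps 1 and 2 were deduplicated
```

If `charge` lacked the key check, the DR replay would re-apply steps 1 and 2 and the balance would drift to 50.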

Signals:

  • received and replicated: any signals from the previous region will be reinserted after the fork in the execution history.

What is not yet known:

Workflows:

  • started and replicated:
    What happens when the region comes back online? Does it do an initial sanity check with the DR site to see if it missed a lot of events and reconcile the current state of workflows, so it does not try to resume everything from the last known state?

  • started and not replicated:
    I guess this depends on the business criticality, so the likely decision would be either to wait out the outage or to re-create the workflows on DR?

Signals:

  • received and not replicated:
    What happens when the primary site comes back up and DR has been running the workflows for a while? Does the primary check back with DR to decide whether it should send signals, etc.?
    I guess we also need to re-send any signals “lost in transit” to DR because of the primary site outage?

Workers:

Does Temporal multi-cluster take care of client failover to the alternative endpoint without a configuration change, or do we need to provide a floating IP/DNS/etc. to manage the multi-cluster routing?

Thanks,
A

Also, what would be the approach for handling a workflow where an activity is deterministic but not idempotent? In this case we may have a scenario where the activity completed in the primary region without its state being replicated, so it may run again if the DR cluster is made primary.

  • Temporal Multi-cluster

Potentially it is the best option. But currently it is an experimental feature with a lot of sharp edges.

  • Cassandra datastore replication

This is not an option. Temporal requires a fully consistent database, and Cassandra’s wide-area replication doesn’t fit this category.

  • started and replicated:
    What happens when the region comes back online? Does it do an initial sanity check with the DR site to see if it missed a lot of events and reconcile the current state of workflows, so it does not try to resume everything from the last known state?

Not yet. It will be possible in the future once clean failover is implemented. At this point it is the operator’s responsibility to let the replication backlog drain before initiating failover back to the original region. If the backlog is not drained, the workflows are going to resume from the last known state and all events from the remote region will be treated as conflicts (which is going to reapply signals as new signals).
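The conflict behaviour described above can be pictured with a toy history merge in plain Python (the event tuples and the `merge` helper are made up for illustration; this is not Temporal’s actual conflict-resolution code): the surviving history wins, and signals from the divergent branch are reapplied as brand-new signal events at its tail.

```python
# Toy model: two clusters diverged after event index `fork_point`.
# When the backlog is not drained, the active cluster's branch is kept,
# and signals from the remote branch are reapplied as new signals.

fork_point = 2

# Shared prefix plus each region's divergent suffix.
active_history = [("started",), ("activity", 1), ("signal", "A"), ("activity", 2)]
remote_history = [("started",), ("activity", 1), ("signal", "B"), ("signal", "C")]


def merge(active, remote, fork):
    """Keep the active history; reapply remote post-fork signals as new events."""
    merged = list(active)
    for event in remote[fork:]:
        if event[0] == "signal":          # only signals survive the conflict
            merged.append(("signal", event[1]))
    return merged


result = merge(active_history, remote_history, fork_point)
print(result)
# Signals "B" and "C" reappear at the tail, after the active branch's events;
# the remote branch's other events are discarded.
```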

  • started and not replicated:
    I guess this depends on the business criticality, so the likely decision would be either to wait out the outage or to re-create the workflows on DR?

Correct.

  • received and not replicated:
    What happens when the primary site comes back up and DR has been running the workflows for a while? Does the primary check back with DR to decide whether it should send signals, etc.?
    I guess we also need to re-send any signals “lost in transit” to DR because of the primary site outage?

The events from the now-passive cluster are going to be applied to the currently active cluster. When a conflict is detected, the signals will be applied as new signals.

Does Temporal multi-cluster take care of client failover to the alternative endpoint without a configuration change, or do we need to provide a floating IP/DNS/etc. to manage the multi-cluster routing?

Each cluster should have its own workers. The workers on the passive cluster are not going to get any tasks; when failover happens, they start receiving tasks.

The frontend can be configured to forward workflow starts, signals, and queries to the active cluster. The clusters should be able to see each other, but this is needed for replication to work anyway.
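The “each cluster has its own workers, but only the active one receives tasks” setup can be sketched as a toy dispatcher in plain Python (the `Region`/`GlobalNamespace` classes are hypothetical models, not Temporal APIs): workers keep polling their local cluster and a failover simply flips which pool the tasks flow to, so no worker configuration changes.

```python
# Toy model: tasks are only delivered to the worker pool of the currently
# active cluster; a failover flips delivery without reconfiguring workers.

class Region:
    def __init__(self, name: str):
        self.name = name
        self.delivered = []       # tasks this region's workers received


class GlobalNamespace:
    def __init__(self, regions, active: str):
        self.regions = {r.name: r for r in regions}
        self.active = active

    def dispatch(self, task: str) -> None:
        # The frontend forwards work to the active cluster only.
        self.regions[self.active].delivered.append(task)

    def failover(self, new_active: str) -> None:
        self.active = new_active


ns = GlobalNamespace([Region("us-east"), Region("us-west")], active="us-east")
ns.dispatch("task-1")
ns.failover("us-west")            # operator fails the namespace over
ns.dispatch("task-2")

print(ns.regions["us-east"].delivered)  # ['task-1']
print(ns.regions["us-west"].delivered)  # ['task-2']
```

The us-west workers existed and were polling the whole time; they simply had nothing to do until the failover.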


Activities must be idempotent in multi-region scenarios.


Thanks Maxim,

I have some follow-up questions

  • Cassandra datastore replication

This is not an option. Temporal requires a fully consistent database, and Cassandra’s wide-area replication doesn’t fit this category.

Will it work if we run active-passive with only the data tier replicated and active on DR? In the case of a failure we would update all the workers to point to the DR site and activate the Temporal server there, which would resume workflow execution based on the last known replicated state. If all activities are idempotent, the situation may be quite similar to multi-cluster, minus the conflict resolution (which we won’t have, because the Temporal cluster would be a singleton in this setup), and minus the zero-downtime maintenance that multi-cluster gives us in theory.

  • started and replicated:
    What happens when the region comes back online? Does it do an initial sanity check with the DR site to see if it missed a lot of events and reconcile the current state of workflows, so it does not try to resume everything from the last known state?

Not yet. It will be possible in the future once clean failover is implemented. At this point it is the operator’s responsibility to let the replication backlog drain before initiating failover back to the original region. If the backlog is not drained, the workflows are going to resume from the last known state and all events from the remote region will be treated as conflicts (which is going to reapply signals as new signals).

Suppose the primary was aborted (due to a region failure) and the operator chose to activate DR as the new primary (NP) for the namespace, which resumed the workflows, and this setup has been running for quite a while. What is going to happen when the original primary (OP) comes back online? Will OP start re-applying the workflows as if it were still the primary, until it receives the sync from NP, which would initiate the conflict-resolution flow?

If the namespace was updated to have NP as the current primary, what happens to the namespace state when OP comes back online? Will it reconcile the state and assume the secondary role?

Is there a feature that could pause workflow execution on OP?

How can I try the multi-cluster setup? The documentation says to reach out to the Temporal team…

Do not use Cassandra data replication for Temporal. Temporal itself maintains a lot of internal state (across tables); simply using Cassandra data replication will cause undefined behavior.

Temporal’s own replication stack will bring a cluster up to speed once it is back online and drain the backlog.


Got it: don’t use Cassandra multi-DC replication, otherwise things will eventually fall apart in a very consistent way. I still have a requirement to provide some options for business continuity, and having to run in a single region is going to be a hard sell.

How can I set up multi-cluster? The documentation page does not have much detail.

I wouldn’t recommend multi-cluster at this point, as it is an experimental feature, especially if your use case involves child workflows.

Considering everything mentioned, it does look like the solution is to build a cluster in each region and, in the case of a disaster, run all new workload in the healthy region while waiting for the failed region to come back online and resume operations, leaving the DR region to drain its workflows in due time?

Would an alternative backend like PostgreSQL work better for the multi-region scenario? State replication to another region would be consistent, so in theory Temporal could resume operation in another region from a point in time in the past.

Temporal’s architecture would have to be adjusted to support multi-region databases. It is not going to work out of the box.

Thanks Maxim

I will put a cluster in each region and, when the multi-cluster feature reaches a state where we can use it, introduce global namespaces. Both global and local namespaces are going to be needed, because there will be a split between deterministic and idempotent workflows.

Would you define what you mean by those terms? I’ve never heard them before.

I guess I have juggled the terminology a bit. It applies to the scenario where the Temporal state may experience data loss during a regional outage, due to the async multi-cluster replication delay, OR when a point-in-time recovery is performed on the state for some as-yet-unknown reason.

Does the below make sense?

A “deterministic workflow” contains at least one deterministic activity. It won’t be safe to replay the workflow instance (partially or completely) for recovery purposes, because a deterministic activity is going to change the downstream business system’s state every time it runs.

An “idempotent workflow” contains only idempotent activities. It will be safe to replay the workflow instance (partially or completely) for recovery purposes, as an idempotent activity changes the downstream business system’s state only once, no matter how many times it runs.
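The two definitions above can be contrasted in a few lines of plain Python (the `send_invoice_*` activities and the in-memory `invoices_sent` list are hypothetical stand-ins for a downstream business system; no Temporal SDK is involved):

```python
# Contrast from the definitions above: replaying a workflow re-runs its
# activities, so a non-idempotent activity mutates downstream state on
# every run, while an idempotent one mutates it at most once per key.

invoices_sent = []   # the downstream business system's state


def send_invoice_non_idempotent(order_id: str) -> None:
    invoices_sent.append(order_id)          # duplicated on every replay


def send_invoice_idempotent(order_id: str) -> None:
    if order_id not in invoices_sent:       # key check makes replay safe
        invoices_sent.append(order_id)


def replay(activity, times: int = 2) -> None:
    """Simulates the workflow being replayed `times` times after recovery."""
    for _ in range(times):
        activity("order-7")


replay(send_invoice_non_idempotent)
print(invoices_sent.count("order-7"))   # 2: the customer is invoiced twice

invoices_sent.clear()
replay(send_invoice_idempotent)
print(invoices_sent.count("order-7"))   # 1: safe to replay
```

A workflow containing even one activity of the first kind is what the post calls a “deterministic workflow”: replaying it for recovery double-applies the side effect.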

Thanks for the explanation.

I call a “deterministic activity” a “non-idempotent activity”.
