Hi @nathan_shields, happy to answer!
> Does this mean that failover from one cluster to another is a manual process, or at least doesn’t come “out of the box” with XDC?
Failover is manually initiated via `tctl`, for example:

```
tctl --ns default n up --ac useast1
```

(shorthand for `tctl --namespace default namespace update --active_cluster useast1`)
There are various ways you can approach making the application zero-effort on failovers, but out of the box, the SDKs don’t yet “support” hands-off failovers. Our specific approach makes individual application processes fatter by creating workers that connect to each Temporal Cluster. We have an `XdcClusterSupport` class that polls the Temporal API to determine which cluster is active for the namespaces the application is interested in, then automatically enables/disables polling on workers as needed, so we’re not generating unnecessary load on inactive clusters. We’ve considered building a global Temporal API proxy for this behavior so that applications wouldn’t need to concern themselves with it at all – they’d just connect to “the” Temporal API, which would route requests appropriately.
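To make the idea concrete, here’s a minimal Python sketch of that reconciliation loop. Everything here is a hypothetical stand-in – `Worker` is not the real SDK worker type, and `describe_namespace` stands in for whatever Temporal API call reports the namespace’s active cluster:

```python
class Worker:
    """Stand-in for an SDK worker bound to a single cluster."""

    def __init__(self, cluster: str):
        self.cluster = cluster
        self.polling = False

    def start_polling(self):
        self.polling = True

    def stop_polling(self):
        self.polling = False


class XdcClusterSupport:
    """Polls for the active cluster and toggles per-cluster workers.

    In practice this would run periodically on a background thread;
    here we expose a single reconcile() step for clarity.
    """

    def __init__(self, workers, describe_namespace):
        self.workers = workers                  # one worker per cluster
        self.describe_namespace = describe_namespace

    def reconcile(self):
        active = self.describe_namespace()      # name of the active cluster
        for worker in self.workers:
            if worker.cluster == active and not worker.polling:
                worker.start_polling()          # became active: start polling
            elif worker.cluster != active and worker.polling:
                worker.stop_polling()           # became passive: stop polling
```

The key property is that only the worker connected to the currently active cluster polls for tasks; the others stay idle until a failover flips the active cluster.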
However, from an application owner’s perspective, they don’t need to do anything during a failover – our extension SDK makes failovers entirely hands-off for them.
> The docs specify that the data between clusters is not strongly consistent. Has this caused any issues for you?
It has, yes, but IIRC only with Child Workflow usage. Replication is performed independently at the shard level, and child workflows are not necessarily in the same shard as their parent. It’s therefore possible for a child’s history to be replicated before the parent’s, so during a failover you can see “child workflow execution already exists” exceptions – the parent’s history hasn’t been replicated yet, so it doesn’t know it already started that execution.
Your workflow code must account for this edge case. In keeping with our goal of making failovers as hands-off as possible for application developers, we have a utility `Dispatcher` class that Workflows can use, which handles the various error conditions associated with using Child Workflows.
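The shape of that handling can be sketched in a few lines. This is an illustrative stand-in, not the real `Dispatcher` – `start_child`, `get_handle`, and the exception type are hypothetical names for the SDK’s start-child call, its attach-to-existing-execution call, and its “child workflow execution already exists” error:

```python
class ChildAlreadyExistsError(Exception):
    """Stand-in for the SDK's 'child workflow execution already exists'."""


def dispatch_child(start_child, get_handle, workflow_id):
    """Start a child workflow, tolerating a replica that raced ahead.

    After a failover, the child's shard may have replicated before the
    parent's, so the child can already exist on the newly active cluster.
    """
    try:
        return start_child(workflow_id)
    except ChildAlreadyExistsError:
        # The child was already started (e.g. before the failover);
        # attach to the existing execution instead of failing the parent.
        return get_handle(workflow_id)
```

The point is that the workflow author calls one dispatch helper and never sees the already-exists error, whether or not a failover happened mid-flight.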
It’s my understanding that the Temporal team plans to make replication behavior a bit safer out of the box, so that teams won’t need to build up the utilities we have.
> I see that Cadence has an “api-forwarding” feature that lets passive clusters forward signal/start workflow requests to the active cluster. Did you figure out if Temporal supports this?
Temporal does support this, and it should definitely be enabled in an XDC topology. While it behaves as advertised, it is an incomplete solution for all failure modes. Specifically, this functionality won’t help you if the inactive Temporal cluster is unavailable for whatever reason (e.g. the cluster is physically unreachable). That’s why we built behavior that does not solely depend on it – an application owner should not be impacted by a cluster disappearing entirely for a period of time (like when we need to upgrade our SQL server).
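The “don’t depend solely on forwarding” idea boils down to client-side fallback. Here’s a hypothetical sketch (the `clients` and `send` names are stand-ins, not a real SDK API): try each cluster’s frontend in turn, and let api-forwarding handle routing if the cluster you reach happens to be passive:

```python
def signal_with_fallback(clients, signal):
    """Try each cluster's client in turn until one accepts the request.

    If the first cluster is reachable but passive, api-forwarding routes
    the request to the active cluster for us; if it's entirely down, we
    fall through to the next cluster instead of failing the caller.
    """
    last_error = None
    for client in clients:
        try:
            return client.send(signal)
        except ConnectionError as err:
            last_error = err  # this cluster is unreachable; try the next
    raise last_error
```

With this in place, a cluster vanishing entirely degrades to a retry against the other cluster rather than an outage for the application owner.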
Hope that helps, happy to answer more / clarify as needed.