I recently started a thread in Slack trying to find other users of XDC who exercise failovers regularly. While the feature is experimental, a multi-region presence is crucial for us at Netflix (especially for a Platform service like this), so I’m seeking out others who have been playing with it or who actively depend on it like we do.
This post covers what we know, what we’re doing, and what our experience with the feature has been so far. I’m eager to learn where we might improve, and to provide whatever additional info might be helpful for future users!
Background
We currently operate Temporal out of AWS us-west-2 and us-east-1. We’ve elected to name each cluster after its region to make things easier to reason about: the Temporal doc examples show clusters named “active” and “passive”, however this gets confusing once you fail over and the cluster named “active” is no longer active. It’s important to note that a cluster cannot be renamed.
We’ve not experienced many problems with XDC on a day-to-day basis. Replication task lag is normally around ~170ms. My current understanding (based on intuition) is that this metric measures how long the replication worker takes to actually process a task, rather than the end-to-end lag of delivering the task to passive clusters.
Our cross-regional latencies are quite low, so we’ve not observed any substantial issues here, although we’ve not yet attempted replicating across oceans.
Internal Temporal SDK
We offer a Netflix Temporal SDK, which layers additional functionality for our customers atop the Java Temporal SDK. This includes integration with Spring Boot and a variety of Platform features and services, as well as a handful of conventions and higher-level abstractions that customers can use to be safer and more productive. One such abstraction is managed handling of Temporal in an XDC topology.
Following the guidance of other posts in the community, an application will create a WorkerFactory for each cluster. A (partial) example of an application config (note: the URIs are scrubbed; we use FQDNs in reality):
temporal:
  enabled: true
  defaultNamespace: spinnaker
  clusters:
    uswest2:
      uiTarget: https://temporal-main-uswest2
      workerTarget: temporal-main-uswest2:8980
    useast1:
      uiTarget: https://temporal-main-useast1
      workerTarget: temporal-main-useast1:8980
  workers:
    cloud-operation-runners:
      awaitTerminationTimeout: PT5M
      enabled: true
    titus-cloud-operations:
      awaitTerminationTimeout: PT20M
      enabled: true
    aws-cloud-operations:
      awaitTerminationTimeout: PT20M
      enabled: true
For each cluster, we create a WorkflowServiceStubs and WorkflowClient. Then, we iterate over the workers map, creating a WorkerFactory on each cluster for every worker entry. So, in the above example, we’d end up with 6 WorkerFactory objects in the Spring context. All WorkerFactory objects are polling by default, so if we as Temporal operators were to fail over Temporal itself, applications would continue to process work without skipping a beat.
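To make that concrete, the per-cluster wiring looks roughly like the sketch below. This is a simplified illustration rather than our actual SDK code: the ClusterProperties class, the key scheme of the resulting map, and the worker/namespace parameters are stand-ins for our real Spring Boot configuration.

import io.temporal.client.WorkflowClient
import io.temporal.client.WorkflowClientOptions
import io.temporal.serviceclient.WorkflowServiceStubs
import io.temporal.serviceclient.WorkflowServiceStubsOptions
import io.temporal.worker.WorkerFactory

// Simplified stand-in for the cluster entries in the config above.
data class ClusterProperties(val name: String, val workerTarget: String)

fun buildWorkerFactories(
    clusters: List<ClusterProperties>,
    workerNames: List<String>,
    namespace: String
): Map<String, WorkerFactory> {
    val factories = mutableMapOf<String, WorkerFactory>()
    clusters.forEach { cluster ->
        // One WorkflowServiceStubs + WorkflowClient per cluster.
        val stubs = WorkflowServiceStubs.newServiceStubs(
            WorkflowServiceStubsOptions.newBuilder()
                .setTarget(cluster.workerTarget)
                .build()
        )
        val client = WorkflowClient.newInstance(
            stubs,
            WorkflowClientOptions.newBuilder().setNamespace(namespace).build()
        )
        // One WorkerFactory per (worker, cluster) pair; all of them poll by default.
        workerNames.forEach { worker ->
            factories["$worker-${cluster.name}"] = WorkerFactory.newInstance(client)
        }
    }
    return factories
}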
We’re aiming for Temporal to be as hands-off operationally as possible for our customers. Our SDK comes with an XdcClusterSupport class which regularly polls Temporal for the active cluster of the namespaces the application is connected to. Rather than developers injecting WorkflowServiceStubs or WorkflowClient directly and figuring out which one they should use, we have Provider classes for each (as well as WorkerFactory), which will automatically select the correct object for the namespace’s active cluster. For example:
@Component
public class WorkerFactoryProvider(
    private val workerFactories: Map<String, WorkerFactory>,
    private val xdcClusterSupport: XdcClusterSupport
) {
    /**
     * Get the [WorkerFactory] by its [name] attached to the active cluster.
     */
    public fun get(name: String, active: Boolean = true): WorkerFactory {
        return workerFactories
            .filter { it.key.split("-").first() == name }
            .values
            .firstOrNull {
                val workflowClient = it.workflowClient
                if (workflowClient !is ClusterAware) {
                    false
                } else {
                    val activeCluster = xdcClusterSupport.getActiveCluster(workflowClient.options.namespace)
                    (workflowClient.clusterName == activeCluster) == active
                }
            }
            ?: throw NetflixTemporalException("WorkerFactory for '$name' not found")
    }
}
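A caller then only deals with the logical worker name; the provider hides which cluster is currently active. For example (hypothetical usage, assuming the factory map keys encode both the worker name and the cluster):

// Factory attached to the currently-active cluster for this worker.
val activeFactory = workerFactoryProvider.get("cloud-operation-runners")
// Or the factory pointing at the passive side, by passing active = false.
val passiveFactory = workerFactoryProvider.get("cloud-operation-runners", active = false)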
The XDC integration in the SDK also integrates with our dynamic config service, so we as Temporal operators can signal applications globally (or individually) to disable polling of a specific cluster if we need to actually take the cluster down for maintenance. This is especially important for us since we use SQL persistence: we can’t simply scale out our persistence to meet new load demands like we could with Cassandra, so we have to take cluster downtime.
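One way to implement that signal is to suspend or resume polling on the WorkerFactory objects attached to the affected cluster. A minimal sketch of the idea, assuming a hypothetical dynamic config callback (the flag name and the factory-map key scheme are made up for illustration):

import io.temporal.worker.WorkerFactory

// Hypothetical callback invoked when a dynamic config flag such as
// "temporal.polling.uswest2.enabled" changes value.
fun onPollingFlagChanged(
    cluster: String,
    enabled: Boolean,
    workerFactories: Map<String, WorkerFactory>
) {
    workerFactories
        .filterKeys { it.endsWith(cluster) } // assumes keys encode the cluster name
        .values
        .forEach { factory ->
            if (enabled) factory.resumePolling() else factory.suspendPolling()
        }
}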
Existing Issues
These aren’t all-inclusive, but they’re the ones we’ve identified so far in our failover exercises.
io.temporal.client.WorkflowExecutionAlreadyStarted
When using child workflows, parent workflows can fail with this exception during a failover. We hypothesize that it’s a race condition: parent and child workflows are assigned to different shards, and thereby not necessarily replicated at the same time.
In this case, it’s possible that the parent workflow starts a child workflow and the child workflow’s shard is replicated to the new cluster; we then perform a failover, and the parent workflow’s event history is lagging, so it attempts to start the child again.
We’ve only observed this problem once, and I’m not really sure what the resolution would be. Can we catch the exception and then populate our own child workflow stub? It’s a difficult condition to reproduce and test; we’ll likely need to create a new loadtest scenario for Maru that exercises child workflows.
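For reference, the scenario we have in mind is a parent that fans out to several children, so that parent and child histories land in different shards. A rough sketch of what such a workflow could look like (the interfaces and names here are hypothetical, not an existing Maru scenario):

import io.temporal.workflow.Async
import io.temporal.workflow.Promise
import io.temporal.workflow.Workflow
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod

@WorkflowInterface
interface ChildLoadWorkflow {
    @WorkflowMethod
    fun run(input: String): String
}

@WorkflowInterface
interface ParentLoadWorkflow {
    @WorkflowMethod
    fun run(childCount: Int)
}

class ParentLoadWorkflowImpl : ParentLoadWorkflow {
    override fun run(childCount: Int) {
        // Fan out to child workflows; each child gets its own workflow ID
        // (and therefore very likely its own shard), which is the condition
        // we suspect races with replication during a failover.
        val results = (1..childCount).map { i ->
            val child = Workflow.newChildWorkflowStub(ChildLoadWorkflow::class.java)
            Async.function(child::run, "child-$i")
        }
        Promise.allOf(results).get()
    }
}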
SDK metrics don’t include cluster names
Right now, it’s not possible to associate metrics (e.g. polling metrics) with a particular Temporal cluster, as metric tags for the cluster are not included. We’re not totally sure this is a material issue per se, but it would help us confirm whether applications have actually stopped polling the desired cluster. Today, we can infer that a cluster is no longer receiving polling traffic when its task poll metrics drop, so it’s workable.
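Until the SDK tags these natively, one thing we could explore is tagging each cluster’s metrics scope ourselves when building the service stubs. A rough sketch, assuming a tally Scope is already wired up for SDK metrics (the temporal_cluster tag name is our own invention):

import com.uber.m3.tally.Scope
import io.temporal.serviceclient.WorkflowServiceStubs
import io.temporal.serviceclient.WorkflowServiceStubsOptions

// Hypothetical helper: attach a per-cluster tag to the SDK metrics scope so
// poll metrics can be grouped by cluster on our dashboards.
fun clusterTaggedStubs(rootScope: Scope, clusterName: String, target: String): WorkflowServiceStubs =
    WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget(target)
            .setMetricsScope(rootScope.tagged(mapOf("temporal_cluster" to clusterName)))
            .build()
    )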
Future Improvements
We only support the Java SDK right now, but our “Paved Road” supports other languages that we’ll need feature parity with. In the future, I’d like to explore writing a global routing service that implements the Temporal gRPC API and handles routing to the correct active cluster. Temporal does offer the dcReplicationPolicy configuration to route a subset of API traffic from a passive cluster to the active one, but that behavior only works in the happy path, when the cluster is actually healthy and online.