I have two v1.19.1 Temporal clusters that had been replicating to each other for a long time. I took one offline for extended DB maintenance, and it is no longer receiving new workflow history.
The full order of operations was:
- Start with two clusters replicating to each other; cluster A is the primary and is the active cluster for every namespace except one.
- I moved that one global namespace from cluster B to cluster A (via the failover command sketched after this list), and workflows continued to make progress.
- I scaled cluster B down to zero Kubernetes replicas. I did this without severing the connection between the two clusters (i.e. without running `admin cluster upsert-remote-cluster --enable_connection false` or `admin cluster remove-remote-cluster`, sketched more fully after this list), because in the past I've noticed that severing the connection causes a gap in cluster B's workflow history.
- (Two weeks pass.)
- Cluster B is scaled back to its normal number of Kubernetes replicas.
- Wait 20 minutes for replication to catch back up.
- Move the namespace back from cluster A to cluster B.
- Observe that even though the worker services have reconnected to cluster B, no progress is being made on a cron workflow that is stuck at `WorkflowExecutionStarted`, as observed via custom telemetry emitted by the workflow's activities. The cron workflow is scheduled to execute every 30 seconds.
- Cluster B's UI (v2.11.1) indicates that the cron workflow is still in a Running state, with a start time from before cluster B was scaled down.
- I moved the namespace with the cron workflow back to cluster A, hoping that the connected worker services (as indicated by the workflow’s page in cluster A’s UI) would make progress.
- Observe that:
- Cluster A says the workflow is stuck in the Running state from before the namespace was moved back to cluster B.
- Cluster B says the workflow is still stuck in the Running state from before cluster B was scaled down.
- Cluster B does not have recent workflow history for any other namespace either, including namespaces that were never moved between clusters during this maintenance.
- Cluster A was emitting `replication_tasks_lag` metrics with a `target_cluster` of cluster B, but extremely infrequently (sometimes with days between emissions). I have not observed `replication_tasks_lag` from either cluster A or cluster B since.
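For concreteness, these are the disconnect commands I'm referring to (and deliberately did not run). I'm reconstructing the flags from memory, so treat them as approximate rather than exact:

```
# Disable replication to cluster B from cluster A's side
# (NOT run during this maintenance; the --frontend_address value is a placeholder)
tctl admin cluster upsert-remote-cluster --frontend_address <cluster-b-frontend:7233> --enable_connection false

# Remove cluster B from cluster A's remote-cluster list entirely (also NOT run)
tctl admin cluster remove-remote-cluster --cluster <cluster-b-name>
```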
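The namespace moves themselves were done with the usual failover command, something along these lines (namespace and cluster names are placeholders):

```
# Fail the global namespace over so cluster A becomes active (placeholder names)
tctl --namespace <my-namespace> namespace update --active_cluster <cluster-a-name>
```

Moving it back to cluster B was the same command with `<cluster-b-name>` as the active cluster.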
Some other facts:
- Global namespaces are enabled for the clusters, and all namespaces are global.
- Both clusters have a failover version increment of 10. Cluster A's initial failover version is 1, and cluster B's is 2.
- The namespace in question currently has a failover version of 21, and the last failover history timestamps I see are from:
- Version 1: initial
- Version 2: from 3 months prior
- Version 11: me moving the namespace from cluster B to cluster A, before cluster B was taken offline
- Version 12: me moving the namespace from cluster A to cluster B, after cluster B was brought back online
- Version 21: me moving the namespace from cluster B back to cluster A
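For what it's worth, these versions look consistent with the failover version arithmetic as I understand it: each cluster's versions advance by the increment from its initial value, so `version % 10` identifies which cluster was active at that point. I pulled the failover version and history with something like the following (exact output fields may vary by tctl version):

```
# Show the namespace's failover version and failover history
tctl --namespace <my-namespace> namespace describe

# As I understand the versioning scheme:
#   versions 1, 11, 21 -> cluster A active (initial failover version 1)
#   versions 2, 12     -> cluster B active (initial failover version 2)
```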
Clearly, I missed a step either in taking cluster B offline or in bringing it back online, but I'm not sure where. What are the correct steps for taking cluster B offline so that it catches up on replication when it comes back online?