What is the correct way to disable & re-enable multi-cluster replication?

I have two v1.19.1 Temporal clusters that had been replicating to each other for a long time. I took one offline for extended DB maintenance, and now it is no longer receiving new workflow history.

The full order of operations was:

  1. Start with two clusters replicating to each other; cluster A is the primary and is the active cluster for every namespace except one.
  2. I moved the one global namespace from cluster B to cluster A, and progress continued to be made.
  3. I scaled cluster B to zero Kubernetes replicas. I did this without severing the connection between the two clusters (i.e. without running admin cluster upsert-remote-cluster --enable_connection false or admin cluster remove-remote-cluster), because in the past I’ve noticed that severing the connection causes a gap in workflow history in cluster B. (See the command sketch after this list.)
  4. (two weeks pass)
  5. Cluster B is scaled back to its normal number of Kubernetes replicas.
  6. Wait 20 minutes for replication to catch back up.
  7. Move the namespace back from cluster A to cluster B.
  8. Observe that even though worker services have reconnected to cluster B, no progress is being made on a cron workflow stuck at WorkflowExecutionStarted, according to custom telemetry emitted by the workflow’s activities. The cron workflow is scheduled to run every 30 seconds.
  9. Cluster B’s UI (v2.11.1) indicates that the cron workflow is still in a Running state with a start time from before Cluster B was scaled down.
  10. I moved the namespace with the cron workflow back to cluster A, hoping that the connected worker services (as indicated by the workflow’s page in cluster A’s UI) would make progress.
  11. Observe that:
    1. Cluster A says the workflow is stuck in the Running state from before the namespace was moved back to cluster B.
    2. Cluster B says the workflow is still stuck in the Running state from before cluster B was scaled down.
    3. Cluster B does not have recent workflow history for any other namespace either, including namespaces that were never moved between clusters for this maintenance.
    4. Cluster A was emitting replication_tasks_lag metrics with target_cluster set to cluster B, but only extremely infrequently (sometimes with days between emissions). I have not observed replication_tasks_lag from either cluster since.
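
For completeness, this is the disconnect/re-connect sequence I deliberately skipped in step 3. The flag spellings are from my notes and memory, so treat this as a sketch and double-check against tctl admin cluster --help; <other-cluster-frontend:port> and <other-cluster-name> are placeholders:

  # Disable the remote-cluster connection before the maintenance window
  # (I'm not certain whether this needs to be run on one side or both).
  tctl admin cluster upsert-remote-cluster \
    --frontend_address <other-cluster-frontend:port> \
    --enable_connection false

  # After maintenance, re-enable the connection the same way.
  tctl admin cluster upsert-remote-cluster \
    --frontend_address <other-cluster-frontend:port> \
    --enable_connection true

  # The more drastic option I also avoided (flag name from memory):
  tctl admin cluster remove-remote-cluster --cluster <other-cluster-name>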

Some other facts:

  • Global namespaces are enabled for the clusters, and all namespaces are global.
  • Both clusters have a failover version increment of 10. Cluster A’s initial failover version is 1, and cluster B’s is 2.
  • The namespace in question currently has a failover version of 21, and the failover history entries I see are (see the version arithmetic after this list):
    • Version 1: initial
    • Version 2: from 3 months prior
    • Version 11: me moving the namespace from cluster B to cluster A, before cluster B was taken offline
    • Version 12: me moving the namespace from cluster A to cluster B, after cluster B was brought back online
    • Version 21: me moving the namespace from cluster B back to cluster A
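
For what it’s worth, those version numbers line up with my understanding of how failover versions are assigned (my understanding, not something I’ve verified in the server code): on each failover the namespace jumps to the smallest version greater than its current one that is congruent to the target cluster’s initial failover version modulo the increment. With increment 10, initial version 1 for A and 2 for B:

  2  -> move to cluster A -> 11   (11 mod 10 = 1)
  11 -> move to cluster B -> 12   (12 mod 10 = 2)
  12 -> move to cluster A -> 21   (21 mod 10 = 1)

which matches the history above, so the namespace metadata itself looks sane to me.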

Clearly, I missed a step in either taking cluster B offline or bringing it back online, but I’m not sure where. What are the correct steps I should have taken to take cluster B offline, such that it would catch up on replication when it comes back online?

And as a quick follow-up:

I am able to manually start workflows in cluster A, both in the problematic namespace and in others, but none of that history gets replicated to cluster B.

Some other info:

  • Requests to cluster A’s temporal.server.api.historyservice.v1.HistoryService/GetReplicationMessages are hitting the 30-second timeout. The requests are coming from cluster A’s own frontend, but I don’t have any tracing beyond that.
  • Cluster A is emitting persistence_error_with_type metrics with operation:getreplicationtasks & error_type:serviceerrorunavailable at a fairly constant rate. The errors appear to be <1% of the total operation:getreplicationtasks persistence_requests, though.
  • Cluster A’s history service logs contain a lot of these messages, all with context canceled:
    • Failed to retrieve replication messages.
    • replication task reader encounter error, return earlier
    • Persistent fetch operation Failure
  • Cluster A’s replication_tasks table (MySQL) is only INSERTed to, never DELETEd from. Cluster B’s replication_tasks table sees no INSERTs or DELETEs at all. (See the query sketch after this list.)
  • Cluster A’s replication_tasks_fetched metric is a flat zero.
  • Cluster A has >2.3 million replication_tasks rows; cluster B has zero.
  • There are no obvious errors in the persistence DB telemetry.
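
In case it helps, this is roughly how I’ve been watching the backlog. It assumes the standard Temporal MySQL schema where replication_tasks is keyed by (shard_id, task_id); the host, credentials, and database name (temporal here) are placeholders for my setup:

  # Rough per-shard view of the un-acked replication task backlog on cluster A.
  mysql -h <cluster-a-db-host> -u <user> -p temporal -e "
    SELECT shard_id,
           COUNT(*)     AS backlog,
           MIN(task_id) AS min_task_id,
           MAX(task_id) AS max_task_id
    FROM   replication_tasks
    GROUP  BY shard_id
    ORDER  BY backlog DESC
    LIMIT  20;"

On cluster A the backlog and max_task_id only ever grow; the equivalent query on cluster B returns nothing.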

I can’t tell where the errors are actually coming from, i.e. whether it’s a cluster<>cluster connection issue or a cluster<>DB issue.

No networking changes were made while cluster B was offline, but here are some details about the setup:

  • The clusters are hosted in Kubernetes in the same namespace.
  • Cluster A’s and cluster B’s services use different port numbers so they don’t discover each other.
  • The cluster frontend addresses (from tctl admin cluster list) aren’t qualified with the Kubernetes namespace, but qualifying them doesn’t seem to change anything. (See the connectivity check after this list.)
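
To rule out basic reachability I’ve been doing roughly the following from inside the cluster; the pod names, service names, and ports are placeholders, and I run tctl from a pod that has it installed (an admin-tools pod in my case):

  # From cluster A's side, check that cluster B's frontend answers health checks.
  kubectl exec -it <cluster-a-admintools-pod> -- \
    tctl --address <cluster-b-frontend-service>:<cluster-b-frontend-port> cluster health

  # And confirm each cluster still lists the other as a remote cluster.
  kubectl exec -it <cluster-a-admintools-pod> -- tctl admin cluster list
  kubectl exec -it <cluster-b-admintools-pod> -- tctl admin cluster list

If anyone knows a better way to see which side the GetReplicationMessages timeouts originate from, I’m all ears.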

I may have found a bug in Temporal: this may be happening because the workflows in cluster A were completed and then deleted per the retention policy, and replication task processing isn’t handling the missing executions gracefully.

What makes me think that is that the log messages I mentioned in the last comment always appear together, in this order:

  • Persistent fetch operation Failure (workflow execution was not found?)
  • replication task reader encounter error, return earlier (the HistoryReplicationTask processing errored)
  • Failed to retrieve replication messages.

I raised a GitHub issue: “Replication tasks referencing archived workflow executions can’t be processed, blocking all replication” (temporalio/temporal#4348).