Hi there,
we are in the process of enabling replication for our Temporal servers. We already have multiple instances for which clusterMetadata
looks something like this:
clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 10
  masterClusterName: active
  currentClusterName: active
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 1
      rpcName: "frontend"
      rpcAddress: "dns:///<hostname>:7233"
Unfortunately, we chose the same cluster name for all of our installations (probably copied from the docs). This did not matter while we were not using replication, but with replication the names must be unique. We would also like to apply a naming scheme that makes it easier to tell the clusters apart. Ideally, we would of course change the cluster name without downtime, but so far we have had no success.
We tried changing the cluster name in the config and redeploying. This leads to a panic: "panic: Cluster info initial versions have duplicates". This happens here. Duplicate failover versions are not allowed. I managed to get past this point by manually deleting the record for the “active” cluster from the cluster_metadata_info table. However, this creates downtime (since I first have to scale the K8s deployment to 0 to prevent the old Pods from recreating the row). Also, I am not sure whether this is a safe operation.
Next, we tried changing the cluster name together with the initialFailoverVersion. This way the server comes up, and temporal operator cluster list shows both “clusters”. However, the existing namespaces are still active in the “active” cluster. Trying to execute a workflow results in:
Error: Namespace: default is active in cluster: active, while current cluster foo is a standby cluster.
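For illustration, the config for this second attempt would look roughly like the sketch below. The name foo is taken from the error message above. The value 2 for initialFailoverVersion is only an assumed example (anything different from the existing cluster's 1), and setting masterClusterName to the new name is likewise an assumption:

```yaml
clusterMetadata:
  enableGlobalNamespace: true
  failoverVersionIncrement: 10
  masterClusterName: foo            # assumed: renamed together with currentClusterName
  currentClusterName: foo
  clusterInformation:
    foo:                            # new cluster name, replacing "active"
      enabled: true
      initialFailoverVersion: 2     # assumed example value, must differ from the old cluster's 1
      rpcName: "frontend"
      rpcAddress: "dns:///<hostname>:7233"
```

With a config like this, the old “active” entry still exists in the database next to the new one, which matches the two “clusters” shown by the cluster list command.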
We are not sure what else to try, or whether this is even possible.
The best we can come up with right now is to create a new cluster with a proper name, replicate the “active” cluster to this new cluster and then fail over to this cluster. At this stage we should be able to take down the “active” cluster, change its configuration and bring it back up.
On that note, how does one know that replication between two clusters is complete and that it is safe to take one cluster down?
Thanks in advance for any help.