Hello, we’re running a Temporal 1.4 cluster with 4096 history shards and Cassandra as our storage. We’re doing a major revamp of our k8s cluster, and we’re trying to figure out how to move the Temporal cluster from the old k8s cluster to the new one without any downtime.
Temporal runs entirely in the k8s cluster, and Cassandra is on EC2. Could I use multi-cluster replication? Can I somehow get ringpop to work across the two clusters? Is this even possible to do without downtime?
Got some info from the server team on this:
One thing you could do is deploy the Temporal services on two k8s clusters, both talking to the same persistence and to each other. The hard part would be setting up the networking properly: both clusters would need to be in the same VPC, with security groups configured so that services can reach the pod IPs in both clusters. If they are in different VPCs, they would need VPC peering in addition to the security groups.
You can also look at multi-cluster replication, but I believe replication would only apply to executions that start after it is set up.
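As a rough way to check that networking requirement before relying on it, here is a minimal Python sketch that tests whether pods in the other cluster are reachable over TCP on the ringpop membership ports. The port numbers are assumptions taken from Temporal's sample configs (6933 to 6935 and 6939), so check them against your own config. The gRPC ports would need the same check, and a successful TCP connect only shows the path is open, not that ringpop will actually form a single ring.

```python
import socket

# Assumed membership ports from Temporal's sample configs; verify against
# the `services.<name>.rpc.membershipPort` values in your own deployment.
MEMBERSHIP_PORTS = {
    "frontend": 6933,
    "history": 6934,
    "matching": 6935,
    "worker": 6939,
}


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable addresses.
        return False


def unreachable_pods(pod_ips, ports=None, timeout=2.0):
    """Return (service, ip, port) for every pod whose port cannot be reached.

    pod_ips maps a service name to the list of pod IPs running it.
    """
    ports = MEMBERSHIP_PORTS if ports is None else ports
    failures = []
    for service, ips in pod_ips.items():
        port = ports[service]
        for ip in ips:
            if not can_connect(ip, port, timeout):
                failures.append((service, ip, port))
    return failures
```

You would run this from a pod in one cluster, passing the pod IPs of the other cluster (for example `unreachable_pods({"history": ["10.0.1.12"]})`, where the IP is a placeholder), and then repeat it in the opposite direction. An empty list means every listed pod accepted the connection.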
What’s the life span of workflows that are running on your current cluster? Are they pretty short lived or long running?
Great idea to connect ringpop using VPC, but I’m afraid we won’t be able to do that since we’re running Calico on our old k8s cluster.
Workflows can run for up to 2 days, let’s say, so pretty long. I have an idea, though. Let’s say a minute of downtime is acceptable. Could we quickly stop the cluster in the old k8s and start a new one in the new k8s? This wouldn’t take long, and the new cluster would just pick up where the old one left off.
Another question: what if, during this time, we launched the new cluster and it was NOT able to communicate with the old one? I guess Temporal would grind to a halt, but once we stopped the old cluster and ringpop cleared out the old IPs, it would continue as if nothing had happened, right?
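To make the stop-then-start idea concrete, here is a rough Python sketch of the sequence I have in mind. The deployment names, kube contexts, and namespace are all hypothetical placeholders, and it assumes the Temporal deployments already exist in the new cluster, scaled to zero. It only builds and optionally runs `kubectl scale` commands. It does not answer whether Temporal resumes cleanly afterwards.

```python
import subprocess

# Hypothetical deployment names; replace with whatever your manifests use.
SERVICES = [
    "temporal-frontend",
    "temporal-history",
    "temporal-matching",
    "temporal-worker",
]


def scale_cmd(context, namespace, deployment, replicas):
    """Build a `kubectl scale` command for one deployment in one cluster."""
    return [
        "kubectl", "--context", context, "-n", namespace,
        "scale", f"deployment/{deployment}", f"--replicas={replicas}",
    ]


def cutover_plan(old_context, new_context, namespace, replicas):
    """Scale everything in the old cluster to 0 first, then scale up the new one."""
    plan = [scale_cmd(old_context, namespace, d, 0) for d in SERVICES]
    plan += [scale_cmd(new_context, namespace, d, replicas) for d in SERVICES]
    return plan


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_plan(cutover_plan("old-k8s", "new-k8s", "temporal", 3))` just prints the eight commands in order. Note that `kubectl scale` returns before the old pods have actually terminated, so a real run would need to wait for them to go away before scaling up the new side.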
Thanks for your help.