We are running two Temporal clusters in separate datacentres, each backed by its own Postgres database. Both Temporal and the consumer application run in OpenShift. This setup lets us perform blue/green deployments and zero-downtime upgrades.
User and webhook traffic is load balanced between the two datacentres, but this makes resubmitting or resuming workflows unreliable: a follow-up request may be routed to a different DC than the one where the workflow originated, and the Temporal cluster in that DC has no record of the workflow.
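To make the failure mode concrete, here is a toy sketch in Python. It does not use the Temporal SDK. `Cluster`, `simulate`, and the `order-123` workflow ID are made up for illustration, and the round-robin balancer is an assumption standing in for whatever routing policy is actually in place.

```python
import itertools


class Cluster:
    """Toy stand-in for one Temporal cluster with its own database."""

    def __init__(self, name):
        self.name = name
        self.workflows = {}  # workflow_id -> list of received signals

    def start(self, workflow_id):
        self.workflows[workflow_id] = []

    def signal(self, workflow_id, payload):
        # A cluster only knows about workflows stored in its own database.
        if workflow_id not in self.workflows:
            raise LookupError(f"{workflow_id} not found in {self.name}")
        self.workflows[workflow_id].append(payload)


def simulate():
    dc1, dc2 = Cluster("dc1"), Cluster("dc2")
    # Assumed round-robin balancer with no awareness of workflow origin.
    balancer = itertools.cycle([dc1, dc2])

    # The first request starts the workflow in whichever DC it lands on.
    origin = next(balancer)
    origin.start("order-123")

    # The follow-up webhook is routed independently of the first request.
    target = next(balancer)
    try:
        target.signal("order-123", "payment-confirmed")
        delivered = True
    except LookupError:
        delivered = False
    return origin.name, target.name, delivered
```

In this sketch the workflow starts in `dc1`, the webhook lands in `dc2`, and the signal is not delivered, which mirrors what we see when a resume request reaches the wrong DC.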
We’re trying to figure out how to handle this without falling back to a single Temporal cluster, which would introduce a single point of failure. Any suggestions would be appreciated.