Hi Temporal team,
We are working on a disaster recovery strategy for our self-hosted Temporal server, and we would much appreciate your input.
We started with a manual switchover strategy and may move to Temporal's multi-cluster replication feature in the future.
- Issues in the server failover
We use a PostgreSQL database (primary nodes are in the east region, backup nodes are in the west). When we manually switched the Temporal server from east to west (disabling the server in the east cluster and enabling it in the west cluster), we encountered 504 Gateway Timeout errors when visiting Web UI pages, and the history service logged persistent store operation failures.
Question: From the forum posts I have read, my guess is that this is caused by multiple clusters trying to use the same DB node. Is that correct?
To avoid this issue during failover, here is the procedure I have in mind:
Step 1. Disable the Temporal server in the east cluster (we have a configuration setting to enable/disable the server).
Step 2. Wait 5 minutes to drain the worker applications' requests to the server.
Step 3. Switch the PostgreSQL primary from the east cluster to the async standby node in the west cluster.
Step 4. Enable the Temporal server in the west cluster.
Do you think following these steps would avoid the DB persistence issue?
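To make the ordering concrete, here is a rough sketch of how I imagine scripting these steps. It is only an illustration, not something we run today: `disable_east`, `promote_west_db`, `west_db_is_primary`, and `enable_west` are hypothetical callables standing in for our own deployment tooling, and the timeout and polling values are placeholders. The one thing it adds to the list above is a check that the west standby has actually become primary before the west server is enabled (for example, by running `SELECT pg_is_in_recovery()` against it and waiting for `false`).

```python
import time
from typing import Callable, List


def run_failover(
    disable_east: Callable[[], None],        # hypothetical: turns off the east Temporal server
    promote_west_db: Callable[[], None],     # hypothetical: triggers the PostgreSQL switchover
    west_db_is_primary: Callable[[], bool],  # hypothetical: e.g. pg_is_in_recovery() is false
    enable_west: Callable[[], None],         # hypothetical: turns on the west Temporal server
    drain_seconds: float = 300.0,            # the 5 minute wait from step 2
    promote_timeout_seconds: float = 120.0,  # placeholder value
    poll_seconds: float = 5.0,               # placeholder value
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run steps 1-4 in order and return a log of the completed steps."""
    log: List[str] = []

    # Step 1: stop the east Temporal server so it no longer writes to the DB.
    disable_east()
    log.append("east disabled")

    # Step 2: give in-flight worker requests time to drain.
    sleep(drain_seconds)
    log.append("drained")

    # Step 3: switch the DB over, then poll until the west node reports primary.
    promote_west_db()
    waited = 0.0
    while not west_db_is_primary():
        if waited >= promote_timeout_seconds:
            # Abort before step 4 so the west server never starts
            # against a node that is still a standby.
            raise TimeoutError("west standby was not promoted in time")
        sleep(poll_seconds)
        waited += poll_seconds
    log.append("west db primary")

    # Step 4: only now enable the west Temporal server.
    enable_west()
    log.append("west enabled")
    return log
```

The intent is that at no point are both Temporal servers enabled, and that the west server only starts once it has a writable primary to talk to. Since the west node is an async standby, I assume some recent writes could still be missing after the switchover; the sketch does not try to handle that.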
Again, thank you so much for the Temporal team's continuous help!