Temporal DR Strategy

Hi Temporal team,

We are working on a disaster recovery strategy for our self-hosted Temporal server. We would much appreciate it if you could chime in.

We started with a manual switchover strategy, and may move to Temporal's multi-cluster replication feature in the future.

  1. Issues in the server failover
    We use a PostgreSQL DB (primary nodes are in the east region, backup nodes are in the west). When we manually switched the Temporal server from east to west (disabling the server in the east cluster and enabling the server in the west cluster), we encountered a 504 Gateway Timeout error when visiting Web UI pages, and persistent store operation failures were logged in the history service.

Question: From the posts I have read in the forum, I guess this is caused by multiple clusters trying to use the same DB node. Is that correct?

To avoid this issue during failover, here is the procedure I have in mind:

Step 1. Disable the Temporal server in the east cluster; there is a configuration setting to enable/disable the server.

Step 2. Wait 5 minutes to drain the worker application requests to the server.

Step 3. Switch the PostgreSQL primary node over from the east cluster to the async standby node in the west cluster.

Step 4. Enable the Temporal server in the west cluster.

Do you think following these steps would avoid the DB persistence issue?
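
The four steps above can be sketched as a small runbook script. This is only an illustration of the ordering, not a Temporal or PostgreSQL API: `run_failover` and the three callables passed into it (`disable_east`, `promote_west_db`, `enable_west`) are hypothetical placeholders for whatever tooling actually toggles the servers and promotes the standby. The sketch enforces that the east server is stopped and drained before the database is switched, so only one cluster talks to the primary at a time. It does not address data that an async standby may not have received before promotion.

```python
import time
from typing import Callable, List


def run_failover(
    disable_east: Callable[[], None],     # hypothetical: stops the east Temporal server
    promote_west_db: Callable[[], None],  # hypothetical: promotes the west PostgreSQL standby
    enable_west: Callable[[], None],      # hypothetical: starts the west Temporal server
    drain_seconds: float = 300.0,         # Step 2: the 5-minute drain window
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run the manual switchover steps strictly in order and return a log."""
    log: List[str] = []

    # Step 1: stop the east server so it no longer writes to the database.
    disable_east()
    log.append("east server disabled")

    # Step 2: wait for in-flight worker requests to drain.
    sleep(drain_seconds)
    log.append(f"drained for {drain_seconds:.0f}s")

    # Step 3: switch the database primary only after the east server is quiet.
    promote_west_db()
    log.append("west database promoted")

    # Step 4: start the west server against the new primary.
    enable_west()
    log.append("west server enabled")

    return log
```

Any step that raises an exception stops the sequence, so the west server is never enabled against a database that was not promoted.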

Again, thank you so much for the Temporal team’s continuous help!

We started with a manual switchover strategy, and may move to Temporal's multi-cluster replication feature in the future.

You should use multi-cluster replication for this use case. I don't think the approach you described can guarantee a correct failover.

We encountered a 504 Gateway Timeout error when visiting Web UI pages, and persistent store operation failures were logged in the history service.

Could you provide more info on this? I'm interested in understanding the errors you got.

Thank you @tihomir! The official Temporal site says “Temporal’s Multi-Cluster Replication feature is considered experimental”. Is that true? Or is it mature enough for a self-hosted Temporal server?


Also, when two clusters are set up as primary and standby and an automatic failover happens, the standby cluster becomes the new primary. In this scenario, is there any chance that requests from both clusters are sent to the same database at some point, causing inconsistency in the DB?