Temporal DR Strategy

andyng · December 5, 2024, 1:07am

Hi Temporal team,

We are working on a disaster recovery strategy for our self-hosted Temporal server, and we would much appreciate your input.

We started with a manual switchover strategy, and may move to Temporal's Multi-Cluster Replication feature in the future.

Issues in the server failover
We use PostgreSQL (primary nodes are in the east region, backup nodes are in the west). When we manually switched the Temporal server from east to west (disabling the server in the east cluster and enabling the server in the west cluster), we encountered 504 Gateway Timeout errors when visiting Web UI pages, and the history service logged persistent store operation failures.

Question: From the posts I have read in the forum, I guess this is caused by multiple clusters trying to use the same DB node. Is that correct?
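One possibility worth ruling out (an assumption, not something confirmed in this thread) is that the west server was pointed at a PostgreSQL node still running as a read-only standby. PostgreSQL's built-in `pg_is_in_recovery()` returns true on a standby, so a minimal sketch of such a check could look like the following. The function names are hypothetical, and `conn` is assumed to be any DB-API connection, for example one opened with psycopg2.

```python
def is_standby(conn) -> bool:
    """Return True if the node behind `conn` is still in recovery (a standby).

    `conn` is assumed to be a DB-API 2.0 connection to the PostgreSQL node
    the west Temporal server would use.
    """
    cur = conn.cursor()
    try:
        # pg_is_in_recovery() is a built-in PostgreSQL function:
        # true on a standby, false on a primary.
        cur.execute("SELECT pg_is_in_recovery()")
        (in_recovery,) = cur.fetchone()
        return bool(in_recovery)
    finally:
        cur.close()


def assert_writable_primary(conn) -> None:
    """Hypothetical guard to run before enabling the server in a region."""
    if is_standby(conn):
        raise RuntimeError(
            "node is still a standby; promote it before enabling the server"
        )
```

Running the guard against the west node before enabling the west server would show whether the persistence failures line up with a node that has not been promoted yet.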

To avoid this issue during failover, here is the procedure I have in mind:

Step 1. Disable the Temporal server in the east cluster (there is a configuration setting to enable/disable the server).

Step 2. Wait 5 minutes to drain requests from the worker applications to the server.

Step 3. Switch the PostgreSQL primary from the east cluster to the async standby node in the west cluster.

Step 4. Enable the Temporal server in the west cluster.
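The ordering of the four steps above can be sketched as a small runbook driver. This is only an illustration of the sequence, not a tested failover tool: `FailoverPlan` and its three hooks are hypothetical names, and each hook stands in for whatever mechanism actually toggles the server or promotes the database in a given deployment.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverPlan:
    """Ordered east-to-west switchover. Every callable is a hypothetical hook."""

    disable_east_server: Callable[[], None]    # Step 1
    promote_west_database: Callable[[], None]  # Step 3
    enable_west_server: Callable[[], None]     # Step 4
    drain_seconds: float = 300.0               # Step 2: the 5-minute drain window
    sleep: Callable[[float], None] = time.sleep
    log: List[str] = field(default_factory=list)

    def run(self) -> List[str]:
        # Step 1: stop the east server so it no longer writes to the database.
        self.disable_east_server()
        self.log.append("east server disabled")

        # Step 2: give worker requests time to drain.
        self.sleep(self.drain_seconds)
        self.log.append(f"drained for {self.drain_seconds:.0f}s")

        # Step 3: make the west PostgreSQL node the primary.
        self.promote_west_database()
        self.log.append("west database promoted")

        # Step 4: only now start the west server.
        self.enable_west_server()
        self.log.append("west server enabled")
        return self.log
```

Because any exception raised by a hook stops `run()` at that point, the west server is never enabled unless the east server was disabled and the database switchover completed first.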

Do you think following these steps would avoid the DB persistence issue?

Again, thank you so much for the Temporal team’s continued help!

tihomir · December 6, 2024, 1:23pm

We started with a manual switchover strategy, and may move to Temporal's Multi-Cluster Replication feature in the future.

You should use multi-cluster replication for this use case. I don't think the approach you described can guarantee correct failover.

We encountered 504 Gateway Timeout errors when visiting Web UI pages, and the history service logged persistent store operation failures.

Could you provide more info on this? I'm interested in understanding the errors you got.

andyng · December 6, 2024, 5:19pm

Thank you @tihomir! The official Temporal site says “Temporal’s Multi-Cluster Replication feature is considered experimental”. Is that true, or is it mature enough for a self-hosted Temporal server?

Also, suppose two clusters are set up as primary and standby. When an automatic failover happens, the standby cluster becomes the new primary. In this scenario, is there any chance that requests from both clusters are sent to the same database at some point, which could cause inconsistency in the DB?
