Temporal Multi-Server Deployment on OpenShift

Hey Team,
I have read many posts and I’m aware of the following:

  1. One Temporal client cannot connect to two active clusters (see the sketch below)
  2. Multi-cluster is experimental
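
To make point 1 concrete, here is a minimal sketch using the Go SDK (the frontend address and namespace below are hypothetical): a client is dialed against exactly one frontend address, so it always targets a single cluster.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// A Temporal client is dialed against a single frontend address;
	// there is no built-in way to point one client at two active clusters.
	c, err := client.Dial(client.Options{
		HostPort:  "temporal-frontend.site1.example:7233", // hypothetical address
		Namespace: "default",
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()
}
```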

Problem statement:
I have 2 fully isolated OpenShift clusters in 2 DCs. Each cluster runs the same microservices (workers).
I want to deploy Temporal Server (the 4 Temporal services) on both, so that each cluster has its own Temporal Server. Each Temporal service will scale differently since it is deployed independently rather than packaged in a single container. These microservices/workers and Temporal services cannot communicate with each other across DCs.

Which of the two options below is good, or is neither?

I ran a few scenarios with Set-up 1 to check cross-site processing, and the behaviour is not consistent:
if worker1 is down on site 1, worker1 on site 2 sometimes picks up and completes the workflow, and sometimes it doesn’t.

Set-up 1: Two Temporal Servers sharing one schema. If cross-site interaction happens because of the common schema, i.e. if a worker on site 2 can process a workflow that was initiated on site 1, then we don’t need to maintain a passive instance, and HA/DR is achieved straight away in case one of the Temporal Servers or workers goes down.
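
For context, in this set-up the workers on each site poll their own site’s frontend but use the same task queue name. A minimal Go SDK worker sketch follows (the workflow name, task queue, and address are hypothetical):

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// OrderWorkflow is a hypothetical workflow that both sites deploy identically.
func OrderWorkflow(ctx workflow.Context) error {
	return nil
}

func main() {
	// Each site's worker dials its own site's frontend (hypothetical address).
	c, err := client.Dial(client.Options{
		HostPort: "temporal-frontend.site2.example:7233",
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Both sites poll the same task queue name. Task dispatch goes through
	// the server that owns the workflow's history shard, which is presumably
	// why the shared-schema set-up behaves inconsistently.
	w := worker.New(c, "order-task-queue", worker.Options{})
	w.RegisterWorkflow(OrderWorkflow)

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker stopped:", err)
	}
}
```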

Set-up 2: Two Temporal Servers, each with its own schema (on the same physical DB). This will not have cross-site interaction because of the separate schemas, so a passive instance needs to be maintained in case one of the sites (either Temporal or the workers or both) goes down.

Thank you.

All Temporal cluster nodes should be able to communicate with each other directly, so your Set-up 1 is not going to work. I don’t understand the value of Set-up 2 with the same physical DB. Why do you want to have a single point of failure in a multi-DC setup? You can run two completely independent clusters, each with its own DB, in this case.

If you need the ability to continue already started workflows in case of a DC outage, then a Temporal global namespace (multi-cluster) is the only option.
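
As a hedged illustration of what a global namespace involves, here is a sketch that registers a namespace replicated to two clusters using the Go SDK’s NamespaceClient. It assumes a recent SDK/API version where durations are protobuf durations; the address, namespace name, and cluster names are placeholders that must match the clusterMetadata configuration of a multi-cluster deployment.

```go
package main

import (
	"context"
	"log"
	"time"

	replicationpb "go.temporal.io/api/replication/v1"
	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
	"google.golang.org/protobuf/types/known/durationpb"
)

func main() {
	// Dial the active cluster's frontend (hypothetical address).
	nc, err := client.NewNamespaceClient(client.Options{
		HostPort: "temporal-frontend.site1.example:7233",
	})
	if err != nil {
		log.Fatalln("unable to create namespace client:", err)
	}
	defer nc.Close()

	// Register a global namespace replicated to both clusters. The cluster
	// names are placeholders and must match the server's clusterMetadata
	// configuration, which also has to enable multi-cluster replication.
	err = nc.Register(context.Background(), &workflowservice.RegisterNamespaceRequest{
		Namespace:         "orders",
		IsGlobalNamespace: true,
		ActiveClusterName: "site1",
		Clusters: []*replicationpb.ClusterReplicationConfig{
			{ClusterName: "site1"},
			{ClusterName: "site2"},
		},
		WorkflowExecutionRetentionPeriod: durationpb.New(72 * time.Hour),
	})
	if err != nil {
		log.Fatalln("namespace registration failed:", err)
	}
}
```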

A passive instance needs to be maintained in case one of the sites (either Temporal or the workers or both) goes down.

Temporal requires strong DB consistency. If data can be lost during DB failover, then the cluster can get into a corrupted state, which is not recoverable.

Thanks @maxim for your response.

In Set-up 2, the DB is HA in a cluster across the 2 sites. It’s a dual-site master/multiple-standbys setup using synchronous and asynchronous streaming replication; one of the standbys in DR is a cascading standby.
The master/standby architecture of EDB PostgreSQL does not allow more than a single master at once.
That’s the reason to have 2 logical DBs for the 2 independent clusters.

Sorry, I’m not an expert in PostgreSQL replication. So, I don’t know if it provides the necessary consistency in case of host crashes. If it doesn’t, the cluster might get into an unrecoverable state. So please test thoroughly.

Thanks @maxim. I’m trying to understand why the clusters have to talk to each other. Isn’t the DB the source of truth for them? In Set-up 1, isn’t it similar to having multiple replicas of each service?

What if, in Set-up 1, the clusters could talk to each other with something like Skupper? Calls would only go to the other site if the site 1 Temporal service were unavailable.

The DB is the source of truth for them, but Temporal allocates shards to history hosts, so direct routing to those shards is necessary. You can watch the “Designing a Workflow Engine From First Principles” presentation at the @Scale conference, which explains the architecture of the service.
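
To illustrate the shard point, here is a simplified sketch (not Temporal’s actual hash function or routing code): every workflow maps to a fixed history shard, and any request touching that workflow must reach the history host that currently owns the shard, which is why the hosts of one logical cluster cannot be split across sites that cannot talk to each other.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a workflow to one of numShards history shards.
// Illustrative only: Temporal uses its own hash, and the shard count is
// fixed at cluster creation, but the idea is the same.
func shardFor(namespaceID, workflowID string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(namespaceID + "_" + workflowID))
	return int(h.Sum32())%numShards + 1
}

func main() {
	const numShards = 512 // hypothetical value of numHistoryShards

	// Every request for this workflow must be routed to the history host
	// that currently owns this shard, hence the need for direct
	// host-to-host communication inside a cluster.
	fmt.Println(shardFor("default", "order-12345", numShards))
}
```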