Hi Team, I am trying to deploy the Temporal server with fault tolerance, so that if any one of my data centers goes down, the other data center can keep serving Temporal.
Question 1: I have a question regarding the database. I am doing an identical deployment in both DCs; each DC will have a complete Temporal server and its services, deployed using the Helm chart. Can I use a single central Postgres DB for both Temporal servers?
Below is the design/deployment I can do. My concern is whether Temporal will have any issue with both deployments using the same Postgres, since it is the single source of truth.
Note: DC1 and DC2 cannot communicate with each other, as they are in different data centers.
Question 2: I do see there is discussion around this saying people should not use a common Postgres DB, because the different clusters will try to steal shard ownership from each other and make everything slow. So do you suggest a separate Postgres DB for each cluster? But in that case, if I have 5 clusters, we end up with five databases.
Or, if you say to use a shared Postgres per cluster, that means we have to create as many Postgres instances as we have Temporal clusters.
I don’t think this is going to work, and even if it did, it won’t be very performant. Now, on fault tolerance: I’d say your biggest concern when it comes to fault tolerance is your data (Temporal components are basically stateless). So any solution around fault tolerance should take your data into account.
For that reason, each Temporal instance should have its own datastore. I’d also look into Temporal replication. You could have one set of namespaces active on one cluster and standby on the other (and vice versa). I think that’s the preferred setup when it comes to fault-tolerance.
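If it helps, here is roughly what that looks like from the Go SDK: a minimal sketch of registering a global namespace that is active in one cluster and standby in the other. The cluster names (dc1, dc2), frontend address, namespace name, and retention period are all assumptions for illustration, and it presumes multi-cluster replication is already enabled and the clusters are connected; exact field types also shift a little between SDK versions.

```go
package main

import (
	"context"
	"time"

	replicationpb "go.temporal.io/api/replication/v1"
	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
	"google.golang.org/protobuf/types/known/durationpb"
)

func main() {
	// Connect to the frontend of the cluster that should be active for this namespace.
	nsClient, err := client.NewNamespaceClient(client.Options{HostPort: "dc1.example.internal:7233"})
	if err != nil {
		panic(err)
	}
	defer nsClient.Close()

	// Register a global namespace: active in dc1, replicated to dc2 as standby.
	err = nsClient.Register(context.Background(), &workflowservice.RegisterNamespaceRequest{
		Namespace:         "orders",
		IsGlobalNamespace: true,
		ActiveClusterName: "dc1",
		Clusters: []*replicationpb.ClusterReplicationConfig{
			{ClusterName: "dc1"},
			{ClusterName: "dc2"},
		},
		WorkflowExecutionRetentionPeriod: durationpb.New(72 * time.Hour),
	})
	if err != nil {
		panic(err)
	}
}
```

You’d register a second set of namespaces the other way around (active on dc2) to get the “and vice versa” part.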
So what you are saying is an active-passive deployment, if I am not wrong. But I don’t see documentation on how to achieve it.
Maybe what you are saying is: in one cluster I have everything set up and it processes all requests, while the other cluster is passive and does nothing, with a DB that we replicate using some DB tool. And when the active cluster goes down, we bring the passive cluster up through some script or manual process.
I was also reading some posts; do these posts still hold true?
And I am also reading that active-active deployment is still experimental; is this still true?
I see you mentioning that Temporal components are basically stateless.
But I see @maxim mentioned: “The history nodes are stateful. So each workflow ID maps to a specific node. So there is no way to prefer closer nodes without a complete architecture redesign.”
Right, those are the docs. They are quite sparse and indeed lacking. They will probably help you get off the ground and get initial replication going, but expect to have to dig into the source a fair bit.
I cannot speak for Temporal, but I think replication is ready for use, especially if you are on one of the later versions of Temporal. Temporal Cloud uses replication too, and I am pretty sure it is based on this same feature.
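For example, failing a namespace over is just flipping its active cluster. Here is a rough Go SDK sketch, reusing the hypothetical dc1/dc2 names and address from the earlier snippet:

```go
package main

import (
	"context"

	replicationpb "go.temporal.io/api/replication/v1"
	"go.temporal.io/api/workflowservice/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Point at a frontend that is still reachable (during a dc1 outage, that is dc2).
	nsClient, err := client.NewNamespaceClient(client.Options{HostPort: "dc2.example.internal:7233"})
	if err != nil {
		panic(err)
	}
	defer nsClient.Close()

	// Fail over: make dc2 the active cluster for the namespace.
	err = nsClient.Update(context.Background(), &workflowservice.UpdateNamespaceRequest{
		Namespace: "orders",
		ReplicationConfig: &replicationpb.NamespaceReplicationConfig{
			ActiveClusterName: "dc2",
		},
	})
	if err != nil {
		panic(err)
	}
}
```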
Right, so each history node handles a range of history shards. Workflows are assigned to shards by hashing the namespace and workflow ID, and shards are spread across nodes by consistent hashing. This is pretty much hard-wired and cannot be changed.
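To make that concrete, here is a small Go sketch of that mapping. It mirrors the shape of the server’s shard assignment (a stable hash of namespace ID and workflow ID, modulo a shard count fixed at cluster creation), but treat the exact hash function and separator as assumptions rather than the server’s authoritative code:

```go
package main

import (
	"fmt"

	farm "github.com/dgryski/go-farm"
)

// historyShardID sketches how a workflow is pinned to a history shard:
// a stable hash over (namespace ID, workflow ID), modulo the number of
// history shards, which is fixed forever at cluster creation time.
func historyShardID(namespaceID, workflowID string, numHistoryShards int) int {
	hash := farm.Fingerprint32([]byte(namespaceID + "_" + workflowID))
	return int(hash%uint32(numHistoryShards)) + 1 // shard IDs are 1-based
}

func main() {
	// The same (namespace, workflow ID) pair always lands on the same shard,
	// which is why a workflow cannot be served by a "closer" node.
	fmt.Println(historyShardID("ns-uuid-1234", "order-42", 512))
}
```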
What I meant by stateless is that there is no persistent data outside of the database. If one or more history pods die, Temporal will re-shard and distribute the work across the remaining nodes.
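A toy illustration of the shard-to-node part: ownership is a pure function of the shard ID and the current ring membership, so losing a pod just moves ownership, and the shard’s durable state is reloaded from the database. Note the real implementation is a ringpop-based consistent hash ring, not the rendezvous hash used in this stand-in:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerOf picks which history node owns a shard given current membership.
// Rendezvous (highest-random-weight) hashing: each member gets a score for
// the shard, and the highest score wins. No state lives on the node itself.
func ownerOf(shardID int, members []string) string {
	best, bestScore := "", uint32(0)
	for _, m := range members {
		h := fnv.New32a()
		fmt.Fprintf(h, "%d|%s", shardID, m)
		if score := h.Sum32(); best == "" || score > bestScore {
			best, bestScore = m, score
		}
	}
	return best
}

func main() {
	fmt.Println(ownerOf(7, []string{"history-0", "history-1", "history-2"}))

	// If a pod dies, ownership is recomputed over the survivors; only the
	// shards the dead pod owned move, and they are reloaded from the DB.
	fmt.Println(ownerOf(7, []string{"history-0", "history-2"}))
}
```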
As maxim is saying, K8s is just one way of deploying Temporal. The Temporal components themselves form a ring and distribute work amongst themselves. If you can work out the networking, nothing stops you from splitting a single Temporal cluster across K8s clusters. In practice I think this would be quite hard, and you would need to deal with latencies between the clusters, etc. I don’t believe this is a viable setup.