Active-active deployment of Temporal services on multiple Kubernetes clusters across DCs

Hello Temporal team,

I have already read the following pages:

We want to achieve high availability for Temporal, meaning we can afford to have a DC down for a certain amount of time without impacting our business.

Below is the deployment topology (an issue I detail later is shown in red):

Temporal services will be able to communicate with each other as if they were deployed in the same DC; I won't detail this part.

Now my questions:

  • Is this kind of deployment supported by Temporal?
  • If yes, can you confirm that if Kubernetes DC1 and DC2 cannot reach DC3, but DC3 is still running with its pods still connected to the database and serving traffic, no state can be corrupted? Are shards rebalanced? Can you explain what could happen if that is not the case?

Maybe this approach is not valid, and that is why What is Multi-Cluster Replication? | Temporal Documentation describes asynchronous replication with an active-passive model. If there are resources explaining why it is not valid, I would be glad if you could share them.

Regarding multi-cluster replication, is it still in experimental mode?
If still experimental, without using this feature, what would be the way to support our high availability requirement described above?
If still experimental, what is remaining before being ready for production?

Finally, I am wondering how Temporal Cloud addresses high availability. Can you share anything on this?

Thanks for your time

If every pod can talk to all other pods, including pods in other DCs, then Temporal is going to function. I cannot confirm the DB behavior in the case of outages, as it is very DB-specific. Temporal requires that the DB is fully consistent in the presence of any failure, so all replication should always be synchronous.

Some caveats with the approach you propose:

  • If Temporal pods get partitioned but can still perform DB writes, performance will be very bad, as they will steal history shards from each other on almost every request.
  • Even if all DCs are fully operational, performance might suffer, as Temporal will make multiple cross-DC calls for every update. For example, an update can land on a frontend in DC1, a history service in DC2, and a matching engine in DC3, and then be picked up by a poll request from DC1 again.
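
To illustrate the first caveat, here is a toy simulation of two partitioned history hosts fighting over one shard. It is only a sketch under stated assumptions, not Temporal's actual code: it assumes shard ownership is fenced by a version number (called `range_id` here) that is checked by a conditional write against a fully consistent DB, and all names (`ShardRow`, `simulate_partition`, the host names) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ShardOwnershipLost(Exception):
    """Raised when a write carries a stale range ID."""


@dataclass
class ShardRow:
    """Toy stand-in for one shard row in a strongly consistent database."""

    range_id: int = 0
    owner: str = ""
    events: list = field(default_factory=list)

    def acquire(self, host: str) -> int:
        # Taking ownership bumps the range ID, which fences off the previous owner.
        self.range_id += 1
        self.owner = host
        return self.range_id

    def append(self, host: str, held_range_id: int, event: str) -> None:
        # Conditional write: succeeds only if the caller's range ID is current.
        if held_range_id != self.range_id:
            raise ShardOwnershipLost(f"{host} holds stale range ID {held_range_id}")
        self.events.append(event)


def simulate_partition(requests: int) -> tuple[ShardRow, int]:
    """Alternate requests between two hosts that each believe they own the shard."""
    row = ShardRow()
    # Hypothetical hosts: one in DC1 that owns the shard, one in DC3 that has never held it.
    held = {"history-dc1": row.acquire("history-dc1"), "history-dc3": 0}
    steals = 0
    for i in range(requests):
        host = "history-dc1" if i % 2 == 0 else "history-dc3"
        try:
            row.append(host, held[host], f"event-{i}")
        except ShardOwnershipLost:
            # The host re-acquires the shard (a "steal") and retries the write.
            held[host] = row.acquire(host)
            steals += 1
            row.append(host, held[host], f"event-{i}")
    return row, steals


if __name__ == "__main__":
    row, steals = simulate_partition(6)
    print(f"steals={steals}, events={len(row.events)}, final owner={row.owner}")
```

In this model the event list stays intact because every stale write is rejected, but all requests after the first pay for an extra ownership change. If the DB were not fully consistent, two hosts could both pass the range ID check, and that is the case where state could be corrupted.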

Regarding multi-cluster replication, is it still in experimental mode?

Yes, it is still experimental.

If still experimental, without using this feature, what would be the way to support our high availability requirement described above?

There is no real solution at this point.

If still experimental, what is remaining before being ready for production?

Many of the remaining issues are related to correctness. Also, setting it up and operating it is pretty hard, and it is not thoroughly tested.
