Postgres stretched cluster along with Temporal stretched cluster across 2 DCs with ~7ms latency

Dear Temporal team,

I went through a lot of threads related to cross-DC setups but still need some advice…

Let me start with 2 requirements/constraints:

  • data loss is not acceptable, so data consistency is the priority
  • RTO has to be close to 0, but it is still less important than consistency

Given that:

  • we have 2 datacenters (not in public cloud)
  • the latency between our 2 DCs is less than 10ms (~6ms avg)
  • a Postgres stretched cluster with synchronous replication is available (we use a 3rd DC for etcd to provide the quorum and avoid split brain)

My understanding is that:

  • Because of the strong consistency requirement, multi-cluster replication is not an option
  • It is NOT possible to have 2 separate Temporal clusters (both up, whether active/active or active/passive) sharing the same database

So, we would like to setup a Temporal stretched cluster…

In this scenario, we have a single Temporal Cluster spanning across our 2 DCs.
The whole cluster relies on our Postgres stretched cluster for persistence (it would also work with Cassandra if need be).

Question:

  1. Is this a viable option?
  2. With only 2 DCs for Temporal, is it split-brain resistant? In this case, I am assuming that the DB is still up and running but the communication between the Temporal nodes in the two DCs is broken, while the nodes on both sides can still read/write to the DB.
    Would you have an idea of what would happen…
  3. If not split-brain resistant, is there a way to make it split-brain resistant with 2 DCs, or any thoughts on how to mitigate the risk?

It seems that this is what’s discussed in this thread (not 100% sure): temporal-with-mysql-how-to-survive-master-slave-failovers

Many thanks in advance for your support.
Best regards, Seb.

I don’t know anything about Postgres synchronous replication. But my understanding of sync replication is that if any of the nodes is unavailable, then the whole cluster is unavailable. Is that something you really want?

That’s not the way it’s configured on our end. It’s a master-slave setup: a single master with multiple replicas. As long as there are enough nodes to acknowledge the write, the transaction completes. Cross-DC replication is queued in case of error.
See: GitHub - zalando/patroni: A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes
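
For completeness, here is a minimal sketch (in Go, with hypothetical connection details and replica names) of how this quorum-commit behaviour can be observed from the application side: Postgres exposes the configured synchronous standbys and their current state via synchronous_standby_names and pg_stat_replication.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Assumption: the DSN points at the Patroni/HAProxy endpoint that always
	// routes to the current primary.
	db, err := sql.Open("postgres", "postgres://temporal:secret@patroni-endpoint:5432/temporal?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Which standbys must acknowledge a commit, e.g. "ANY 1 (dc2_replica1, dc2_replica2)".
	var syncNames string
	if err := db.QueryRow("SHOW synchronous_standby_names").Scan(&syncNames); err != nil {
		log.Fatal(err)
	}
	fmt.Println("synchronous_standby_names:", syncNames)

	// Current replication state as seen from the primary.
	rows, err := db.Query("SELECT application_name, sync_state FROM pg_stat_replication")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var name, state string
		if err := rows.Scan(&name, &state); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("standby %s: %s\n", name, state) // "sync", "quorum" or "async"
	}
}
```
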
Seb.

  1. Is this a viable option?

I don’t think so. Temporal assumes that any of the DB nodes can take update traffic. It looks like Postgres synchronous replication doesn’t support this.

  2. With only 2 DCs for Temporal, is it split-brain resistant? In this case, I am assuming that the DB is still up and running but the communication between the Temporal nodes in the two DCs is broken, while the nodes on both sides can still read/write to the DB.
    Would you have an idea of what would happen…

I have an idea. You will end up with a corrupted DB as soon as the connection between the nodes is reestablished.

  3. If not split-brain resistant, is there a way to make it split-brain resistant with 2 DCs, or any thoughts on how to mitigate the risk?

I don’t think it is possible.

Thanks Maxime,

I don’t think so. Temporal assumes that any of the DB nodes can take update traffic. It looks like Postgres synchronous replication doesn’t support this.

All the Temporal nodes would be connected to the same master. When a new master is elected, all the existing connections are closed and each Temporal node would reconnect to the new master.

What we would need is a “Patroni-like” mechanism for Temporal to activate/passivate the Temporal nodes in the different DCs automatically. The master DC could be elected (using etcd, for instance), and this “Patroni-like” layer would start/stop the Temporal nodes in each DC according to the leader election result…
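
As a rough illustration of what that layer could look like, here is a sketch using the etcd v3 election API (the endpoints, TTL, and the startTemporal/stopTemporal hooks are hypothetical): each DC would run this small sidecar, only the election winner runs the local Temporal services, and losing the lease (e.g. during a partition) shuts them down again.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-dc3:2379"}, // assumed etcd endpoint in the 3rd DC
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session ties leadership to a lease; if this DC loses connectivity
	// to etcd, the lease expires and leadership can move to the other DC.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/temporal/active-dc")

	// Blocks until this DC becomes the leader.
	if err := election.Campaign(context.Background(), "dc1"); err != nil {
		log.Fatal(err)
	}
	log.Println("this DC is now active: starting Temporal services")
	startTemporal()

	// If the session is lost (network partition, etcd outage), stop the
	// local Temporal services so only one side ever runs against the DB.
	<-session.Done()
	log.Println("lost leadership lease: stopping Temporal services")
	stopTemporal()
}

func startTemporal() { /* hypothetical hook: start local Temporal services */ }
func stopTemporal()  { /* hypothetical hook: stop local Temporal services */ }
```
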

Conclusion:
We have only 2 DCs (with low latency) at the moment, and because of the potential split-brain issue, the remaining option is to go active/passive for Temporal on top of our PG stretched cluster for persistence.
We will start by failing over Temporal manually in case of a disaster or a big crash.
We will work internally to see whether we have the capacity, and how best we can automate the failover on top of etcd…

Seb.

I see. A setup that keeps only one of the clusters running at a time on top of a fully consistent, replicated DB sounds feasible.