Hi everyone.
Since Temporal is becoming more critical within our organization, with increasing usage and dependency, we’re looking for a way to include our Temporal deployment in our business continuity plan by ensuring high availability.
We’ve considered several approaches, one of which is the existing multi-cluster replication feature with a standby cluster in the other DC. This comes with caveats, such as possible data loss when failing over and the need for conflict resolution.
Since resource utilization is important, instead of keeping a standby cluster running with no direct usage, we’re thinking of doing the following:
- Deploy a Cassandra cluster across two data centers with maximum consistency
- Deploy a Temporal Server in the primary data center
- If the primary site goes down, spin up a new Temporal Server in the secondary data center, connecting to the same Cassandra cluster
Since Temporal’s architecture gets its durability from the persistence store (Cassandra in this case, or possibly a highly consistent Postgres cluster), the new Temporal Server should pick up where the old one left off with no issues.
Hypothetically, such a design would ensure zero RPO, which is what we’re aiming for, while also requiring fewer resources, since no standby cluster is running all the time.
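To make "maximum consistency" a bit more concrete, here is a rough sketch of the replica arithmetic as I understand it for Cassandra's QUORUM, EACH_QUORUM and LOCAL_QUORUM levels. This is a simplified model, not Cassandra's API: the DC names (`dc1`, `dc2`), the replication factor of 3 per DC, and the helper names are all assumptions for illustration.

```python
# Simplified model of Cassandra consistency levels across two DCs.
# Assumptions (hypothetical): DC names "dc1"/"dc2", replication factor 3 each.

def quorum(replicas: int) -> int:
    """A quorum is a strict majority of the given replica count."""
    return replicas // 2 + 1


def request_succeeds(level: str, rf: dict, alive: dict, local_dc: str) -> bool:
    """Return True if enough replicas are reachable for the given level.

    rf:       replication factor per DC
    alive:    reachable replicas per DC
    local_dc: the DC of the coordinator handling the request
    """
    if level == "QUORUM":
        # Majority of all replicas, counted across every DC.
        return sum(alive.values()) >= quorum(sum(rf.values()))
    if level == "EACH_QUORUM":
        # Majority of replicas in every DC.
        return all(alive[dc] >= quorum(rf[dc]) for dc in rf)
    if level == "LOCAL_QUORUM":
        # Majority of replicas in the coordinator's DC only.
        return alive[local_dc] >= quorum(rf[local_dc])
    raise ValueError(f"unsupported level: {level}")


rf = {"dc1": 3, "dc2": 3}
primary_down = {"dc1": 0, "dc2": 3}  # whole primary DC unreachable

for level in ("QUORUM", "EACH_QUORUM", "LOCAL_QUORUM"):
    ok = request_succeeds(level, rf, primary_down, local_dc="dc2")
    print(f"{level}: {'ok' if ok else 'unavailable'}")
```

If this model is right, QUORUM and EACH_QUORUM would stop accepting requests once a whole DC is gone, while LOCAL_QUORUM would keep working but acknowledges writes before the other DC necessarily has them. Is that the trade-off we would be facing with this design?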
Would appreciate input from the Temporal team or anyone who has had their fair share of HA deployments.
Thanks