Possible to use Cassandra data replication for Active/Passive Temporal setup?

What is the difference between relying on Cassandra data replication and using Temporal's XDC Active/Passive setup? E.g. if the Temporal cluster in one region goes down and we have a stand-by cluster in another region connected to the passive database that has replicated data, shouldn't the workflows resume from where they were before the cluster went down? (Assuming we are also able to flip the Temporal cluster URL in the workers so that they connect to the passive cluster.) Are there any other factors, like caching in the cluster or clients, that might cause issues if we rely on such a setup?


Cassandra's cross-datacenter replication is eventually consistent, not strongly consistent. Temporal cannot operate if its database is in an inconsistent state, so it is not possible to use Cassandra data replication with Temporal.

Given that the replication is eventually consistent, what would be the side effects of flipping clients to the new cluster? E.g. if we can make sure all the data has been replicated, and only then start the DR Temporal cluster pointing to the replicated database, should that work and resume workflows from where they were?

If you are 100% sure that all data is replicated, then it might (not guaranteed) work. But if any of the data is not replicated, the cluster will get corrupted without any way to recover.

Does changing the consistency level in temporal/client.go at master · temporalio/temporal · GitHub from “LOCAL_QUORUM” (the default, see temporal/persistence.go at master · temporalio/temporal · GitHub) to “QUORUM” ensure that data is replicated to all DCs?

If yes, what would be the impact on Temporal's performance? I guess it would not be good, since we introduce extra latency and more writes.

thanks
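For context, recent Temporal server versions expose Cassandra consistency settings in the persistence section of the server config, so this could be tried without patching the source. A hedged sketch (the exact keys depend on your Temporal version; `cass-default`, the hosts, and the keyspace name are placeholders):

```yaml
persistence:
  defaultStore: cass-default
  numHistoryShards: 4
  datastores:
    cass-default:
      cassandra:
        hosts: "127.0.0.1"
        keyspace: "temporal"
        # Raising this from the default LOCAL_QUORUM to QUORUM makes each
        # write wait for a quorum across all DCs, at the cost of cross-DC
        # latency on every operation.
        consistency:
          default:
            consistency: "quorum"
            serialConsistency: "serial"
```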

Even if it works, the performance will be really bad, as Cassandra lightweight transactions need multiple round trips per single update.


To summarize my understanding:

  1. Relying on async DB replication for a multi-region active/passive Temporal solution might not be the way to go, since replication delays might corrupt the cluster. It might work if we can somehow ensure that all data has been replicated across regions before starting to use the cluster on the other side (to be explored).
  2. Sync replication has its own write-performance challenges, which might drastically reduce the Temporal cluster's throughput.

A follow-up question: how do users typically handle region failures in Temporal (since Temporal's XDC is an experimental feature, not everyone may want to use it right now)?

Say for our use case we have a backup region where we can trigger fresh workflows if a region goes down, and assume re-triggering workflows that were in flight is not a concern (we work around it by writing idempotent activities, so duplicate invocations of workflows or activities are harmless if the same workflow is triggered twice). One big issue with this approach: how would we handle dynamically scheduled workflows or activities that are waiting to execute at some later point, since without any replication the fresh cluster is not aware of them? Any suggestions or patterns for such scenarios?
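One common building block for the "re-trigger in the backup region" approach is to derive the workflow ID deterministically from the business key, so that starting the same logical workflow twice (once per region) cannot produce two independent runs once the regions are reconciled. A minimal sketch in Go; the `order-processing` prefix and key format are illustrative assumptions, and the Temporal SDK wiring (e.g. a reject-duplicate workflow ID reuse policy on start) is not shown:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// workflowID derives a stable workflow ID from a business key, so that
// re-triggering the same logical workflow in the backup region yields the
// same ID as the original start in the primary region. Paired with a
// reject-duplicate workflow ID reuse policy, the second start becomes a
// no-op instead of a duplicate execution.
func workflowID(businessKey string) string {
	sum := sha256.Sum256([]byte("order-processing:" + businessKey))
	// A short hash suffix keeps IDs readable while staying unique per key.
	return "order-processing-" + hex.EncodeToString(sum[:8])
}

func main() {
	fmt.Println(workflowID("order-12345"))
	// Same business key always maps to the same workflow ID.
	fmt.Println(workflowID("order-12345") == workflowID("order-12345"))
}
```

This does not solve the dynamically scheduled work problem by itself, but it makes blind re-triggering from an external source of truth (e.g. replaying an order feed) safe.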


The simplest approach is to have two separate namespaces in different regions. If one region is down, switch starting workflows to the other. Once the first one comes back, it will continue the workflows it had before. This approach works fine for fully idempotent activities and short workflows.
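The routing logic for this two-namespace approach can be sketched as a small failover helper on the workflow-starter side. All names here (`Region`, `pickRegion`, the endpoints and namespaces) are illustrative assumptions, not Temporal APIs; in practice `Healthy` would come from a health check against each cluster's frontend:

```go
package main

import (
	"errors"
	"fmt"
)

// Region describes one Temporal deployment: its frontend address, the
// namespace new workflows are started in, and its current health.
type Region struct {
	Name      string
	Endpoint  string
	Namespace string
	Healthy   bool
}

// pickRegion returns the region new workflows should be started in,
// preferring the primary. The primary keeps whatever workflows it was
// already running and resumes them once it recovers; only *new* starts
// are redirected.
func pickRegion(primary, backup Region) (Region, error) {
	if primary.Healthy {
		return primary, nil
	}
	if backup.Healthy {
		return backup, nil
	}
	return Region{}, errors.New("no healthy region available")
}

func main() {
	primary := Region{Name: "us-east", Endpoint: "temporal-east:7233", Namespace: "orders-east", Healthy: false}
	backup := Region{Name: "us-west", Endpoint: "temporal-west:7233", Namespace: "orders-west", Healthy: true}

	r, err := pickRegion(primary, backup)
	if err != nil {
		panic(err)
	}
	fmt.Println("routing new workflow starts to namespace", r.Namespace)
}
```

The design choice here is that failover only affects where new workflows start; nothing is migrated, which is exactly why this pattern suits short workflows with idempotent activities.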
