What is the difference between relying on Cassandra data replication and using a Temporal XDC active/passive setup? For example, if the Temporal cluster in one region goes down and we have a standby cluster in another region connected to a passive database that holds the replicated data, shouldn’t the workflows resume from where they were before the cluster went down? (This assumes we can also flip the Temporal cluster URL in the workers so that they connect to the passive cluster.) Are there other factors, such as caching in the cluster or in clients, that might cause issues if we rely on such a setup?
Cassandra’s replication is not consistent: a replica can lag behind the writes that have already been acknowledged. Temporal cannot operate if its database is in an inconsistent state. So it is not possible to use Cassandra data replication with Temporal in this way.
Given that the replication is eventually consistent, what would be the side effects of flipping clients to the new cluster? For example, if we make sure all the data has been replicated and only then start the DR Temporal cluster pointing at the replicated database, should that work and resume workflows from where they were?
If you are 100% sure that all data is replicated, then it might work, though that is not guaranteed. But if any of the data is not replicated, the cluster will get corrupted without any way to recover.
Does changing the consistency level in client.go (temporalio/temporal on GitHub, master branch) from “LOCAL_QUORUM” (the default, set in persistence.go in the same repository) to “QUORUM” ensure that data is replicated to all DCs?
If yes, what would be the impact on Temporal performance? I guess it would not be good, since we would add latency and increase the number of writes.
Thanks.
Even if it works, the performance will be really bad, as Cassandra lightweight transactions need multiple round trips for a single update.
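To make the LOCAL_QUORUM vs QUORUM question concrete, here is a rough sketch of the replica-acknowledgement arithmetic that Cassandra's documentation describes for these levels. The two-DC topology, the replication factor of 3 per DC, and the helper names are assumptions for illustration, not Temporal's actual configuration, and the sketch ignores the serial consistency used by lightweight transactions.

```python
# Replica-acknowledgement arithmetic for Cassandra write consistency levels.
# Topology and replication factors are assumed values, not Temporal defaults.

def quorum(n: int) -> int:
    """Majority of n replicas."""
    return n // 2 + 1


def required_acks(level: str, rf_by_dc: dict, local_dc: str) -> dict:
    """Minimum acknowledgements a write needs, keyed by the scope they come from."""
    if level == "LOCAL_QUORUM":
        # Majority of replicas in the coordinator's own DC only.
        return {local_dc: quorum(rf_by_dc[local_dc])}
    if level == "QUORUM":
        # Majority of all replicas, counted across DCs without regard to location.
        return {"any_dc": quorum(sum(rf_by_dc.values()))}
    if level == "EACH_QUORUM":
        # Majority of replicas in every DC.
        return {dc: quorum(rf) for dc, rf in rf_by_dc.items()}
    raise ValueError(f"unsupported level: {level}")


def min_remote_acks_under_quorum(rf_by_dc: dict, local_dc: str) -> int:
    """Fewest remote acks a QUORUM write can succeed with when all local replicas respond."""
    return max(0, quorum(sum(rf_by_dc.values())) - rf_by_dc[local_dc])


topology = {"dc1": 3, "dc2": 3}  # assumed: two DCs, RF 3 each
print(required_acks("LOCAL_QUORUM", topology, "dc1"))  # {'dc1': 2}
print(required_acks("QUORUM", topology, "dc1"))        # {'any_dc': 4}
print(required_acks("EACH_QUORUM", topology, "dc1"))   # {'dc1': 2, 'dc2': 2}
print(min_remote_acks_under_quorum(topology, "dc1"))   # 1
```

Under these assumptions, a QUORUM write can succeed with a single acknowledgement from dc2, which is fewer than the 2 replicas a LOCAL_QUORUM read in dc2 would consult. So QUORUM by itself would not guarantee that a standby cluster reading locally sees every write, while every write would still pay for a cross-DC round trip.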
To summarize my understanding:
- Relying on async DB replication for a multi-region active/passive Temporal solution might not be the way to go, since replication delays might corrupt the cluster. It might still work if we can somehow ensure that all data has been replicated across regions before starting to use the cluster on the other side (to be explored).
- Sync replication has its own challenges with write performance, which might drastically reduce the Temporal cluster’s throughput.
A follow-up question: how do users typically handle region failures in Temporal? Since Temporal’s XDC is an experimental feature, not everyone may want to use it right now.
Say that for our use case we have a backup region where we can trigger fresh workflows if a region goes down. Assume that re-triggering workflows that were in flight is not a concern: we would write idempotent activities, so duplicate invocations of workflows or activities are harmless if the same workflow is triggered twice. One big issue with this approach is how to handle dynamically scheduled workflows or activities that are waiting to execute at some later point, since the fresh cluster is not aware of them without any replication in place. Any suggestions or patterns for such scenarios?
The simplest approach is to have two separate namespaces in different regions. If one region is down, switch starting workflows to the other. Once the first one comes back, it will continue the workflows it had before. This approach works fine for fully idempotent activities and short workflows.
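A minimal sketch of that routing, assuming the caller holds one "starter" per region. The starter callables, the RegionUnavailable exception, and the region names are hypothetical stand-ins for whatever wraps the real Temporal client call in each region; nothing here is Temporal SDK API.

```python
from typing import Callable, Sequence, Tuple


class RegionUnavailable(Exception):
    """Hypothetical error a starter raises when its region cannot accept new workflows."""


# A starter takes (workflow_id, payload) and returns a run identifier.
# In real code it would wrap a client call against that region's namespace.
Starter = Callable[[str, dict], str]


def start_with_failover(
    starters: Sequence[Tuple[str, Starter]], workflow_id: str, payload: dict
) -> Tuple[str, str]:
    """Try each region in order and return (region, run_id) from the first that accepts."""
    errors = []
    for region, start in starters:
        try:
            return region, start(workflow_id, payload)
        except RegionUnavailable as exc:
            # Remember why this region was skipped, then fall through to the next one.
            errors.append(f"{region}: {exc}")
    raise RegionUnavailable("; ".join(errors))


# Usage with stand-in starters: the primary is down, the backup accepts.
def primary(workflow_id: str, payload: dict) -> str:
    raise RegionUnavailable("primary region is down")


def backup(workflow_id: str, payload: dict) -> str:
    return f"backup-run-for-{workflow_id}"


region, run_id = start_with_failover(
    [("region-a", primary), ("region-b", backup)], "order-42", {"amount": 10}
)
print(region, run_id)  # region-b backup-run-for-order-42
```

Because the two namespaces share no state, a workflow started in the backup region may duplicate one that is still pending in the failed region, which is why this only suits fully idempotent activities. Timers and scheduled work in the failed region stay there until it comes back.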