Designing Fault-Tolerant Multi-Cluster Configuration Management with Temporal Workflow

Hello, I’m planning to use Temporal Workflow to design a workflow that maintains data consistency across Config Tables in different clusters.

In my current design, a Micro Frontend receives requests and forwards them to the Temporal Service, which starts a Workflow. The Workflow then calls each cluster’s CRUD API through Activities or ChildWorkflows.

Concept Path: Request → Micro Frontend → Temporal Service → Workflow → Call Different Cluster CRUD APIs

I have three considerations that I’d like to get architectural design advice on:

  1. When one cluster crashes, how can we ensure that the other clusters keep receiving CRUD API requests?
  2. When a cluster comes back online, how can we sequentially re-execute the Workflows?
  3. When adding a new cluster, how can we make the Temporal Service call the new cluster’s CRUD API without restarting the Temporal Service?

  1. You can call the CRUD APIs in parallel and keep retrying the requests to the cluster that is down until it comes back. Because the calls run in parallel, the retries for one cluster do not hold up the others.
  2. You don’t need to re-execute anything if the workflow keeps retrying while the cluster is down.
  3. You can pass the list of clusters as workflow input (or have an activity that returns the list of clusters to update), then execute the updates based on that list.
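
The three points above can be sketched in plain Python `asyncio`. This is only an illustration of the pattern, not Temporal SDK code: `call_crud_api`, `ClusterDown`, `FAILURES_BEFORE_RECOVERY`, and the cluster names are all hypothetical stand-ins. In a real Temporal workflow, the hand-written retry loop would typically be replaced by the Activity's retry policy, and each `call_crud_api` call would be an Activity.

```python
import asyncio


class ClusterDown(Exception):
    """Raised by the stand-in CRUD call while a cluster is unavailable."""


# Hypothetical outage model: how many calls fail before each cluster recovers.
FAILURES_BEFORE_RECOVERY = {"cluster-a": 0, "cluster-b": 3, "cluster-c": 0}


async def call_crud_api(cluster: str, change: dict, attempts: dict) -> str:
    """Stand-in for one cluster's CRUD API call (the work an Activity would do)."""
    attempts[cluster] = attempts.get(cluster, 0) + 1
    if attempts[cluster] <= FAILURES_BEFORE_RECOVERY.get(cluster, 0):
        raise ClusterDown(cluster)
    return f"{cluster}: applied {change['key']}={change['value']}"


async def update_with_retry(cluster: str, change: dict, attempts: dict,
                            delay: float = 0.01) -> str:
    """Keep retrying one cluster until it accepts the change (points 1 and 2)."""
    while True:
        try:
            return await call_crud_api(cluster, change, attempts)
        except ClusterDown:
            # Only this cluster's task waits; the other tasks are unaffected.
            await asyncio.sleep(delay)


async def update_all_clusters(clusters: list, change: dict):
    """Fan out to every cluster in the input list, in parallel (point 3)."""
    attempts: dict = {}
    results = await asyncio.gather(
        *(update_with_retry(c, change, attempts) for c in clusters)
    )
    return list(results), attempts


if __name__ == "__main__":
    clusters = ["cluster-a", "cluster-b", "cluster-c"]
    results, attempts = asyncio.run(
        update_all_clusters(clusters, {"key": "timeout", "value": 30})
    )
    print(results)
    print(attempts)
```

Because the cluster list is an argument rather than something baked into the code, adding a cluster only means passing a longer list on the next run; nothing in the sketch has to be restarted or redeployed.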