Replay-safe way to dynamically route activities in a multi data center setup

Problem: Replay issues caused by RPC calls inside executeActivity() of WorkflowOutboundCallsInterceptor

Current Setup

We have implemented WorkflowOutboundCallsInterceptor to override executeActivity() and dynamically determine the task queue for routing activity tasks. This is required by our multi-datacenter architecture, where activities for a user must always run in their currently assigned data center.

To determine the correct task queue, we make an RPC call inside executeActivity() to fetch the user’s current DC assignment from a source-of-truth (SOT) service.

However, since executeActivity() is invoked during replay, the RPC call is re-executed — which violates Temporal’s determinism guarantees.

Background: Multi-DC Architecture

We operate multiple Temporal clusters (one per data center) with no replication between them. Each user is assigned to a specific data center, and this assignment can change over time based on traffic distribution.

To maintain read-after-write consistency, all activities for a given user must run in their current DC — even if the workflow started in a different DC.

Example

  • A workflow starts in DC1, and activities initially run in DC1.
  • Later, the user’s assignment moves to DC2.
  • Remaining activities must now run in DC2, though the workflow itself continues running in DC1.

Worker & Task Queue Setup

Each data center runs:

  • A local Temporal cluster
  • Primary workers polling both workflow and activity queues in the local cluster
  • Cross-colo workers polling activity queues from remote clusters

Task Queues

  • DC1:
    • task-queue-dc1 — polled by local workers
    • cross-colo-task-queue-dc2 — polled by workers in DC2
  • DC2:
    • task-queue-dc2 — polled by local workers
    • cross-colo-task-queue-dc1 — polled by workers in DC1

Question:
Given the above setup, what’s the best way to:

  • Perform external lookups (like our RPC call) without breaking replay determinism?
  • Ensure that activity routing remains replay-safe?

Would wrapping the network call in Workflow.sideEffect() be appropriate? Or should we use WorkflowUnsafe.isReplaying() to guard it? Are there better alternatives?

Would appreciate any advice, patterns, or alternatives others have used in similar cross-DC setups!

Use an activity to perform the lookup.

Thanks, Maxim, for the response! I wanted to clarify a couple of things:

I came across this statement in the documentation:

“Do not ever have a Side Effect that could fail, because failure could result in the Side Effect function executing more than once. If there is any chance that the code provided to the Side Effect could fail, use an Activity.”

What are the implications if Workflow.sideEffect() re-executes due to a failure, and the wrapped code succeeds on the retry? Would that still cause issues?

Also, how does this compare to using WorkflowUnsafe.isReplaying()? Specifically, what are the risks if we wrap the lookup call so it only executes when the workflow is not replaying?

Just trying to better understand why these approaches are discouraged, and what potential problems they might they might lead to.

I tried both patterns by intentionally introducing errors in the sideEffect block and in the isReplaying() condition — in both cases, the workflow still completed successfully.

SideEffect cannot be directly retried. When it fails then the whole workflow task fails and is retried. You cannot control retry policy of workflow task failures. So we don’t recommend ever using SideEffect for calls that can fail.

WorkflowUsafe.isReplaying doesn’t help here as you need to store data loaded by the external call to be used during replay.

What is the problem of doing DC lookup in an activity?

While using a separate activity for the DC lookup is a valid and clean approach aligned with Temporal’s programming model, we’re exploring alternatives due to some practical considerations specific to our use case.

This logic needs to run before every activity in the workflow. Modeling it as a separate activity would effectively double the number of activity executions — and with that, increase the data volume stored in the database backing the Temporal service.

Currently, this dynamic routing logic is abstracted in a library alongside other setup code for Temporal, so it remains invisible to workflow authors. Exposing it as a standalone activity would surface it in the event history and could lead to confusion for consumers of the library who aren’t aware of these internals.

Our main concern is the additional data volume and workflow history verbosity. While there’s also some potential for performance overhead due to the added activities, that’s not a concern for us at this point.

Then call it directly from the workflow code, but make sure that you throw an Error (which fails the workflow task) if the API is not available. Also make sure that the latency is low to avoid tripping the deadlock detector. This assumes that this data is used only to determine the task queue in the activity options.

Yes, we use this lookup data only to determine the task queue, and it’s passed into the activity options. The lookup API is fairly fast — the P99 latency is around 7ms.

Just to clarify with a hypothetical example, did you mean Option 1 or Option 2 below?

In Option 1, the lookup API is called directly in the workflow code, so it will be re-executed during replay.
In Option 2, the call is wrapped in Workflow.sideEffect(), so it is executed only once, and the recorded result is replayed on subsequent replays — the original call won’t be made again.

In both cases, if the service throws an exception, it will fail the workflow task, and the workflow will retry indefinitely until the call succeeds.

public class RoutingService {
    public String getUserDataCenter(String userId) {
        try {
            // Make API call with fast timeout
            return externalApi.lookupUserDataCenter(userId);
        } catch (Exception e) {
            throw new RuntimeException("Routing API unavailable: " + e.getMessage(), e);
        }
    }
}

@WorkflowInterface
public interface OrderWorkflow {
    @WorkflowMethod
    void processOrder(String userId, String orderId);
}

Option 1:

public class OrderWorkflowImpl implements OrderWorkflow {
    
    private final RoutingService routingService;
    
    public OrderWorkflowImpl(RoutingService routingService) {
        this.routingService = routingService;
    }
    
    @Override
    public void processOrder(String userId, String orderId) {
        
        // First Activity - Payment Processing
        String dataCenter1 = routingService.getUserDataCenter(userId); // Direct call - executes on replay
        String taskQueue1 = getTaskQueue(dataCenter1);
        
        ActivityOptions options1 = ActivityOptions.newBuilder()
            .setTaskQueue(taskQueue1)
            .setStartToCloseTimeout(Duration.ofMinutes(2))
            .build();
            
        PaymentActivity paymentActivity = Workflow.newActivityStub(PaymentActivity.class, options1);
        String paymentResult = paymentActivity.processPayment(userId, orderId);
        
        // Second Activity - Fulfillment
        String dataCenter2 = routingService.getUserDataCenter(userId); // Direct call again - executes on replay
        String taskQueue2 = getTaskQueue(dataCenter2);
        
        ActivityOptions options2 = ActivityOptions.newBuilder()
            .setTaskQueue(taskQueue2)
            .setStartToCloseTimeout(Duration.ofMinutes(3))
            .build();
            
        FulfillmentActivity fulfillmentActivity = Workflow.newActivityStub(FulfillmentActivity.class, options2);
        fulfillmentActivity.fulfillOrder(userId, orderId, paymentResult);
    }
    
    private String getTaskQueue(String dataCenter) {
        return "order-queue-" + dataCenter;
    }
}

Option 2:

public class OrderWorkflowImpl implements OrderWorkflow {
    
    private final RoutingService routingService;
    
    public OrderWorkflowImpl(RoutingService routingService) {
        this.routingService = routingService;
    }
    
    @Override
    public void processOrder(String userId, String orderId) {
        
        // First Activity - Payment Processing
        String dataCenter1 = Workflow.sideEffect(String.class, () -> {
            return routingService.getUserDataCenter(userId); // Executes once, cached on replay
        });
        String taskQueue1 = getTaskQueue(dataCenter1);
        
        ActivityOptions options1 = ActivityOptions.newBuilder()
            .setTaskQueue(taskQueue1)
            .setStartToCloseTimeout(Duration.ofMinutes(2))
            .build();
            
        PaymentActivity paymentActivity = Workflow.newActivityStub(PaymentActivity.class, options1);
        String paymentResult = paymentActivity.processPayment(userId, orderId);
        
        // Second Activity - Fulfillment
        String dataCenter2 = Workflow.sideEffect(String.class, () -> {
            return routingService.getUserDataCenter(userId); // Executes once, cached on replay
        });
        String taskQueue2 = getTaskQueue(dataCenter2);
        
        ActivityOptions options2 = ActivityOptions.newBuilder()
            .setTaskQueue(taskQueue2)
            .setStartToCloseTimeout(Duration.ofMinutes(3))
            .build();
            
        FulfillmentActivity fulfillmentActivity = Workflow.newActivityStub(FulfillmentActivity.class, options2);
        fulfillmentActivity.fulfillOrder(userId, orderId, paymentResult);
    }
    
    private String getTaskQueue(String dataCenter) {
        return "order-queue-" + dataCenter;
    }
}

Option2 is the same as using a local activity. You are charged the same and performance difference is negligible. But retries are not as controllable.

Option1 is OK, but I would change it to not call the API during replay. Just return some bogus value as it is not used during replay anyway. Something like:

 
        String taskQueue;
        if (isReplaying()) {
           taskQueue2 = "not used";
        } else {
            String dataCenter2 = routingService.getUserDataCenter(userId); // Direct call again - executes on replay
            taskQueue2 = getTaskQueue(dataCenter2);
        }

Thank you, Maxim, for your responses and guidance — really appreciate the support!

1 Like