Routing between Temporal clusters: is this possible?

Greetings. Hopefully I can explain this series of questions clearly.

Imagine two namespaces, N1 and N2, with workflows in each. Workflows in N1 start child workflows in N2, but not the reverse. To start a child workflow in N2 from N1, calling from within an activity, we (i) set the WorkflowClientOptions to the target namespace (N2) and (ii) set the correct task list when instantiating the workflow. When the workflow in N2 completes, an event is written to the parent workflow in N1. I assume this works.

Is it possible to run N1 and N2 on separate instances of Temporal (I1 and I2), i.e. as a cluster, using Kafka or something similar between them (see diagram attached)? Specifically, from N1, can I start a workflow in N2 and get the usual event notification when it completes?

Going further: imagine the activity in I1 that starts the workflow in I2 calls doNotComplete() and sends its activity task token to the workflow in I2, either as workflow input or as a signal. When the task token is completed within I2, does the completion get routed to the originating activity in I1?

On the diagram, I’m assuming the red-line use cases are not supported in the cluster case (I’m assuming call routing is done by namespace, if it is supported at all).

Many thanks!

Sean

In the future, we plan to support child workflow and activity invocations directly between clusters.

This is not directly supported at the moment. The recommended workaround is to start the workflow in the other cluster through an activity, and to report its completion from an activity executed by the “child” workflow that signals the original workflow.
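To make the workaround concrete, here is a toy, plain-Java model of the message flow between two independent installations. FakeCluster, startChildActivity and notifyParentActivity are hypothetical stand-ins, not Temporal SDK calls; in a real setup each would be backed by a client for the relevant cluster and run as ordinary activities. Passing the parent's workflow ID as the child's input is also an assumption, made so the child knows whom to signal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CrossClusterSignalSketch {

    /** Toy stand-in for one independent Temporal installation (NOT the Temporal API). */
    static class FakeCluster {
        final String name;
        final Map<String, String> inputs = new HashMap<>();        // workflowId -> input
        final Map<String, List<String>> inboxes = new HashMap<>(); // workflowId -> signals

        FakeCluster(String name) {
            this.name = name;
        }

        void startWorkflow(String workflowId, String input) {
            if (inputs.containsKey(workflowId)) {
                throw new IllegalStateException(name + ": already started " + workflowId);
            }
            inputs.put(workflowId, input);
            inboxes.put(workflowId, new ArrayList<>());
        }

        void signal(String workflowId, String payload) {
            List<String> inbox = inboxes.get(workflowId);
            if (inbox == null) {
                // A cluster only knows its own workflows; nothing is routed elsewhere.
                throw new IllegalStateException(name + ": unknown workflow " + workflowId);
            }
            inbox.add(payload);
        }
    }

    // Hypothetical short activity run by the parent in I1: starts the "child" in I2,
    // passing the parent's workflow ID as input so the child knows whom to signal.
    static void startChildActivity(FakeCluster remote, String childId, String parentId) {
        remote.startWorkflow(childId, parentId);
    }

    // Hypothetical activity run by the "child" in I2 when it finishes: signals the
    // parent in I1. This is the step that needs network access from I2 to I1.
    static void notifyParentActivity(FakeCluster origin, FakeCluster self,
                                     String childId, String result) {
        String parentId = self.inputs.get(childId);
        origin.signal(parentId, "childCompleted:" + result);
    }

    public static void main(String[] args) {
        FakeCluster i1 = new FakeCluster("I1");
        FakeCluster i2 = new FakeCluster("I2");

        i1.startWorkflow("parent-1", "");
        startChildActivity(i2, "child-1", "parent-1");
        notifyParentActivity(i1, i2, "child-1", "ok");

        System.out.println(i1.inboxes.get("parent-1")); // prints [childCompleted:ok]
    }
}
```

The point of the model is that neither side relies on the server to route anything between I1 and I2: the parent's activity talks to I2 directly, and the child's activity talks back to I1 directly.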

I would recommend using an activity to start the “child” and having the child signal back on completion, rather than completing an asynchronous activity. The reason is timeouts and retries. If the remote cluster is unreachable for a short period, you want to retry the StartWorkflowExecution call almost immediately. With an asynchronous activity, the timeout has to be as long as the whole “child” workflow execution, so a retry is not going to happen for a long time.
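A rough plain-Java model of the timeout argument, with assumed numbers: a 30 s outage of the remote cluster, a 10 s timeout for an activity that only calls StartWorkflowExecution, a 24 h timeout for an asynchronous activity that must outlive the child, and a retry policy with a 1 s initial interval, a backoff coefficient of 2 and a 100 s cap. It also assumes the worst case, where a lost request is only noticed when the attempt's timeout fires (no heartbeat, no fast error). The method name successfulAttemptStart is hypothetical.

```java
import java.time.Duration;

public class CrossClusterRetrySketch {

    /**
     * Returns when the first successful attempt starts, under a simplified model:
     * every attempt issued before outageEnd is lost, the loss is detected only when
     * the attempt's timeout expires, and retries back off exponentially up to a cap.
     */
    static Duration successfulAttemptStart(Duration timeout, Duration outageEnd,
                                           Duration initialInterval, double backoff,
                                           Duration maxInterval) {
        Duration attemptStart = Duration.ZERO;
        Duration interval = initialInterval;
        while (attemptStart.compareTo(outageEnd) < 0) {
            // Attempt is lost; the next one starts after the timeout plus the backoff.
            attemptStart = attemptStart.plus(timeout).plus(interval);
            long nextMillis = (long) (interval.toMillis() * backoff);
            interval = Duration.ofMillis(Math.min(nextMillis, maxInterval.toMillis()));
        }
        return attemptStart;
    }

    public static void main(String[] args) {
        Duration outage = Duration.ofSeconds(30); // assumed outage of the remote cluster
        Duration initial = Duration.ofSeconds(1);
        Duration max = Duration.ofSeconds(100);

        // Short activity that only starts the "child" (assumed 10 s timeout).
        Duration shortStart =
                successfulAttemptStart(Duration.ofSeconds(10), outage, initial, 2.0, max);
        // Asynchronous activity whose timeout covers the whole "child" run (assumed 24 h).
        Duration asyncStart =
                successfulAttemptStart(Duration.ofHours(24), outage, initial, 2.0, max);

        System.out.println("short start activity: child started at " + shortStart.getSeconds() + " s");
        System.out.println("async activity: child started at " + asyncStart.getSeconds() + " s");
    }
}
```

Under these assumptions the start-only activity gets the child running at 37 s, while the asynchronous activity only retries after 86401 s, i.e. a little over a day.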

Ah, I get it: we treat them as totally separate installations, not clustered at all, and communicate with async/signals. The downside is that activity code in I2 needs URL/port access to I1, though maybe we can find a solution to that. If the cluster case were supported, communication would be limited to the Temporal instances themselves.

Thank you!

Sean


You can always move those specific activities into their own binary and make them part of the core cluster install.

Yup, that’s a good idea!