Java: Potential deadlock detected while spawning child workflows in a loop

Hi there,

We are running self-hosted Temporal (server version 1.20.4) using the Java SDK (version 1.20.0).

For one of our workflows, I have been encountering “Potential deadlock detected” exceptions in our production environment (not while debugging).

The workflow method looks roughly like this. Note that it spawns the child workflows in a loop, then waits for them all to finish. In our production environment there are 75 children at the moment:

List<MyJob> jobs = activities.getMyJobs();
List<Promise<MyResult>> promiseList = new ArrayList<>();
for (MyJob job : jobs) {
  // A new child stub per job; the options are elided here
  MyChild child = Workflow.newChildWorkflowStub(MyChild.class, ChildWorkflowOptions.newBuilder()...);
  // Start the child asynchronously and keep its result promise
  promiseList.add(Async.function(child::execute, job));
}
// Wait for all the children to finish
Promise.allOf(promiseList).get();

I eventually determined that the root cause is that Temporal attempts to spawn all the child workflows before emitting any of the events to the workflow’s event history. The spawns are too slow, though, so it was only spawning ~50 of the 75 children before running into the deadlock detection timeout.

Example screenshot from prod (we had increased the timeout to 5 seconds at that time):

I was able to work around this by adding a Workflow.sleep(1); just after spawning each child, which forced Temporal’s hand to emit the child-spawn workflow events incrementally, interleaved with timer events of course:
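A minimal sketch of the pattern behind that workaround, written as plain Java so that it runs without the Temporal SDK. The class `InterleavedSpawner`, the method `startAll`, and the `yieldEvery` parameter are hypothetical names used only for illustration; in the real workflow the `start` function would wrap `Async.function(child::execute, job)` and the yield point would be `Workflow.sleep(1)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper; not part of the Temporal SDK. */
final class InterleavedSpawner {
  private InterleavedSpawner() {}

  /**
   * Starts one unit of work per job and runs yieldPoint after every
   * yieldEvery starts, so the caller never does one long uninterrupted
   * burst of starts.
   */
  static <J, P> List<P> startAll(
      List<J> jobs, Function<J, P> start, Runnable yieldPoint, int yieldEvery) {
    if (yieldEvery < 1) {
      throw new IllegalArgumentException("yieldEvery must be >= 1");
    }
    List<P> started = new ArrayList<>(jobs.size());
    for (J job : jobs) {
      // In a workflow: create the child stub and call Async.function(child::execute, job)
      started.add(start.apply(job));
      if (started.size() % yieldEvery == 0) {
        // In a workflow: Workflow.sleep(1), which lets the pending commands be emitted
        yieldPoint.run();
      }
    }
    return started;
  }
}
```

With `yieldEvery = 1` this mirrors the workaround described above, at the cost of one timer per child. Yielding less often (a larger `yieldEvery`) should produce fewer timer events, but that variant is an assumption and was not verified in the setup described in this post.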

(Apologies, will upload the second screenshot in a comment. New users are only permitted to upload 1 screenshot per post.)

I deployed to prod and verified that it fixed the problem.

My question is: is there a better way to handle this?

Second screenshot is:

Hi @mhalverson

Starting 75 child workflows at a time shouldn’t be an issue if you are under the 4 MB gRPC limit (see System limits - Temporal Cloud | Temporal Documentation).

Do you need all the children to be scheduled at the same time? Maybe you could consider a different implementation, such as starting the child workflows in batches. See the samples-java/core/src/main/java/io/temporal/samples/batch/iterator (at main · temporalio/samples-java · GitHub) or the samples-java/core/src/main/java/io/temporal/samples/batch/slidingwindow (at main · temporalio/samples-java · GitHub) example.

Let us know if it helps.
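As a rough illustration of the batching idea only (this is not the code from the linked samples), here is a plain-Java sketch of splitting a job list into fixed-size batches. The class `Batches` and the method `partition` are hypothetical names; a workflow would start the children of one batch, wait for them, and then move on to the next batch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of the Temporal SDK or the linked samples. */
final class Batches {
  private Batches() {}

  /** Splits items into consecutive batches of at most batchSize elements. */
  static <T> List<List<T>> partition(List<T> items, int batchSize) {
    if (batchSize < 1) {
      throw new IllegalArgumentException("batchSize must be >= 1");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < items.size(); i += batchSize) {
      int end = Math.min(i + batchSize, items.size());
      // Copy the slice so each batch is independent of the source list
      batches.add(new ArrayList<>(items.subList(i, end)));
    }
    return batches;
  }
}
```

For 75 jobs and a batch size of 20, this yields four batches of 20, 20, 20, and 15 jobs.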

Hi @antonio.perez ,

Good to know that it is reasonable to start 75 child workflows. Our payloads in this case are very small, nowhere near 4MB.

Thanks for linking the sample code. Structurally, mine is identical to IteratorBatchWorkflowImpl, except without the batching, i.e. it’s 1 batch with all 75 child workflows. Unfortunately batching is not an option for my use case.

Perhaps the SDK could be changed to emit the child-spawn events incrementally, rather than spooling them and emitting them in one big group at the end. Otherwise, I will leave my existing Workflow.sleep(1) hack in place.