Hi there,
We are running self-hosted Temporal (server version 1.20.4) using the Java SDK (version 1.20.0).
For one of our workflows, I have been encountering “Potential deadlock detected” exceptions in our production environment (not while debugging).
The workflow method looks roughly like this. Note that it spawns the child workflows in a loop, then waits for them all to finish. In our production environment there are 75 children at the moment:
List<MyJob> jobs = activities.getMyJobs();
List<Promise<MyResult>> promiseList = new ArrayList<>();
for (MyJob job : jobs) {
    MyChild child = Workflow.newChildWorkflowStub(MyChild.class, ChildWorkflowOptions.newBuilder()...);
    promiseList.add(Async.function(child::execute, job));
}
// Wait for all the children to finish
Promise.allOf(promiseList).get();
I eventually determined that the root cause is that Temporal attempts to spawn all the child workflows before emitting any of the events to the workflow’s event log. The spawns are too slow, though, so it was only spawning ~50 of the 75 children before hitting the deadlock detection timeout.
Example screenshot from prod (we had increased the timeout to 5 seconds at that time):
I was able to work around this by adding a Workflow.sleep(1); call just after spawning each child, which forced Temporal to emit the child-spawn workflow events incrementally, interleaved with timer events of course:
(Apologies, will upload the second screenshot in a comment. New users are only permitted to upload 1 screenshot per post.)
I deployed to prod and verified that it fixed the problem.
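For reference, here is a minimal sketch of the shape of that workaround, written against plain JDK types so it can be run without the Temporal SDK. The class name IncrementalSpawner and the method spawnAll are hypothetical names made up for this illustration. In the real workflow, the spawn function would be job -> Async.function(child::execute, job) and the yield point would be () -> Workflow.sleep(1).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of the Temporal SDK.
public class IncrementalSpawner {

    /**
     * Starts one unit of work per job and runs yieldPoint after each start.
     * In workflow code (assumption): spawn = job -> Async.function(child::execute, job)
     * and yieldPoint = () -> Workflow.sleep(1).
     */
    public static <J, P> List<P> spawnAll(List<J> jobs, Function<J, P> spawn, Runnable yieldPoint) {
        List<P> started = new ArrayList<>();
        for (J job : jobs) {
            started.add(spawn.apply(job)); // start the child for this job
            yieldPoint.run();              // give up control before starting the next one
        }
        return started;
    }

    public static void main(String[] args) {
        // Stand-ins for the child start and the 1 ms sleep, recording call order.
        List<String> trace = new ArrayList<>();
        List<String> handles = spawnAll(
                List.of("a", "b", "c"),
                job -> { trace.add("spawn:" + job); return "handle-" + job; },
                () -> trace.add("yield"));
        System.out.println(trace);
        System.out.println(handles);
    }
}
```

The returned list plays the role of promiseList above, so the workflow would still finish with Promise.allOf(promiseList).get();.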
My question is: is there a better way to handle this?