Event corruption with child workflows? (Java SDK)

Greetings. We’ve run across a problem you might find interesting. If we launch N child workflows, (i) wait for them all to start, and then (ii) wait for them all to complete, the SDK reports event corruption during processing. The specific exception is:


Workflow code below. Note that the child workflow does nothing. I can upload the full Spring project source if it’s helpful.

We’re using Java SDK 1.4.0 against Temporal 1.12.3 (latest docker-compose, using MySQL). The workflow launches 200 child workflows. We’ve also seen the problem when launching only 50 child workflows if those workflows do something else (e.g. send signals), but figured you’d want the simpler code.

    final int numChildren = 200;
    final Map<String, Promise<Void>> childWorkflows = new HashMap<>();
    final List<Promise<WorkflowExecution>> childStartPromises = new ArrayList<>();

    // Launch all child workflows.
    for (int i = 0; i < numChildren; i++) {
        final String childWorkflowId = UUID.randomUUID().toString() + "-child-" + i;

        final ChildWorkflow child =
                Workflow.newChildWorkflowStub(
                        ChildWorkflow.class,
                        ChildWorkflowOptions.newBuilder().setWorkflowId(childWorkflowId).build());

        childWorkflows.put(childWorkflowId, Async.procedure(child::execute));
        childStartPromises.add(Workflow.getWorkflowExecution(child));
    }

    // Wait for all child workflows to start.
    Promise.allOf(childStartPromises).get();

    // Wait for all child workflows to complete.
    final List<Promise<Void>> childEndPromises = new ArrayList<>(childWorkflows.values());
    Promise.allOf(childEndPromises).get();

Thanks - any assistance would be much appreciated!


Hi @sdonovan , looking at the workflow code, it does not look deterministic. Specifically, java.util.UUID.randomUUID() is not allowed in workflow code (see the workflow implementation constraints in the docs). You should use Workflow.randomUUID() instead.
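To see why this breaks determinism: java.util.UUID.randomUUID() returns a fresh value on every call, so replayed workflow code computes different IDs than the original execution did, and the recorded history no longer matches. A minimal plain-Java illustration (no Temporal dependency, just showing the non-repeatability):

```java
import java.util.UUID;

public class UuidReplayDemo {
    public static void main(String[] args) {
        // Two calls never yield the same value, so a workflow that calls this
        // during replay produces different IDs than it did the first time.
        UUID first = UUID.randomUUID();
        UUID second = UUID.randomUUID();
        System.out.println(first.equals(second)); // prints false
    }
}
```

Workflow.randomUUID() avoids this by recording the generated value in the workflow history and replaying the same value.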



Another thing to look into is looping over unordered collections. This can also cause non-deterministic behavior during replay. Make sure that you use ordered collections.
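As a plain-Java illustration of the ordering point: java.util.HashMap makes no guarantee about iteration order, while java.util.LinkedHashMap always iterates in insertion order, so a replayed loop visits entries in the same sequence as the original execution:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IterationOrderDemo {
    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order on every run, which keeps
        // loops over it deterministic across workflow replays.
        Map<String, Integer> ordered = new LinkedHashMap<>();
        ordered.put("child-0", 0);
        ordered.put("child-1", 1);
        ordered.put("child-2", 2);
        StringBuilder visited = new StringBuilder();
        for (String key : ordered.keySet()) {
            visited.append(key).append(' ');
        }
        System.out.println(visited.toString().trim()); // prints child-0 child-1 child-2
    }
}
```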

Some other rules that don’t apply to your code, but worth adding:

  • Don’t use explicit synchronization in your workflow code.

  • You can use non-static fields in your workflow definition without having to worry about isolation issues.

  • For static fields, use io.temporal.workflow.WorkflowLocal or io.temporal.workflow.WorkflowThreadLocal, depending on your use case. You can use atomic variables, but they are not really needed.

  • Don’t use synchronized lists, as that will break workflow determinism.
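For intuition on the WorkflowThreadLocal bullet above, a plain java.lang.ThreadLocal shows the isolation idea: each thread sees its own copy of the value. (This sketch uses standard ThreadLocal only as an analogy; in workflow code you would use io.temporal.workflow.WorkflowThreadLocal instead.)

```java
public class ThreadLocalDemo {
    // Each thread gets its own copy of this value;
    // io.temporal.workflow.WorkflowThreadLocal plays the same role
    // for workflow threads.
    private static final ThreadLocal<Integer> counter = ThreadLocal.withInitial(() -> 0);

    public static void main(String[] args) throws InterruptedException {
        counter.set(5);
        Thread other = new Thread(() -> {
            // The other thread starts from its own initial value, not 5.
            System.out.println("other thread sees " + counter.get());
        });
        other.start();
        other.join();
        System.out.println("main thread sees " + counter.get());
    }
}
```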

Just replaced HashMap with LinkedHashMap, used Workflow.randomUUID(), and… it’s working.

Thank you!

Glad it’s working. I also think you can simplify your code:

    final int numChildren = 200;
    List<Promise<Void>> childPromises = new ArrayList<>();

    // Launch all child workflows.
    for (int i = 0; i < numChildren; i++) {
      final String childWorkflowId = Workflow.randomUUID().toString() + "-child-" + i;

      final ChildWorkflow child =
          Workflow.newChildWorkflowStub(
              ChildWorkflow.class,
              ChildWorkflowOptions.newBuilder().setWorkflowId(childWorkflowId).build());

      childPromises.add(Async.procedure(child::execute));
    }

    // Wait for all child workflows to complete.
    Promise.allOf(childPromises).get();

The parent workflow only needs to wait until all children have completed in this case.
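One way to do that waiting is Promise.allOf(childPromises).get(), whose shape is analogous to java.util.concurrent.CompletableFuture.allOf: collect one future per child, then block once until every one is done. A plain-Java sketch of the same pattern (an analogy only — workflow code must use Temporal's deterministic Promise, not CompletableFuture):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AllOfDemo {
    public static void main(String[] args) {
        // Collect one future per child, then block until every one is done --
        // the same shape as Promise.allOf(childPromises).get() in workflow code.
        List<CompletableFuture<Void>> children = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            final int id = i;
            children.add(CompletableFuture.runAsync(
                () -> System.out.println("child " + id + " done")));
        }
        CompletableFuture.allOf(children.toArray(new CompletableFuture[0])).join();
        System.out.println("all children completed");
    }
}
```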

Thanks, that’s OK, but in our case we specifically need to know when the child workflows have started, so that we can send signals to them.

I am getting a different error with the same scenario for the code below. I am using plain lists to collect the executions and workflow results, so I hope determinism is not a problem here.

List<Promise<WorkflowExecution>> executionResults = new ArrayList<>();
List<Promise<Void>> results = new ArrayList<>();

for (int i = 0; i < 150; i++) {
    ChildWorkflow childWf =
            Workflow.newChildWorkflowStub(ChildWorkflow.class, ChildWorkflowOptions.newBuilder()
                    .build());

    results.add(Async.procedure(childWf::execute));
    executionResults.add(Workflow.getWorkflowExecution(childWf));
}

// Wait for all child workflows to get spawned.
Promise.allOf(executionResults).get();

// Wait for all child workflows to complete.
Promise.allOf(results).get();
If you examine the events, after an elapsed time of 6s you can see a WorkflowTaskTimedOut with timeout type ScheduleToStart.
I believe this belongs to Promise.allOf(executionResults).get();

What is the error you are getting? ParentClosePolicy doesn’t apply here as you are waiting for all child workflows to complete.

I am seeing a WorkflowTaskTimedOut with timeout type ScheduleToStart.

Is this the last event in the history?

No, it happens in an event at around 6s

It is benign then. Did the workflow complete OK?

Workflow completed all okay.

I don’t know if it is by design, but in this case the task is scheduled, fails, and is rescheduled again until it succeeds. Just wanted to see if this is normal.

Yes, it is by design. It looks like the worker had failed or was busy, so the workflow task that was scheduled just for that worker wasn’t picked up fast enough and was rescheduled to be picked up by another worker. See the post for the details of this mechanism.


“When a task in a host-specific task queue times out, it is immediately rescheduled to a shared task list for other hosts to pick up.”
Now all the puzzle pieces fit for me. Thanks very much.
Since there was a task failure in the sticky queue, Temporal rescheduled and replayed the workflow, which had a determinism bug that caused the other issue.


I did a sysout and saw that the workflow execution is indeed replayed.
Thinking about it, it was just my local Mac executing this workflow; the worker was not too busy. If this is going to be a common case, the worker cache will be under-utilized.

It is certainly not a common case unless something is broken in your local setup.

This is what I am trying with docker-compose v1.12; not sure what is wrong in the setup. If you get a chance, please give it a try. It seems to me that the Promise.get() is scheduled early in the chain of events, and a timeout here is inevitable unless Promise.get() is scheduled at the appropriate time.

I’m able to run the code you posted without any problem.