Event corruption with child workflows? (Java SDK)

sdonovan · October 16, 2021, 5:24pm

Greetings. We’ve run across a problem you might find interesting. If we launch N child workflows, (i) wait for them to start, and then (ii) wait for them to complete, the SDK reports event corruption during processing. The specific exception is:

Caused by: java.lang.IllegalStateException: COMMAND_TYPE_START_CHILD_WORKFLOW_EXECUTION doesn't match EVENT_TYPE_START_CHILD_WORKFLOW_EXECUTION_INITIATED with EventId=5

Workflow code is below. Note that the child workflow does nothing. I can upload full Spring/project source if it’s helpful.

We’re using Java SDK 1.4.0 against Temporal 1.12.3 (latest docker-compose using MySQL). It launches 200 child workflows. We’ve also seen the problem launching 50 workflows if those workflows do something else (e.g. send signals). Figured you’d want the simpler code.

    final int numChildren = 200;
    final Map<String, Promise<Void>> childWorkflows = new HashMap<>();
    final List<Promise<WorkflowExecution>> childStartPromises = new ArrayList<>();

    // Launch all child workflows.
    for (int i = 0; i < numChildren; i++) {
        final String childWorkflowId = UUID.randomUUID().toString() + "-child-" + i;

        final ChildWorkflow child =
            Workflow.newChildWorkflowStub(
                ChildWorkflow.class,
                ChildWorkflowOptions.newBuilder()
                    .setWorkflowId(childWorkflowId)
                    .setCancellationType(WAIT_CANCELLATION_COMPLETED)
                    .setTaskQueue("test")
                    .setWorkflowTaskTimeout(Duration.ofSeconds(60))
                    .build());

        childWorkflows.put(childWorkflowId, Async.procedure(child::execute));
        childStartPromises.add(Workflow.getWorkflowExecution(child));
    }

    // Wait for all child workflows to start.
    Promise.allOf(childStartPromises).get();

    // Wait for all child workflows to complete.
    final List<Promise<Void>> childEndPromises = new ArrayList<>(childWorkflows.values());
    Promise.allOf(childEndPromises).get();
}

Thanks - any assistance would be much appreciated!

Sean

tihomir · October 16, 2021, 5:40pm

Hi @sdonovan, looking at the workflow code, it does not look deterministic, specifically the call to UUID.randomUUID() used to build the child workflow ID.

See the constraints here. You should use

Workflow.randomUUID().toString()

instead.
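
For anyone landing here later, a minimal sketch of why this matters. It is a plain-Java toy model, not the Temporal SDK: childIds, seededSource, and replayMatches are hypothetical names, and the seeded generator is only an assumed stand-in for a replay-safe ID source (it is not a claim about how Workflow.randomUUID() is implemented). The idea is that replay re-runs the same workflow code and expects it to produce the same child workflow IDs that were recorded the first time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;
import java.util.function.Supplier;

// Toy model of replay, NOT the Temporal SDK. It only shows why an ID source
// must return the same values when the same code runs a second time.
class ReplayIdSketch {

    // Hypothetical stand-in for workflow code that names three children.
    static List<String> childIds(Supplier<UUID> uuidSource) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            ids.add(uuidSource.get() + "-child-" + i);
        }
        return ids;
    }

    // Hypothetical replay-safe source: a generator seeded from a fixed value,
    // so every run yields the same sequence of UUIDs.
    static Supplier<UUID> seededSource(long seed) {
        Random random = new Random(seed);
        return () -> new UUID(random.nextLong(), random.nextLong());
    }

    // True when a second run of the same code reproduces the recorded IDs.
    static boolean replayMatches(Supplier<Supplier<UUID>> sourceFactory) {
        List<String> recorded = childIds(sourceFactory.get());
        List<String> replayed = childIds(sourceFactory.get());
        return recorded.equals(replayed);
    }

    public static void main(String[] args) {
        System.out.println("UUID.randomUUID replay matches: "
            + replayMatches(() -> UUID::randomUUID));
        System.out.println("seeded source replay matches:   "
            + replayMatches(() -> seededSource(42L)));
    }
}
```

With UUID.randomUUID() the second run produces different IDs, so the replay check fails. With the seeded source both runs agree.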

tihomir · October 16, 2021, 5:47pm

Another thing to look into is looping over unordered collections. This can also cause non-deterministic behavior during replay. Make sure that you use ordered collections.

Some other rules that don’t apply to your code but just to add:

Don’t use explicit synchronization in your workflow code.
You can use non-static fields in your workflow definition without having to worry about isolation issues.
For static fields use io.temporal.workflow.WorkflowLocal or io.temporal.workflow.WorkflowThreadLocal, depending on your use case. You can use Atomic variables, but they are not really needed.
Don’t use synchronized lists, as that will break workflow determinism.
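
The point about unordered collections can be sketched in plain Java, without the SDK. IterationOrderSketch and iterationOrder are hypothetical names. The keys mimic the "<uuid>-child-<i>" IDs from the original post. Note that HashMap order is a function of the key hashes, so it is the combination of HashMap and random keys that makes the order differ between runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Plain-Java illustration, no Temporal dependency.
class IterationOrderSketch {

    // Fills the map with keys of the form "<random-uuid>-child-<i>" and
    // returns the values in the order the map iterates them.
    static List<Integer> iterationOrder(Map<String, Integer> map, int n) {
        for (int i = 0; i < n; i++) {
            map.put(UUID.randomUUID() + "-child-" + i, i);
        }
        return new ArrayList<>(map.values());
    }

    public static void main(String[] args) {
        // LinkedHashMap iterates in insertion order: 0, 1, 2, ...
        System.out.println("LinkedHashMap: "
            + iterationOrder(new LinkedHashMap<>(), 8));
        // HashMap order follows the key hashes, so with random keys it
        // typically changes from one run to the next.
        System.out.println("HashMap:       "
            + iterationOrder(new HashMap<>(), 8));
    }
}
```

Running it a few times should show the LinkedHashMap line staying the same while the HashMap line usually changes.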

sdonovan · October 16, 2021, 6:02pm

Just replaced HashMap with LinkedHashMap, and used Workflow.randomUUID() and . . . it’s working.

Thank you!

tihomir · October 16, 2021, 6:15pm

Glad it’s working. I also think you can simplify your code:

    final int numChildren = 200;
    List<Promise<Void>> childPromises = new ArrayList<>();
    // Launch all child workflows.
    for (int i = 0; i < numChildren; i++) {
      final String childWorkflowId = Workflow.randomUUID().toString() + "-child-" + i;

      final ChildWorkflow child =
          Workflow.newChildWorkflowStub(
              ChildWorkflow.class,
              ChildWorkflowOptions.newBuilder()
                  .setWorkflowId(childWorkflowId)
                  .setCancellationType(ChildWorkflowCancellationType.WAIT_CANCELLATION_COMPLETED)
                  .setTaskQueue("test")
                  .setWorkflowTaskTimeout(Duration.ofSeconds(60))
                  .build());

      childPromises.add(Async.procedure(child::execute));
    }

    Promise.allOf(childPromises).get();

The parent workflow will wait until all children have completed in this case.

sdonovan · October 16, 2021, 6:16pm

Thanks, it’s OK. In our case, we specifically need to know when the child workflows have started, such that we can send signals to them.

sp13 · October 16, 2021, 9:07pm

I am getting a different error with same scenario for the below code. I am using a list to collect the executions and workflow results, so I hope determinism is not a problem here.

List<Promise<WorkflowExecution>> executionResults = new ArrayList<>();
List<Promise<Void>> results = new ArrayList<>();

for (int i = 0; i < 150; i++) {
    ChildWorkflow childWf =
            Workflow.newChildWorkflowStub(ChildWorkflow.class, ChildWorkflowOptions.newBuilder()
                    .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON)
                    .setWorkflowId("workflow-" + i)
                    .build());

    results.add(Async.procedure(childWf::execute));
    executionResults.add(Workflow.getWorkflowExecution(childWf));
}

// Wait for all child workflows to be spawned.
Promise.allOf(executionResults).get();
// Wait for all child workflows to complete.
Promise.allOf(results).get();

If you examine the events, after about 6s of elapsed time you can see a WorkflowTaskTimedOut with ScheduleToStart.
I believe this belongs to Promise.allOf(executionResults).get();

maxim · October 16, 2021, 11:27pm

What is the error you are getting? ParentClosePolicy doesn’t apply here as you are waiting for all child workflows to complete.

sp13 · October 16, 2021, 11:28pm

I am seeing a WorkflowTaskTimedOut with ScheduleToStart.

maxim · October 16, 2021, 11:29pm

Is this the last event in the history?

sp13 · October 16, 2021, 11:30pm

No, it happens in an event at around 6s

maxim · October 16, 2021, 11:31pm

It is benign then. Is workflow completed OK?

sp13 · October 16, 2021, 11:32pm

Workflow completed all okay.

sp13 · October 16, 2021, 11:42pm

I don’t know if it is by design - it schedules, fails, and reschedules again until it succeeds in this case. Just wanted to see if this is normal.

maxim · October 16, 2021, 11:45pm

Yes, it is by design. It looks like the worker had failed or was busy. So the workflow task that was scheduled just for that worker wasn’t picked up fast enough and was rescheduled to be picked up by another worker. See the post for the details of this mechanism.
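
A toy model of the rescheduling described above, for readers who want to see the mechanism in code. This is not Temporal server code: StickyQueueSketch, scheduleSticky, and expire are hypothetical names, time is passed in as plain numbers, and the 5-second timeout used in main is an assumed value for illustration. It also assumes tasks are scheduled in non-decreasing time order, so only the head of the sticky queue needs checking.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model only, NOT Temporal server code. All names are hypothetical.
class StickyQueueSketch {

    static final class Task {
        final String id;
        final long scheduledAtMs;

        Task(String id, long scheduledAtMs) {
            this.id = id;
            this.scheduledAtMs = scheduledAtMs;
        }
    }

    final Queue<Task> stickyQueue = new ArrayDeque<>(); // one specific worker
    final Queue<Task> sharedQueue = new ArrayDeque<>(); // any worker
    final long scheduleToStartTimeoutMs;

    StickyQueueSketch(long scheduleToStartTimeoutMs) {
        this.scheduleToStartTimeoutMs = scheduleToStartTimeoutMs;
    }

    // Schedules a workflow task for the worker that owns the sticky queue.
    void scheduleSticky(String id, long nowMs) {
        stickyQueue.add(new Task(id, nowMs));
    }

    // Moves every sticky task that has waited at least the timeout to the
    // shared queue and returns how many tasks were moved.
    int expire(long nowMs) {
        int moved = 0;
        while (!stickyQueue.isEmpty()
            && nowMs - stickyQueue.peek().scheduledAtMs >= scheduleToStartTimeoutMs) {
            sharedQueue.add(stickyQueue.poll());
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        StickyQueueSketch queues = new StickyQueueSketch(5_000); // assumed timeout
        queues.scheduleSticky("workflow-task-1", 0);
        System.out.println("moved after 1s: " + queues.expire(1_000));
        System.out.println("moved after 6s: " + queues.expire(6_000));
    }
}
```

In this model the task is still waiting on the sticky queue after 1s and has been handed to the shared queue by 6s, where any worker could pick it up.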

sp13 · October 16, 2021, 11:55pm

“When a task in a host specific task queue times out it is immediately rescheduled to a shared task list for other hosts to pick up.”
Now all the puzzle pieces fit for me. Thanks very much.
Since there was a task failure in the sticky queue, Temporal rescheduled and replayed the workflow, which had a determinism bug, and that caused the other issue.

sp13 · October 17, 2021, 12:02am

I did a sysout and saw that workflow execution is indeed replayed.
Thinking about it, it was just my local Mac executing this workflow. The worker was not too busy. If this is going to be a common case, the worker cache will be under-utilized.

maxim · October 17, 2021, 12:27am

It is certainly not a common case unless something is broken in your local setup.

sp13 · October 18, 2021, 2:30pm

This is what I am trying with docker compose v1.12. Not sure what is wrong in the setup. If you get a chance, please give it a try. It seems to me that Promise.get() is scheduled way up in the chain of events and the timeout here is inevitable unless Promise.get() is scheduled at the appropriate time.

github.com

sp13ceg/samples-java/blob/master/src/main/java/io/temporal/samples/hello/ParentChildWorkflow.java

package io.temporal.samples.hello;

import io.temporal.api.common.v1.WorkflowExecution;
import io.temporal.api.enums.v1.ParentClosePolicy;
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;
import io.temporal.worker.WorkerFactoryOptions;
import io.temporal.workflow.*;

import java.util.ArrayList;
import java.util.List;

public class ParentChildWorkflow {

    // Define the task queue name
    static final String TASK_QUEUE = "HelloActivityTaskQueue";

This file has been truncated.

maxim · October 18, 2021, 4:18pm

I’m able to run the code you posted without any problem.
