Greetings. We’ve run across a problem you might find interesting. If we launch N child workflows, then (i) wait for them all to start and (ii) wait for them all to complete, the SDK reports event corruption during processing. The specific exception is:
Caused by: java.lang.IllegalStateException: COMMAND_TYPE_START_CHILD_WORKFLOW_EXECUTION doesn't match EVENT_TYPE_START_CHILD_WORKFLOW_EXECUTION_INITIATED with EventId=5
Workflow code below. Note that the child workflow does nothing. I can upload the full Spring project source if it’s helpful.
We’re using Java SDK 1.4.0 against Temporal 1.12.3 (latest docker-compose using MySQL). It launches 200 child workflows. We’ve also seen the problem launching 50 workflows if those workflows do something else (e.g. send signals). Figured you’d want the simpler code.
final int numChildren = 200;
final Map<String, Promise<Void>> childWorkflows = new HashMap<>();
final List<Promise<WorkflowExecution>> childStartPromises = new ArrayList<>();

// Launch all child workflows.
for (int i = 0; i < numChildren; i++) {
    final String childWorkflowId = UUID.randomUUID().toString() + "-child-" + i;
    final ChildWorkflow child =
        Workflow.newChildWorkflowStub(
            ChildWorkflow.class,
            ChildWorkflowOptions.newBuilder()
                .setWorkflowId(childWorkflowId)
                .setCancellationType(WAIT_CANCELLATION_COMPLETED)
                .setTaskQueue("test")
                .setWorkflowTaskTimeout(Duration.ofSeconds(60))
                .build());
    childWorkflows.put(childWorkflowId, Async.procedure(child::execute));
    childStartPromises.add(Workflow.getWorkflowExecution(child));
}

// Wait for all child workflows to start.
Promise.allOf(childStartPromises).get();

// Wait for all child workflows to complete.
final List<Promise<Void>> childEndPromises = new ArrayList<>(childWorkflows.values());
Promise.allOf(childEndPromises).get();
}
Thanks - any assistance would be much appreciated!
Another thing to look into is looping over unordered collections. This can also cause non-deterministic behavior during replay. Make sure that you use ordered collections.
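To illustrate the point about ordered collections, here is a minimal sketch in plain Java (no Temporal SDK involved). The class and method names are hypothetical and exist only for this example. It shows that a LinkedHashMap iterates in insertion order, so a loop over it visits the children in the same order on every run, whereas HashMap gives no such guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderedChildren {
    // Hypothetical helper: registers n child IDs and returns them in iteration order.
    static List<String> idsInIterationOrder(int n) {
        // LinkedHashMap iterates in insertion order.
        // A plain HashMap makes no promise about iteration order.
        Map<String, Integer> children = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            children.put("child-" + i, i);
        }
        return new ArrayList<>(children.keySet());
    }

    public static void main(String[] args) {
        System.out.println(idsInIterationOrder(3)); // prints [child-0, child-1, child-2]
    }
}
```

In the workflow from the first post, this would presumably mean declaring childWorkflows as a LinkedHashMap, or keeping the completion promises in a List as the start promises already are.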
Some other rules that don’t apply to your code but just to add:
Don’t use explicit synchronization in your workflow code.
You can use non-static fields in your workflow definition without having to worry about isolation issues.
I am getting a different error in the same scenario with the code below. I am using a list to collect the executions and workflow results, so I hope determinism is not a problem here.
List<Promise<WorkflowExecution>> executionResults = new ArrayList<>();
List<Promise<Void>> results = new ArrayList<>();
for (int i = 0; i < 150; i++) {
    ChildWorkflow childWf =
        Workflow.newChildWorkflowStub(
            ChildWorkflow.class,
            ChildWorkflowOptions.newBuilder()
                .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON)
                .setWorkflowId("workflow-" + i)
                .build());
    results.add(Async.procedure(childWf::execute));
    executionResults.add(Workflow.getWorkflowExecution(childWf));
}
// wait for all childworkflows to get spawned
Promise.allOf(executionResults).get();
// wait for all childworkflows to complete
Promise.allOf(results).get();
If you examine the events after about 6s of elapsed time, you can see a WorkflowTaskTimedOut event with timeout type ScheduleToStart.
I believe this belongs to Promise.allOf(executionResults).get();
Yes, it is by design. It looks like the worker had failed or was busy. So the workflow task that was scheduled just for that worker wasn’t picked up fast enough and was rescheduled to be picked up by another worker. See the post for the details of this mechanism.
"When a task in a host specific task queue times out it is immediately rescheduled to a shared task list for other hosts to pick up."
Now all the puzzle pieces fit for me. Thanks very much.
Since there was a task timeout in the sticky queue, Temporal rescheduled the task and replayed the workflow, which had a determinism bug, and that caused the other issue.
I added a System.out.println and saw that the workflow execution is indeed replayed.
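As a rough illustration of why a replay exposes a determinism bug, here is a plain-Java sketch (no Temporal SDK; all names are hypothetical). It treats two calls to the same method as the "first execution" and the "replay". IDs built with UUID.randomUUID(), as in the first post's code, differ between the two runs, while IDs derived from a fixed seed match. The seeded generator is only a stand-in for a replay-safe source such as the SDK's Workflow.randomUUID(); it is not a description of how the SDK implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

public class ReplayIdDemo {
    // Non-deterministic: each call produces different child IDs.
    static List<String> randomIds(int n) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ids.add(UUID.randomUUID() + "-child-" + i);
        }
        return ids;
    }

    // Deterministic stand-in: IDs derived from a fixed seed repeat on every call.
    static List<String> seededIds(long seed, int n) {
        Random random = new Random(seed);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            UUID id = new UUID(random.nextLong(), random.nextLong());
            ids.add(id + "-child-" + i);
        }
        return ids;
    }

    public static void main(String[] args) {
        // "First execution" vs. "replay" with random IDs: the lists differ.
        System.out.println(randomIds(3).equals(randomIds(3)));
        // Same comparison with seeded IDs: the lists match.
        System.out.println(seededIds(42L, 3).equals(seededIds(42L, 3)));
    }
}
```

If the replayed code asks to start a child with an ID that is not the one recorded in history, a mismatch between commands and events like the one reported in the first post seems a plausible outcome, though the thread does not confirm this was the only cause.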
Thinking about it, it was just my local Mac executing this workflow, and the worker was not too busy. If this is going to be a common case, the worker cache will be under-utilized.
This is what I am trying with docker compose v1.12. Not sure what is wrong in the setup. If you get a chance, please give it a try. It seems to me that Promise.get() is scheduled far too early in the chain of events, and a timeout here is inevitable unless Promise.get() is scheduled at the appropriate time.