Non Determinism Error

Hi ,

I’m facing non determinism error at times in our temporal workflow.

io.temporal.worker.NonDeterministicException: Failure handling event 922 of type 'EVENT_TYPE_TIMER_STARTED' during replay. Event 922 of type EVENT_TYPE_TIMER_STARTED does not match command type COMMAND_TYPE_REQUEST_CANCEL_ACTIVITY_TASK. {PreviousStartedEventId=1283, WorkflowTaskStartedEventId=1294, CurrentStartedEventId=920}
	at i.t.i.s.WorkflowStateMachines.handleCommandEvent(WorkflowStateMachines.java:373)
	at i.t.i.s.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:269)
	at i.t.i.s.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:240)
	at i.t.i.s.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:216)
	at i.t.i.r.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:193)
	at i.t.i.r.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:140)
	at i.t.i.r.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:129)
	at i.t.i.r.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:98)
	at i.t.i.w.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:295)
	at i.t.i.w.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:239)
	at i.t.i.w.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:180)
	at i.t.i.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)
	... 3 common frames omitted
Wrapped by: java.lang.RuntimeException: Failure processing workflow task. WorkflowId=scheduled-release-workflow-2023-05-16T22:04:41.713941-3531-25166956, RunId=bd923ea7-0e11-452f-8a76-aef33052995a, Attempt=4
	at i.t.i.w.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:286)
	at i.t.i.w.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:180)
	at i.t.i.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:98)
	at j.b.u.c.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at j.b.u.c.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)```

Our use case involves sleeping workflow at regular intervals. From the error, it looks like something is going wrong with respect to timers during the workflow replay. Any help here is much appreciated. Thank you!

How do you calculate the delay?

I’m maintaining a priority queue of localdatetimes. Workflow has a while loop running on this queue which pops the queue and if the value in future time, workflow will be put to sleep till then. Error says timer and command type cancel activity task mismatched, Is there a standard way to call cancel activity calls that does not get affected in replay? As of now, I’m using longRunningCancellationScope to call the activity which needs cancellation in workflow implementation.

What API do you use to get the current time?

We use

Instant.ofEpochMilli(Workflow.currentTimeMillis()).atZone(
                    ZoneId.of("UTC")).toLocalDateTime();

to get the current UTC time.

{
  "message": "Failure handling event 405 of type 'EVENT_TYPE_TIMER_STARTED' during replay. Event 405 of type EVENT_TYPE_TIMER_STARTED does not match command type COMMAND_TYPE_COMPLETE_WORKFLOW_EXECUTION. {PreviousStartedEventId=546, WorkflowTaskStartedEventId=551, CurrentStartedEventId=403}",
  "source": "JavaSDK",
  "stackTrace": "io.temporal.internal.statemachines.WorkflowStateMachines.handleCommandEvent(WorkflowStateMachines.java:373)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:269)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:240)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:216)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:193)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:140)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:129)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:98)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:295)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:239)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:180)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)\njava.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\njava.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\njava.base/java.lang.Thread.run(Thread.java:829)\n",
  "encodedAttributes": null,
  "cause": null,
  "applicationFailureInfo": {
    "type": "io.temporal.worker.NonDeterministicException",
    "nonRetryable": false,
    "details": null
  }
}

Similar exception in another workflow.

We use…

do you use this calculated duration directly in Workflow.sleep(duration) or are you checking it against local time maybe before setting?
Issue seems that the duration you set in sleep can can sometimes when calculated be <=0 and sometimes >0 which would then create different commands and can fail matching to event history during worker replay.

This current date time will be compared with the time popped out of the priority queue. Code looks similar to this:

Queue q; //This is constructed and returned by an activity.
while(poppedTime = getTimeFromQueue()){
LocalDatetime now = Instant.ofEpochMilli(Workflow.currentTimeMillis()).atZone(
                    ZoneId.of("UTC")).toLocalDateTime();
if(now.isBefore(poppedTime)){
Workflow.sleep(Duration.between(now, poppedTime));
}
// Some other code
}

getTimeFromQueue() - Its not an activity. Looks like we are also using current date time in this method which is retrieved using LocalDateTime.now(Clock.systemUTC()); There’s a plan to make this method as side effect to avoid popping the queue during replay. This should make it deterministic, right?

Are there any further corrections required in this flow to fix the non-determinism error? Can you help with it?

getTimeFromQueue() - Its not an activity. Looks like we are also using current date time in this method

Yes would calculate this once in SideEffect and always use the result in loop.
If you want to unblock currently “stuck” executions, could try workaround:

if(now.isBefore(poppedTime) && Duration.between(now, poppedTime).getSeconds() > 0) {
   Workflow.sleep(Duration.between(now, poppedTime));
}

Thanks for the suggestions.
Is there a way to reproduce non determinism error locally? Or verify the determinism in the workflow implementation? This way we can be sure that our code is always deterministic whenever we make some change.

@Sowmya
you can use WorkflowReplayer to replay “old” workflow histories in your test environment and ensure they will work with the new code.

Also, take a look at the documentation Workflows in Java | Legacy documentation for Temporal SDKs

Thanks @antonio.perez, will take a look into these.

How will this additional check of getSeconds()>0 help? As per my understanding, during replay priority queue will always fetch the future time as compared to now variable which is calculated using Workflow.currentTimeMillis() (will not change when replayed, so will always be a past value). Due to this, if(now.isBefore(poppedTime)) will always be true in replay and as in the actual workflow execution this may not be always true, non determinism error happens. Is this the right understanding? Just trying to understand the solution better.

How will this additional check of getSeconds()>0 help?

The real fix is to make sure you have deterministic calculation of your sleep duration. This check might help with possibly unblocking any current stuck executions (or at least you could log when its <=0 and have more insight in debugging issue).