Non-determinism after Versioning

We have a long-running workflow (it uses continue-as-new) with two different signal methods, which are called one after the other to update the workflow (a rough sketch of this shape follows the list):

  1. signal method A - calls an activity to update a workflow property
  2. signal method B - updates the sleep timer
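
A rough sketch of that shape, with hypothetical names (the real interface may differ):

import io.temporal.workflow.SignalMethod
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod

@WorkflowInterface
interface RecurringWorkflow {                 // hypothetical name, for illustration only
    @WorkflowMethod
    fun run()

    @SignalMethod
    fun signalA(input: String)                // calls an activity to update a workflow property

    @SignalMethod
    fun signalB(newWakeUpTime: Long)          // updates the sleep timer
}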

This had been working fine until we versioned the workflow to add a new activity that is called inside the signal method.

val version = Workflow.getVersion("changeForxx", Workflow.DEFAULT_VERSION, 1)
if (version == Workflow.DEFAULT_VERSION) {
    // old activity
} else {
    // new activity
}

We are able to update newly created workflow executions after versioning, but we started getting NonDeterministicException errors ONLY when we update workflow executions created before versioning that are in sleep. Those workflow executions now seem to be stuck.

io.temporal.worker.NonDeterministicException: Failure handling event 19 of type 'EVENT_TYPE_TIMER_STARTED' during replay. Event 19 of type EVENT_TYPE_TIMER_STARTED does not match command type COMMAND_TYPE_SCHEDULE_ACTIVITY_TASK.
Workflow task processing failure. startedEventId=37, WorkflowId=xxx, RunId=xxx. If seen continuously the workflow might be stuck.

I noticed in the workflow history that the first signal's workflow task timed out. I may be wrong, but my guess is the second signal's workflow task, which has the sleep timer event, got processed before the timed-out workflow task was placed back on the queue. Could this have resulted in a history mismatch and non-determinism?

Anyway, we reverted the versioned workflow code, deleted the stuck workflow instances, and recreated them. Everything is back to normal.

I am not sure how it had been working before, but the introduction of versioning caused this weird behavior for previous workflow instances. What do you recommend as a solution if we are to version the workflow?

@maxim Thoughts?

Can you share the full event history json for your execution?

EVENT_TYPE_TIMER_STARTED does not match command type COMMAND_TYPE_SCHEDULE_ACTIVITY_TASK

How are you calculating sleep duration for your Workflow.sleep?

We use the same updatable timer logic as in the sample: samples-java/UpdatableTimer.java at b7809d8444f54e98fc94873922078f28484f769c · temporalio/samples-java · GitHub
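
For readers of the thread, the updatable timer in that sample is roughly along these lines (a simplified Kotlin sketch, not the exact sample code):

import io.temporal.workflow.Workflow
import java.time.Duration

class UpdatableTimer {
    private var wakeUpTime: Long = 0
    private var wakeUpTimeUpdated = false

    // Sleeps until wakeUpTime; re-evaluates whenever updateWakeUpTime is called
    fun sleepUntil(newWakeUpTime: Long) {
        wakeUpTime = newWakeUpTime
        do {
            wakeUpTimeUpdated = false
            val duration = Duration.ofMillis(wakeUpTime - Workflow.currentTimeMillis())
            // Unblocks when the duration elapses or the wake-up time is updated
            Workflow.await(duration) { wakeUpTimeUpdated }
        } while (wakeUpTimeUpdated)
    }

    // Called from a signal handler to move the wake-up time
    fun updateWakeUpTime(newWakeUpTime: Long) {
        wakeUpTime = newWakeUpTime
        wakeUpTimeUpdated = true
    }
}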

I just sent you (private message) the history JSON file.

Could you also share your code in a DM if possible, or at least the parts related to how you set the duration of the timer you create before invoking the NearestValidDay activity?

There are definitely a couple of places in your execution where internal worker replay happens, specifically each time you see a WorkflowTaskTimedOut event. If this is related to worker restarts (for example, you restarted workers after your versioning changes) this is OK; otherwise it is something to look at (why your worker was not able to report workflow task completion to the service within 10s).

I don’t think the error itself is related to your versioning change. It is rather related to worker restarts that trigger internal worker replay, and it is caused by a non-deterministic timer duration calculation given the signal payloads you pass in. For example, the timer duration calculation can produce a positive number on one run, meaning a TimerStarted event would be produced, but produce zero or a negative number when run again, in which case TimerStarted would not be produced. I would try to look into that.

Also, there are no MarkerRecorded events in your history, which would be produced by your added Workflow.getVersion code, so the non-deterministic error happens on internal worker history replay before that change is even processed by the execution.

Interesting…

I just shared code by DM, but we initially set the timer in the workflow like this:

val currentDate = Instant.ofEpochMilli(Workflow.currentTimeMillis()).atZone(ZoneId.systemDefault()).toLocalDate()
if (upcomingDate!!.toLocalDate() == currentDate) {
    log.info("First recurrence is today, not sleeping.")
} else {
    timer.sleepUntil(upcomingDate!!.atZone(ZoneId.systemDefault()).toInstant().toEpochMilli())
    log.info("Awoke from sleep.")
}

Don’t use ZoneId.systemDefault() like that.

A workflow execution in Temporal is not tied to a specific worker process. My guess is you might have worker processes with different system time zones defined (and the execution from the history you shared did “jump” workers).
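
To illustrate the kind of flip this can cause (hypothetical values, not taken from the shared history): the same workflow timestamp can land on different LocalDates depending on the worker's system time zone, which changes whether the sleep branch above runs at all.

import java.time.Instant
import java.time.ZoneId

fun main() {
    val nowMillis = 1_696_118_400_000L  // 2023-10-01T00:00:00Z, an arbitrary example instant
    val utcDate = Instant.ofEpochMilli(nowMillis).atZone(ZoneId.of("UTC")).toLocalDate()              // 2023-10-01
    val nyDate = Instant.ofEpochMilli(nowMillis).atZone(ZoneId.of("America/New_York")).toLocalDate()  // 2023-09-30
    // If worker A computes utcDate and worker B computes nyDate, the
    // upcomingDate == currentDate check can evaluate differently on replay,
    // so one run starts a timer (TimerStarted event) and the other does not.
    println("$utcDate vs $nyDate")
}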

What I think you should do instead is something like:

ZoneId zid = Workflow.sideEffect(ZoneId.class, () -> ZoneId.systemDefault());

and then make sure to use zid everywhere after that. That way the zone id is recorded in the event history and will be used across all worker processes that might be involved in the workflow execution.
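
Applied to the Kotlin snippet above, that could look roughly like this (a sketch reusing the variables from that snippet, not the poster's actual code; the zone is recorded as a plain string via the side effect to keep payload serialization simple):

// Record the worker's zone once; on replay the recorded value is reused instead of re-evaluating it
val zoneId = ZoneId.of(Workflow.sideEffect(String::class.java) { ZoneId.systemDefault().id })

val currentDate = Instant.ofEpochMilli(Workflow.currentTimeMillis()).atZone(zoneId).toLocalDate()
if (upcomingDate!!.toLocalDate() == currentDate) {
    log.info("First recurrence is today, not sleeping.")
} else {
    timer.sleepUntil(upcomingDate!!.atZone(zoneId).toInstant().toEpochMilli())
    log.info("Awoke from sleep.")
}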

Thanks @tihomir, that is good to know.

A workflow execution in Temporal is not tied to a specific worker process. My guess is you might have worker processes with different system time zones defined (and the execution from the history you shared did “jump” workers).

I am not saying this is not a problem, but that issue should still have been present after reverting the versioning, which is not the case, so I am not sure why that is.

This is not a problem that “goes away” (non-determinism).
If your executions run from start to finish on the same worker and never get evicted from the worker cache, then you will not run into this issue even if your code is, let's say, completely non-deterministic.
However, if at any point during the execution it makes some progress on worker A but then moves over to worker B (for example due to worker restarts, or your workers evicting this execution from their internal memory cache), the worker that picks it up to continue the execution does an internal replay, at which point you might see the non-deterministic error happen.

That’s why it's useful and important to test your workflow executions with WorkflowReplayer to try to catch non-deterministic issues.
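
A minimal replay test could look roughly like this (a sketch: MyWorkflowImpl and the history file name are placeholders, and it assumes the io.temporal:temporal-testing dependency plus JUnit 5):

import io.temporal.testing.WorkflowReplayer
import org.junit.jupiter.api.Test

class MyWorkflowReplayTest {
    @Test
    fun replaysProductionHistoryWithoutNonDeterminism() {
        // Throws if replaying the recorded history against the current workflow code diverges
        WorkflowReplayer.replayWorkflowExecutionFromResource(
            "my_workflow_history.json",   // downloaded event history JSON on the test classpath
            MyWorkflowImpl::class.java    // placeholder for your workflow implementation class
        )
    }
}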

I see what you are saying. Maybe we just got lucky with the workers so far, until we introduced versioning. The weird part is that we are not able to reproduce the issue outside the prod environment.

I agree that we should use Replayer to validate determinism.

Also how come only the previous executions got affected, but not the new ones after versioning?

Also how come only the previous executions got affected, but not the new ones after versioning?

Can you share the event history for the new execution? My guess would be that it didn't bounce workers (you can check and compare the “identity” field in each one of your WorkflowTaskStarted events to see that).

One way you can find out from the event history whether this happens is also to look at the “Task Queue Kind” property on WorkflowTaskScheduled events. The first workflow task will always have the value “Normal”; the second should then be “Sticky” (the server tries to push workflow tasks to the worker that has so far processed the execution).
If after that it goes from “Sticky” back to “Normal”, that's a good indication that internal worker replay happened.

Correct, for new executions it did not bounce workers. Why would it only bounce workers for older executions when the timer code is the same? :confused:

You have a couple of WorkflowTaskTimedOut events in the history, meaning a worker picked up the task but was not able to report task completion back to the service within the default workflow task timeout (10s).
This can be related to worker restarts: when you restart a worker while it is processing a workflow task, it cannot report back, so the service times out the task and puts it back onto the “Normal” task queue kind (the global task queue where any worker polling on that task queue can pick it up and continue processing the execution).

In your event history you can see that after WorkflowTaskTimedOut the task queue kind switches from Sticky to Normal, and you can also see a worker identity different from the identity that was processing the workflow task before the one that timed out.

It could also indicate possible issues on your workers that for some reason prevented them from reporting back within 10s. Check worker logs and mem/cpu utilization around the time you see WorkflowTaskTimedOut events, and also the Timeout Type property each one holds.

My guess is that, since you are introducing changes to your code and restarting workers to apply them, that is what caused it for the execution you shared.

Even if in your use case the workflow execution moves from worker A to worker B, the situation where this non-deterministic issue can happen is when A and B have different system time zones defined.
Sometimes an execution can go from A back to A, in which case you would also, imo, not get this error. So just FYI, it can happen “sometimes”.

I haven’t dug deep into the thread, but just based on the description, I feel like this could be related to this SDK bug: Workflow.getVersion() could cause NonDeterminsticError with multithreading + timer · Issue #1430 · temporalio/sdk-java · GitHub

Essentially, signal methods are executed on multiple threads (each signal gets its own thread), and using a timer + versioning together can trigger that issue.

One way to get around it is to use this global versioning pattern: https://stackoverflow.com/questions/73941723/what-is-a-good-way-pattern-to-use-temporal-cadence-versioning-api/73941724#73941724

It makes versioning easier to use and understand, and it also works around that SDK bug.
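
As a rough sketch of that pattern (hypothetical names; see the linked answer for the authoritative description): a single version marker is read once, before any other workflow logic runs, stored in a field, and every later branch checks that field instead of calling Workflow.getVersion at the branch site.

import io.temporal.workflow.SignalMethod
import io.temporal.workflow.Workflow
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod

@WorkflowInterface
interface VersionedWorkflow {           // hypothetical interface, for illustration only
    @WorkflowMethod
    fun run()

    @SignalMethod
    fun signalA(input: String)
}

class VersionedWorkflowImpl : VersionedWorkflow {
    private var globalVersion = Workflow.DEFAULT_VERSION

    override fun run() {
        // Read the global version exactly once, as the first thing the workflow method does
        globalVersion = Workflow.getVersion("GlobalChange", Workflow.DEFAULT_VERSION, 1)
        // ... timers, activities, continue-as-new ...
    }

    override fun signalA(input: String) {
        if (globalVersion >= 1) {
            // new behavior (e.g. call the new activity)
        } else {
            // old behavior
        }
    }
}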

@tihomir @qlong Thanks for your inputs.

Do you think it is an issue even if there is no multithreading involved?
I will try to replicate it locally using replay tests and keep you guys posted.

I believe it is.
In the Java SDK (not in others like the Go SDK), signal methods are invoked/processed in the same way as if you were using Async.procedure. sdk-java/SyncWorkflow.java at c0b3f6b8d64813f7527b00218f37e5f541405550 · temporalio/sdk-java · GitHub

Reviving the thread…

@tihomir @qlong We could not reproduce the non-determinism with a replay test (it ran successfully) in our case, so that bug may not be related to this issue.

So we have been avoiding making changes to this workflow in production, but now we are at a point where we have to make changes to it.

@tihomir Like you mentioned:

it is related to worker restarts that trigger internal worker replay, and it is caused by a non-deterministic timer duration calculation given the signal payloads you pass in. For example, the timer duration calculation can produce a positive number on one run, meaning a TimerStarted event would be produced, but produce zero or a negative number when run again, in which case TimerStarted would not be produced. I would try to look into that.

The timer duration calculation appears to be the more likely reason for the non-determinism. Our application uses UTC, like the Temporal server, so ZoneId.systemDefault() should be the same everywhere. Wondering how we can reproduce this non-determinism.