Exception handling due to Workflow await

Hi guys,

We have observed the following behaviours when a workflow is in the await state, i.e. when Workflow.await is called with a timeout.

1.) First, we get an exception when the timeout is reached while the application is down. When the application/pod is restarted, the following exception appears in the console during startup:

Caused by:
io.temporal.worker.NonDeterministicException: Failure handling event 17 of type ‘EVENT_TYPE_TIMER_STARTED’ during replay. Event 17 of type EVENT_TYPE_TIMER_STARTED does not match command type COMMAND_TYPE_SCHEDULE_ACTIVITY_TASK. {PreviousStartedEventId=15, WorkflowTaskStartedEventId=25, CurrentStartedEventId=15}

Do you have any idea how to catch this exception during startup and terminate the workflow? The workflow itself is stuck, no longer responds, and stays in the Running status.

2.) Second, when Workflow.await is called and the workflow waits for, e.g., 2 days, and the application is redeployed in the Kubernetes cluster during that time, we get a WorkflowTaskTimedOut exception (timeout type: ScheduleToStart) when the workflow tries to continue, even though no code was changed in the workflow or activities.

Same question as before: how can we avoid this issue? Due to current development requirements we have no timeout set at all. Of course this is not recommended, but it is currently necessary in our dev phase.

BW Maik

Hello @maikdrop

  1. It looks like you have a non-deterministic condition in your workflow.await. If you want to show your workflow code we can review it.

Are you calculating the Workflow.await duration based on the system time?

NonDeterministicException is not propagated to the workflow code, see:

It is worth reading this as well

  2. This is expected based on what you have described. The WorkflowTask scheduleToStart timeout is 10 seconds. After the Workflow.await completes (TimerFired event), a WorkflowTask is put in the queue by the server (WorkflowTaskScheduled), and if your worker is down the WorkflowTask is not picked up. Once your worker comes back, it is going to pick up the WorkflowTask and the workflow execution will continue.

@maikdrop are you on the recent version of the Temporal service and the SDKs? I believe the issue with the “ScheduleToStart” timeout after a long delay was addressed already.

Hi Maxim,

that’s a valid point. Our versions are not up to date. OK, we will update the SDK and the Temporal version. Afterwards I’m going to try to reproduce the described issue again. Thx

Hi Antonio,

1.) The await duration is calculated based on a dateTime entered by a user in the Front End. It’s currently always the end of day of the entered date. This date is stored in a DB entity. And this entity is put into the workflow.

2.) Yes, this is how it should be. But the worker doesn’t pick up the workflow task. Instead we get an exception. I haven’t documented it, but I think we have an io.temporal.worker.NonDeterministicException as well.

Hi Maxim,

would you recommend to update to the recent version v1.18.2 or a previous one? It was released yesterday.

BW Maik

@maikdrop

  1. Can you check if the value (duration) passed to Workflow.await is calculated in a deterministic way? Are you using System.currentTimeMillis() to calculate the duration? It would help if you shared the code.

  2. Sorry, I am confused here. You mentioned before that the error was WorkflowTaskTimedOut. Are we talking about two different issues?
    io.temporal.worker.NonDeterministicException is thrown by the SDK when it runs the code (on replay) in the worker. If that is the error, I think that the worker picked up the workflow task.

@maikdrop

Workflow.await/sleep creates a TimerStarted event if duration value is > 0. If duration value is <= 0 it won’t create the TimerStarted event.

From the error you have shared, it looks like you have a Workflow.await/sleep (duration) followed by an activity invocation. The workflow code is executed, and it records a TimerStarted event in the Event History. Then, on replay, the value of duration is calculated again with a value <= 0. The SDK replays the Event History and it expects to find an ActivityTaskScheduled event instead of a TimerStarted event.
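
To make the mismatch concrete, here is a minimal sketch in plain Java. It is a toy model and not the Temporal SDK: the class name, the method names, and the command strings are made up for illustration. It only shows how a positive duration on the first run and a non-positive duration on replay yield different command sequences.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT the Temporal SDK, just an illustration of the mismatch.
final class ReplayMismatchSketch {

    // Commands a hypothetical workflow (await, then one activity) emits for a given duration.
    static List<String> commandsFor(Duration awaitDuration) {
        List<String> commands = new ArrayList<>();
        // A timer is only started when the duration is strictly positive.
        if (!awaitDuration.isZero() && !awaitDuration.isNegative()) {
            commands.add("START_TIMER");
        }
        commands.add("SCHEDULE_ACTIVITY_TASK");
        return commands;
    }

    // Index of the first position where recorded and replayed commands differ, or -1 if none.
    static int firstMismatch(List<String> recorded, List<String> replayed) {
        int n = Math.min(recorded.size(), replayed.size());
        for (int i = 0; i < n; i++) {
            if (!recorded.get(i).equals(replayed.get(i))) {
                return i;
            }
        }
        return recorded.size() == replayed.size() ? -1 : n;
    }

    public static void main(String[] args) {
        List<String> recorded = commandsFor(Duration.ofHours(5));   // first execution
        List<String> replayed = commandsFor(Duration.ofHours(-1));  // replay after the deadline
        System.out.println("recorded: " + recorded);
        System.out.println("replayed: " + replayed);
        System.out.println("first mismatch at index " + firstMismatch(recorded, replayed));
    }
}
```

In this model the recorded history starts with a timer, while the replay starts with the activity, which is the same shape as the TIMER_STARTED vs. SCHEDULE_ACTIVITY_TASK mismatch in the error message.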

1.) It’s calculated in a deterministic way. The input date comes from a DB entity, which gets the time from a user input.

    fun getDateTimeFromEndOfDay(date: OffsetDateTime): OffsetDateTime = OffsetDateTime.of(
        date.toLocalDate(),
        LocalTime.of(23, 59, 59, 0), ZoneOffset.UTC
    )

2.) The WorkflowTaskTimedOut is shown in the Temporal Web UI. I’m not 100% sure which exception was shown in the application console, but I’m pretty sure there was some output. Unfortunately I haven’t documented it.

For 1.
Is getDateTimeFromEndOfDay currently called directly in your workflow code? If so, I would move it to a SideEffect or a local activity (probably better a side effect) and pass the result as the duration to Workflow.await/sleep. This way it is recorded in the workflow history, and on workflow replay (which happens when you shut down your worker(s) and they come back up) it is not calculated again; the recorded result is used instead.

This is because

    Workflow.await(Duration.ofSeconds(n), () -> condition);

produces a different event history when n <= 0 than when n > 0, which we believe leads to the non-deterministic error you pasted.
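
As a hedged illustration of how n could flip sign between the first run and a replay, here is a plain-Java sketch. It assumes that the duration is computed as "end of day minus now" and that "now" comes from the wall clock; the thread does not confirm this. The class name, the remaining helper, and the dates are hypothetical, and endOfDay is only a Java rendering of getDateTimeFromEndOfDay.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

final class AwaitDurationSketch {

    // Java rendering of getDateTimeFromEndOfDay from the post above.
    static OffsetDateTime endOfDay(OffsetDateTime date) {
        return OffsetDateTime.of(date.toLocalDate(), LocalTime.of(23, 59, 59, 0), ZoneOffset.UTC);
    }

    // Hypothetical helper: time left until the end of day, measured from the supplied "now".
    static Duration remaining(OffsetDateTime date, Instant now) {
        return Duration.between(now, endOfDay(date).toInstant());
    }

    public static void main(String[] args) {
        OffsetDateTime userDate = OffsetDateTime.parse("2022-10-05T10:00:00Z");
        Instant firstRun = Instant.parse("2022-10-05T12:00:00Z");      // before the deadline
        Instant afterRestart = Instant.parse("2022-10-07T08:00:00Z");  // wall clock at replay
        System.out.println("first run:     " + remaining(userDate, firstRun));     // positive
        System.out.println("after restart: " + remaining(userDate, afterRestart)); // negative
    }
}
```

If "now" is taken from Workflow.currentTimeMillis() instead of the wall clock, replay sees the same timestamp as the original execution, so the sign of the duration would not change.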

Yes, it’s calculated in the workflow. What do you mean by "move it to SideEffect"?

Thanks for the explanation!!!

What do you mean with move it to SideEffect

Check out sample: samples-java/HelloSideEffect.java at main · temporalio/samples-java · GitHub
In SDK: sdk-java/Workflow.java at master · temporalio/sdk-java · GitHub
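
For readers who want the idea without opening the links, here is a toy model in plain Java of what "recorded in workflow history and reused on replay" means. It is not the SDK implementation: the class, its fields, and the values are invented for illustration. In the Java SDK the real entry point is Workflow.sideEffect(Class<R>, Func<R>), as used in the HelloSideEffect sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: mimics the record-then-reuse idea, not the Temporal SDK implementation.
final class SideEffectSketch {
    private final List<Object> history;   // recorded results, in call order
    private int cursor = 0;               // position of the next side effect call

    SideEffectSketch(List<Object> history) {
        this.history = history;
    }

    @SuppressWarnings("unchecked")
    <T> T sideEffect(Supplier<T> func) {
        if (cursor < history.size()) {
            return (T) history.get(cursor++);   // replay: reuse the recorded value
        }
        T value = func.get();                   // first execution: run and record
        history.add(value);
        cursor++;
        return value;
    }

    public static void main(String[] args) {
        List<Object> history = new ArrayList<>();
        // First execution: the supplier runs and its result (seconds to wait) is recorded.
        long first = new SideEffectSketch(history).sideEffect(() -> 7200L);
        // Replay: the supplier would now yield a non-positive value, but it is never called.
        long replayed = new SideEffectSketch(history).sideEffect(() -> -60L);
        System.out.println(first + " " + replayed);
    }
}
```

Because the replay reuses the recorded value, the duration passed to Workflow.await/sleep stays the same on both runs in this model.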

How do you create the date argument passed to getDateTimeFromEndOfDay? If you create it from Workflow.currentTimeMillis() then it should be deterministic.

Just to add, opened SDK feature request [Feature Request] Workfow sleep/await with 0 duration · Issue #145 · temporalio/sdk-features · GitHub

We have a DB entity, which has a time property. The value of this property was entered by a user in the front end. The created entity is put into the workflow and the time property is the input of getDateTimeFromEndOfDay.

" is put into the workflow "

Is it passed as a workflow argument? Then it should be deterministic.

Yes, it’s passed as an argument to the workflow.

I’m not sure if this is deterministic if the worker’s time zone changes. I would look through your code for other sources of non-determinism.
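
A small plain-Java sketch of why the offset can matter: date.toLocalDate() depends on the offset the OffsetDateTime carries, so the same instant can map to different calendar days. This assumes the date could be re-materialized with a different offset (for example the worker's default zone) when the argument is deserialized, which the thread does not confirm. The class name and the sample instant are hypothetical.

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

final class OffsetSensitivitySketch {

    // Java rendering of getDateTimeFromEndOfDay from earlier in the thread.
    static OffsetDateTime endOfDay(OffsetDateTime date) {
        return OffsetDateTime.of(date.toLocalDate(), LocalTime.of(23, 59, 59, 0), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2022-10-05T23:30:00Z");
        // The same instant, carried with two different offsets.
        OffsetDateTime asUtc = instant.atOffset(ZoneOffset.UTC);
        OffsetDateTime asPlusTwo = instant.atOffset(ZoneOffset.ofHours(2));

        System.out.println("same instant: " + asUtc.isEqual(asPlusTwo));
        System.out.println("end of day (UTC input):    " + endOfDay(asUtc));
        System.out.println("end of day (+02:00 input): " + endOfDay(asPlusTwo));
    }
}
```

Here the +02:00 rendering already falls on the next calendar day, so the computed end of day is a full day later even though both inputs denote the same instant.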