How to automatically restart workflows that have a non-deterministic error

Youssef_Saeed · March 10, 2023, 2:50am

Is there a way to automatically restart workflows that hit errors such as non-deterministic errors?
I tried to catch the error and call continue_as_new, but I have read that non-deterministic errors can’t be caught.
Does a workflow have some configuration option to just restart on a non-deterministic error?
I also have another approach in mind: create a cron job that lists all workflows that have that error and then restarts them. Is it possible to do this?

Chad_Retz · March 10, 2023, 12:57pm

No, the workflow is intentionally “hung” with task failure which you can see in the UI. If a non-deterministic error occurs, you shouldn’t just restart it, you should fix the code so it cannot ever occur again. Redeploying the code will usually automatically continue the workflow if you solved the non-determinism, but if the code change cannot solve it (sometimes it can’t), you can terminate the workflow and then restart it.
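
The terminate-then-restart fallback described above could be sketched with the Python SDK roughly as follows. This is an untested sketch under assumptions: the function name `terminate_and_restart` and the `targets` structure are hypothetical, the caller has to supply the workflow type name, task queue, and the input for each new run, and it assumes the workflow ID reuse policy in effect allows starting a new run under the ID of a terminated one.

```python
async def terminate_and_restart(client, workflow_type, task_queue, targets, reason):
    """Terminate each stuck run, then start a fresh run under the same workflow ID.

    `client` is expected to be a temporalio.client.Client.
    `targets` (hypothetical structure) is an iterable of
    (workflow_id, run_id, args) tuples, where `args` is the input for the new run.
    """
    restarted = []
    for workflow_id, run_id, args in targets:
        # Pin the handle to the specific run so only that run is terminated.
        handle = client.get_workflow_handle(workflow_id, run_id=run_id)
        await handle.terminate(reason=reason)
        # Start a new run with the same ID, referring to the workflow by name.
        await client.start_workflow(
            workflow_type, args=list(args), id=workflow_id, task_queue=task_queue
        )
        restarted.append(workflow_id)
    return restarted
```

Terminating discards whatever progress the old run had made, so this only makes sense once the code has been fixed and the workflow can safely start over.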

Youssef_Saeed · March 10, 2023, 3:41pm

Thanks @Chad_Retz
I didn’t explain my use case clearly, so let me clarify.
Let’s say I deployed some code changes to the workflow without using patching, so the workflow may throw a non-deterministic error (because the history doesn’t match),
and let’s say thousands of workflows hit that issue.
How can I automatically terminate and restart the workflows that have the non-deterministic error? Is it possible to list the workflows that have that error with the Python SDK?

Chad_Retz · March 10, 2023, 4:30pm

Ah, we admittedly don’t have a good way to “select all running workflows with this specific task failure occurring”. You can list running workflows by a specific workflow type (i.e. workflow name/class), but I am afraid you can’t list only the ones that have a certain error.

But if you must do this, you can fetch low-level history events and check for a task failure. It’d be a bit advanced, but it’d look something like (untested, just typed here in chat):

bad_ids = []
async for wf in my_client.list_workflows("WorkflowType = 'MyWorkflow'"):
    handle = my_client.get_workflow_handle(wf.id, run_id=wf.run_id)
    async for event in handle.fetch_history_events():
        # Check low-level protobuf event
        if event.HasField("workflow_task_failed_event_attributes"):
            if event.workflow_task_failed_event_attributes...
                bad_ids.append(wf.id)
                break

See the proto API at api/message.proto at master · temporalio/api · GitHub
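
A slightly fuller version of the snippet above might look like the sketch below. It is also untested against a real cluster and rests on assumptions: the helper names are hypothetical, the server address is a placeholder, the query adds an `ExecutionStatus = 'Running'` filter, and it assumes the failed task is recorded with the `WORKFLOW_TASK_FAILED_CAUSE_NON_DETERMINISTIC_ERROR` cause from the proto API. The cause is passed in as a parameter so the helpers do not depend on the SDK being importable.

```python
import asyncio

# Field name taken from the snippet above; it is set on WorkflowTaskFailed events.
FAILED_FIELD = "workflow_task_failed_event_attributes"


def is_task_failure_with_cause(event, cause) -> bool:
    """True if a low-level history event is a workflow task failure with `cause`."""
    return (
        event.HasField(FAILED_FIELD)
        and event.workflow_task_failed_event_attributes.cause == cause
    )


async def find_bad_workflows(client, query, cause):
    """Return (workflow_id, run_id) pairs whose history has a matching task failure."""
    bad = []
    async for wf in client.list_workflows(query):
        handle = client.get_workflow_handle(wf.id, run_id=wf.run_id)
        async for event in handle.fetch_history_events():
            if is_task_failure_with_cause(event, cause):
                bad.append((wf.id, wf.run_id))
                break  # one matching event is enough for this workflow
    return bad


async def main():
    # Imported here so the helpers above stay usable without the SDK installed.
    from temporalio.api.enums.v1 import WorkflowTaskFailedCause
    from temporalio.client import Client

    client = await Client.connect("localhost:7233")  # placeholder address
    bad = await find_bad_workflows(
        client,
        "WorkflowType = 'MyWorkflow' AND ExecutionStatus = 'Running'",
        WorkflowTaskFailedCause.WORKFLOW_TASK_FAILED_CAUSE_NON_DETERMINISTIC_ERROR,
    )
    print(bad)

# To try it against a real cluster: asyncio.run(main())
```

Note that this flags any run whose history contains such a failure event, including runs where the failure was later resolved by a redeploy, so the result is best treated as a list of candidates to inspect rather than a definitive list.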

Youssef_Saeed · March 10, 2023, 7:42pm

Thanks a bunch! @Chad_Retz
this will help a lot!

arttii · June 5, 2024, 1:49pm

Is there any other way to handle this nicely without operational overhead? In some cases, we aren’t concerned about the internal state of the workflow per se, as the actual state is queried from an external source anyway. Currently, if a developer changes something that might lead to non-determinism on existing flows, we basically have to call reset on all the impacted workflows with a script. Is this the best way to do it?

Because as I understand it, patching is for when you want to keep compatibility with running flows, but what’s the best way to handle it if we do not care about this that much?

Any tips would be greatly appreciated.
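
For reference, a reset script of the kind described above could look roughly like this with the Python SDK. It is an untested sketch under assumptions: the function names are hypothetical, it goes through the raw gRPC `ResetWorkflowExecution` call on `client.workflow_service`, and it resets to the first completed workflow task, which throws away all later progress in that run.

```python
import uuid


async def first_completed_task_event_id(handle):
    """Return the event_id of the first WorkflowTaskCompleted event, or None."""
    async for event in handle.fetch_history_events():
        if event.HasField("workflow_task_completed_event_attributes"):
            return event.event_id
    return None


async def reset_to_first_task(client, workflow_id, run_id, reason):
    """Reset one run to its first completed workflow task; False if there is none."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from temporalio.api.common.v1 import WorkflowExecution
    from temporalio.api.workflowservice.v1 import ResetWorkflowExecutionRequest

    handle = client.get_workflow_handle(workflow_id, run_id=run_id)
    event_id = await first_completed_task_event_id(handle)
    if event_id is None:
        return False
    await client.workflow_service.reset_workflow_execution(
        ResetWorkflowExecutionRequest(
            namespace=client.namespace,
            workflow_execution=WorkflowExecution(
                workflow_id=workflow_id, run_id=run_id
            ),
            reason=reason,
            workflow_task_finish_event_id=event_id,
            request_id=str(uuid.uuid4()),  # idempotency key for the request
        )
    )
    return True
```

Whether resetting this far back is acceptable depends on the workflow. It fits the case described here, where the real state lives in an external source, but any activities the run had already completed will be executed again.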

mdiamond · March 11, 2025, 8:48pm

I’d also like to know more about how to handle this.
