How to automatically restart workflows that have a non-deterministic error

Is there a way to automatically restart workflows that hit errors such as non-deterministic errors?
I tried catching the error and calling continue_as_new, but I have read that non-deterministic errors can’t be caught.
Does the workflow have some configuration to simply restart on a non-deterministic error?
Another approach I have in mind is a cron job that lists all workflows with that error and restarts them. Is that possible?

No, the workflow is intentionally “hung” with a task failure, which you can see in the UI. If a non-deterministic error occurs, you shouldn’t just restart the workflow; you should fix the code so the error cannot occur again. Redeploying the code will usually let the workflow continue automatically if you solved the non-determinism, but if a code change cannot solve it (sometimes it can’t), you can terminate the workflow and then restart it.
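
As a rough sketch of that last-resort terminate-and-restart path with the Python SDK (untested against a real server): `terminate_and_restart` is a hypothetical helper name, `client` is assumed to behave like `temporalio.client.Client`, and the original workflow input is assumed to be known to the caller.

```python
import asyncio
from typing import Any


async def terminate_and_restart(
    client: Any,
    workflow_id: str,
    workflow_type: str,
    arg: Any,
    task_queue: str,
) -> Any:
    """Terminate a stuck execution, then start a fresh one under the same ID."""
    # Assumption: `client` exposes get_workflow_handle() and start_workflow()
    # the way temporalio.client.Client does.
    handle = client.get_workflow_handle(workflow_id)
    await handle.terminate(reason="unrecoverable non-determinism error")
    # The new run starts from scratch with the given input; the old history
    # is not replayed.
    return await client.start_workflow(
        workflow_type, arg, id=workflow_id, task_queue=task_queue
    )
```

Anything the terminated run had already done (completed activities, signals received) is not carried over, so this only fits workflows that are safe to run again from the beginning.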

Thanks @Chad_Retz
I didn’t explain my use case well, so let me clarify.
Let’s say I deployed some code changes to the workflow without using patching, so the workflow may throw a non-deterministic error (because the history doesn’t match),
and let’s say thousands of workflows hit that issue.
How can I automatically terminate and restart the workflows that have the non-deterministic error? Is it possible to list the workflows that have that error with the Python SDK?

Ah, we admittedly don’t have a good way to “select all running workflows with this specific task failure occurring”. You can list running workflows by a specific workflow type (i.e. workflow name/class), but I’m afraid you can’t list only the ones that have a certain error.

But if you must do this, you can fetch the low-level history events and check for a task failure. It’d be a bit advanced, but it’d look something like this (untested, just typed here in chat):

bad_ids = []
# List only running executions of the workflow type
async for wf in my_client.list_workflows("WorkflowType = 'MyWorkflow' AND ExecutionStatus = 'Running'"):
    handle = my_client.get_workflow_handle(wf.id, run_id=wf.run_id)
    async for event in handle.fetch_history_events():
        # Check low-level protobuf event
        if event.HasField("workflow_task_failed_event_attributes"):
            if event.workflow_task_failed_event_attributes...
                # Record the workflow once and stop scanning its history
                bad_ids.append(wf.id)
                break

See the proto API at api/message.proto (master branch) in the temporalio/api repository on GitHub.
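
A slightly fuller sketch of the same scan, still untested against a real server. The helper names `is_task_failure_with_cause` and `find_failed_workflow_ids` are hypothetical, `client` is assumed to behave like `temporalio.client.Client`, and comparing the event's `cause` field to a non-determinism enum value is an assumption about which attribute to inspect, so check it against the proto definition before relying on it.

```python
import asyncio
from typing import Any, List


def is_task_failure_with_cause(event: Any, cause: int) -> bool:
    """True if a history event is a workflow task failure with the given cause."""
    # Assumption: `event` is a protobuf HistoryEvent (or behaves like one).
    if not event.HasField("workflow_task_failed_event_attributes"):
        return False
    return event.workflow_task_failed_event_attributes.cause == cause


async def find_failed_workflow_ids(client: Any, query: str, cause: int) -> List[str]:
    """Collect IDs of listed workflows whose history has a matching task failure."""
    bad_ids: List[str] = []
    async for wf in client.list_workflows(query):
        handle = client.get_workflow_handle(wf.id, run_id=wf.run_id)
        async for event in handle.fetch_history_events():
            if is_task_failure_with_cause(event, cause):
                # Record the workflow once and stop scanning its history.
                bad_ids.append(wf.id)
                break
    return bad_ids
```

With the real SDK, `cause` would presumably be `WorkflowTaskFailedCause.WORKFLOW_TASK_FAILED_CAUSE_NON_DETERMINISTIC_ERROR` from `temporalio.api.enums.v1`, and `query` something like `"WorkflowType = 'MyWorkflow' AND ExecutionStatus = 'Running'"`. Note that this matches any such failure in the history, including one the workflow has since recovered from, and that fetching full histories for thousands of workflows can be slow.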

Thanks a bunch! @Chad_Retz
this will help a lot!