How to automatically restart workflows that have a non-deterministic error

Is there a way to automatically restart workflows that fail with errors such as non-deterministic errors?
I tried catching the error and calling continue_as_new, but I have read that non-deterministic errors can’t be caught.
Does the workflow have some configuration to simply restart on a non-deterministic error?
Another approach I considered is a cron job that lists all workflows with that error and restarts them. Is that possible?

No, the workflow is intentionally “hung” with a task failure, which you can see in the UI. If a non-deterministic error occurs, you shouldn’t just restart it; you should fix the code so it cannot ever occur again. Redeploying the code will usually automatically continue the workflow if you solved the non-determinism, but if the code change cannot solve it (sometimes it can’t), you can terminate the workflow and then restart it.

Thanks @Chad_Retz
I didn’t explain my use case, so let me clarify.
Let’s say I deployed some code changes to the workflow without using patching, so the workflow may throw a non-deterministic error (because the history doesn’t match),
and let’s say thousands of workflows hit that issue.
How can I automatically terminate and restart the workflows that have a non-deterministic error? Is it possible to list the workflows that have that error with the Python SDK?

Ah, we admittedly don’t have a good “select all running workflows with this specific task failure occurring” query. You can list running workflows by a specific workflow type (i.e. workflow name/class), but I am afraid you can’t list only the ones that have a certain error.

But if you must do this, you can fetch low-level history events and check task failure. It’d be a bit advanced, but it’d look something like (untested, just typed here in chat):

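from temporalio.api.enums.v1 import WorkflowTaskFailedCause

bad_ids = []
# Only running workflows of this type can still be stuck on a failing task
query = "WorkflowType = 'MyWorkflow' AND ExecutionStatus = 'Running'"
async for wf in my_client.list_workflows(query):
    handle = my_client.get_workflow_handle(wf.id, run_id=wf.run_id)
    async for event in handle.fetch_history_events():
        # Check low-level protobuf event
        if event.HasField("workflow_task_failed_event_attributes"):
            # Assumption: the check left out above is on the failure cause
            cause = event.workflow_task_failed_event_attributes.cause
            if cause == WorkflowTaskFailedCause.WORKFLOW_TASK_FAILED_CAUSE_NON_DETERMINISTIC_ERROR:
                bad_ids.append(wf.id)
                break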

See the proto API at api/message.proto (master branch) in the temporalio/api repository on GitHub.
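
Once you have the IDs, the "terminate and then restart" step could look roughly like the sketch below (also untested against a real server). It assumes `client` behaves like the Python SDK's `temporalio.client.Client`, that the default workflow ID reuse policy lets you start a new run under a terminated workflow's ID, and that you can rebuild each workflow's input yourself. The `terminate_and_restart` helper and its `make_args` callback are hypothetical names made up for this example, not part of the SDK.

```python
import asyncio


async def terminate_and_restart(client, workflow_ids, workflow_type, make_args, task_queue):
    """Terminate each workflow, then start a fresh run under the same ID.

    `make_args(workflow_id)` is a hypothetical callback that returns the list
    of arguments for the new run (e.g. looked up from your own data store).
    Returns (restarted_ids, [(failed_id, exception), ...]).
    """
    restarted, failed = [], []
    for workflow_id in workflow_ids:
        try:
            # No run_id is given, so the handle targets the latest run
            handle = client.get_workflow_handle(workflow_id)
            await handle.terminate(reason="non-deterministic error, restarting")
            # Start a brand-new run by workflow type name, reusing the ID
            await client.start_workflow(
                workflow_type,
                args=make_args(workflow_id),
                id=workflow_id,
                task_queue=task_queue,
            )
            restarted.append(workflow_id)
        except Exception as err:
            # Keep going so one bad workflow doesn't stop the whole batch
            failed.append((workflow_id, err))
    return restarted, failed
```

You would call it with the list collected above, for example `await terminate_and_restart(my_client, bad_ids, "MyWorkflow", make_args, "my-task-queue")`, where the task queue name is a placeholder. Note that the new run starts from scratch with an empty history, so this only makes sense if losing the old run's progress is acceptable.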

Thanks a bunch! @Chad_Retz
this will help a lot!

Is there any other way to handle this nicely without operational overhead? In some cases, we aren’t concerned about the internal state of the workflow per se, as the actual state is queried from an external source anyway. Currently, if a developer changes something that leads to non-determinism in existing workflows, we basically have to call reset on all the impacted workflows with a script. Is this the best way to do it?

As I understand it, patching is for when you want to keep compatibility with running workflows, but what’s the best way to handle this if we do not care about that much?

Any tips would be greatly appreciated.