Is it possible to resume from a failed state instead of starting over?

If a workflow failed on activity “foo”, how can I resume the workflow from “foo” instead of from the very beginning?

You can use the workflow reset functionality, which reuses a prefix of the workflow's history events: https://docs.temporal.io/docs/tctl#restart-reset-workflow

Nice, thanks!

In the majority of cases it is better to not fail “foo”, but keep retrying it until it is fixed. Then you don’t need to do any manual operations after the fix is deployed.

Ya, but workflow failures are inevitable, so I was wondering if there is a mechanism to rerun the workflow from the failure point.

Btw, it seems there is no API to do the reset, right?

It is part of the service gRPC API: ResetWorkflowExecution.

Ah ok, I thought it was also in the SDK. Thanks!

Is this correct? Basically, it finds the last WorkflowTaskCompleted event before the ActivityTaskFailed event:

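	// Tracks the most recent WorkflowTaskCompleted event seen so far;
	// stays nil if no workflow task completed before the failure.
	var lastWorkflowTask *history.HistoryEvent
	for _, event := range histories.History.Events {
		// Stop at the first failed activity: the reset point must precede it.
		if event.GetEventType() == enums.EVENT_TYPE_ACTIVITY_TASK_FAILED {
			break
		}
		if event.GetEventType() == enums.EVENT_TYPE_WORKFLOW_TASK_COMPLETED {
			lastWorkflowTask = event
		}
	}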

This specific code works for my workflow, but I'm not sure if this is the right way to do it.

That would work. We also plan to add more policies to reset. Something like “restart from the last failed activity” would do what you want.

But as I said, with an appropriate retry policy you wouldn’t need a reset unless you have a bug in the workflow code.

Ya, but in the real world we cannot retry forever. Some downstream failures will cause the workflow to fail, so it is definitely possible that we will have to manually reset some failed workflows.

Temporal doesn’t impose a limit on the duration of retries, so most real-world users retry long enough for the downstream system to be fixed.
I understand that it is not something you are used to as messaging systems like Kafka have a hard time doing retries for a long time.

One more question:
If “restart from the last failed activity” is supported, what would happen when one of several parallel activities fails?

With the current Temporal implementation, I guess both parallel activities will be re-executed, since we reset to the WorkflowTaskCompleted+1 event.

Yes, the current implementation of reset would reexecute both activities. The reason is that reset creates a new run of a workflow and activities are not transferable between runs. We realize that this is a serious limitation and plan to address it in the future.
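The parallel case can be illustrated with a small, self-contained model. This is only a sketch: `Event`, `EventType`, `findResetPoint`, `rescheduledActivities`, and the abridged history in `sampleHistory` are all hypothetical stand-ins rather than Temporal API types, and it assumes that a reset discards every event from the chosen WorkflowTaskCompleted event onward.

```go
package main

import "fmt"

// EventType and Event are simplified, hypothetical stand-ins for history events.
type EventType int

const (
	Other EventType = iota
	WorkflowTaskCompleted
	ActivityTaskScheduled
	ActivityTaskFailed
)

type Event struct {
	ID       int64     // event IDs start at 1
	Type     EventType
	Activity string // set only on ActivityTaskScheduled events
}

// findResetPoint returns the ID of the last WorkflowTaskCompleted event
// before the first ActivityTaskFailed event. ok is false when there is no
// failed activity or no completed workflow task precedes it.
func findResetPoint(events []Event) (id int64, ok bool) {
	var last int64
	for _, e := range events {
		switch e.Type {
		case ActivityTaskFailed:
			return last, last != 0
		case WorkflowTaskCompleted:
			last = e.ID
		}
	}
	return 0, false
}

// rescheduledActivities lists activities scheduled at or after the reset
// point. Under this sketch's assumption they are dropped from the new run's
// history and therefore run again, even if they had already completed.
func rescheduledActivities(events []Event, resetID int64) []string {
	var names []string
	for _, e := range events {
		if e.Type == ActivityTaskScheduled && e.ID >= resetID {
			names = append(names, e.Activity)
		}
	}
	return names
}

// sampleHistory is an abridged, made-up history: activities A and B are
// scheduled by the same workflow task; A completes, B fails.
func sampleHistory() []Event {
	return []Event{
		{ID: 1, Type: Other}, // WorkflowExecutionStarted
		{ID: 2, Type: Other}, // WorkflowTaskScheduled
		{ID: 3, Type: Other}, // WorkflowTaskStarted
		{ID: 4, Type: WorkflowTaskCompleted},
		{ID: 5, Type: ActivityTaskScheduled, Activity: "A"},
		{ID: 6, Type: ActivityTaskScheduled, Activity: "B"},
		{ID: 7, Type: Other}, // ActivityTaskStarted (A)
		{ID: 8, Type: Other}, // ActivityTaskCompleted (A)
		{ID: 9, Type: Other}, // WorkflowTaskScheduled
		{ID: 10, Type: Other}, // ActivityTaskStarted (B)
		{ID: 11, Type: ActivityTaskFailed},
	}
}

func main() {
	h := sampleHistory()
	id, ok := findResetPoint(h)
	fmt.Println("reset point:", id, "found:", ok)
	fmt.Println("re-executed:", rescheduledActivities(h, id))
}
```

In this model the reset point lands before both ActivityTaskScheduled events, so activity A is listed for re-execution even though it had already completed, which matches the limitation described above.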

I understand that the reset API is designed to be flexible enough to reset to any event, but an option to automatically reset to the failed activity (with Temporal finding the reset point) would be nice :)
