How to reset a failed "activity" gracefully

I have a big workflow that runs for hours. Some activities might fail, exhaust their retryPolicy, and cause the workflow run to fail.

When the run fails, I need to reset the workflow so that it continues from the failed activity. That is easy if the workflow’s activities run serially: just find the failed activity, get its “WorkflowTaskCompletedEventId”, and reset the workflow to it. But things get weird in a workflow with parallel activities.

A parallel flow looks like this:

func SimpleParallelFlow(ctx workflow.Context) error {
	// ctx prepare

	futures := make([]workflow.Future, 0, 2)
	for index := 0; index < 2; index++ {
		future, settable := workflow.NewFuture(ctx)
		futures = append(futures, future)

		workflow.Go(ctx, func(ctx workflow.Context) {
			err := workflow.ExecuteActivity(ctx, PreCheck).Get(ctx, nil)
			if err != nil {
				settable.SetError(err)
				return
			}

			err = workflow.ExecuteActivity(ctx, DoSth).Get(ctx, nil)
			if err != nil {
				settable.SetError(err)
				return
			}

			// Mark the branch as done so that future.Get does not block forever.
			settable.Set(nil, nil)
		})
	}

	// handle future

	return nil
}

It is simple: two parallel branches, “preCheck-1” -> “doSth-1” and “preCheck-2” -> “doSth-2”.

The problem is this: when activity “preCheck-1” fails while activities “preCheck-2” and “doSth-2” succeed, I need to reset the workflow run to rerun “preCheck-1”. However, I found that “preCheck-1” and “preCheck-2” have the same “WorkflowTaskCompletedEventId”, which means a reset reruns not only “preCheck-1” but also “preCheck-2”. That’s not what I want.

Is there a graceful way to solve this, i.e. reset the workflow so that only “preCheck-1” is rerun and nothing happens to the succeeded activity “preCheck-2”?

In your case, does PreCheckX have to complete before DoSthX?

I think you could run each PreCheckX->DoSthX sequence as a parallel branch, see sample here. Each branch could retry the precheck->doX sequence if you need, sample here.

I found that “preCheck-1” and “preCheck-2” have the same “WorkflowTaskCompletedEventId”

Yes, I believe this is an optimization Temporal does to keep the workflow history smaller; see this post for another example.

With reset and async invocations, I think it would be pretty difficult to rely on it for your use case. I would consider trying out the branch approach and dealing with retries in workflow code rather than falling back on reset.

I would consider trying out the branch approach and dealing with retries in workflow code rather than falling back on reset.

In this case, “preCheck” and “doX” rely on external services, which are unreliable. They might be down for hours or days, so resetting the workflow to rerun the failed activity is necessary. Even with the branch approach, it is foreseeable that some cases will exhaust the retryPolicy.

How can I achieve this? If “reset” can’t handle it, is there any other way?

They might be down for hours or days, so resetting the workflow to rerun the failed activity is necessary.

Not sure it’s necessary. You don’t have to set a ScheduleToClose timeout in your ActivityOptions (or a maximum number of attempts if you specify a retry policy). This would allow your activity to keep retrying for as long as you need.