Resetting workflow that requires a previously completed step to re-run

Hello, we have a workflow that we’d like to reset to a prior point if it fails to complete. However, this workflow depends on a step that runs earlier in the workflow - and since it’s marked complete, Temporal correctly skips it in the reset workflow. We would like to ensure this step always runs, regardless of whether it previously completed successfully. Is there a way to make the step always run when a workflow restarts (or, preferably, only on a reset)? We are able to handle the cleanup appropriately.

Thanks

If you have an expected failure (something you know might happen that would cause the workflow to fail to complete), you should build handling for it into the workflow execution.

At the point where it might fail, handle the failure by pausing the workflow until it receives a signal to continue. The workflow execution logic can then itself return to the previous step that needs to be run again.

Thanks for the reply. In our case, these are not expected failures.

The actual situation is this. At the beginning of a workflow, we use a step to trigger some tenant-specific worker setup code, so that the workers can take care of some tenant-specific activities. Think of it as K8s commands that set up some worker pods. When a workflow fails, we clean up those workers, as they would otherwise sit idle.

In this setup, if a workflow fails after the initial worker setup code, we can’t use the workflow reset functionality - since we’re only resetting the workflow to a middle step (not all the way back to the initial setup step).

If there were a way to reset the workflow so that particular steps always executed, regardless of previous execution or completion, we could reset the failed workflow and pick up the work from the point of failure, since the workers would be started again at the beginning of the new run created by the reset.

Thanks!

What causes the workflow to fail to complete?

e.g.,

  • The workflow code throws an exception.
  • The workflow calls an activity which fails, and the workflow chooses not to catch that error.
  • The workflow is terminated.
  • The workflow is cancelled, and the workflow chooses not to catch that exception.

When the workflow fails, is there anything you do before resetting the workflow? Are you… fixing a bug in the workflow code? Fixing some input? Fixing some facility or service that the workflow uses?

It’s hard for me to tell without better understanding what you’re trying to do, but one thought is to use two workflows: have a parent workflow that runs the setup code and then runs the child workflow. If the child workflow fails, the parent workflow can alert you and wait for a signal before running the setup code and the child workflow again. That way you can change, reset, fix, etc. the child workflow without losing the state in the parent workflow that the setup code needs to be run each time.
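
As a rough sketch of that parent/child shape, here is a model in plain Python asyncio. It is not the Temporal SDK: in a real workflow the queue would be a signal, `setup_workers` would be an activity, and `child_workflow` would be started as a child workflow. All names here (`setup_workers`, `child_workflow`, `parent_workflow`, the `valid` config key) are made up for illustration.

```python
import asyncio


async def setup_workers(tenant, events):
    # Hypothetical stand-in for the tenant-specific worker setup step.
    events.append(f"setup:{tenant}")


async def child_workflow(config):
    # Hypothetical child: fails until it is given a valid configuration.
    if not config.get("valid"):
        raise ValueError("invalid configuration")
    return "processed"


async def parent_workflow(tenant, config, signals, events):
    """Run setup, then the child; on failure, alert and wait for a fix.

    `signals` is an asyncio.Queue standing in for a workflow signal that
    carries a corrected configuration.
    """
    while True:
        # Setup runs before every attempt, not just the first one.
        await setup_workers(tenant, events)
        try:
            result = await child_workflow(config)
            events.append("child:ok")
            return result
        except Exception as exc:
            events.append(f"alert:{exc}")
            # Block until someone sends a corrected configuration.
            config = await signals.get()


async def demo_parent():
    events = []
    signals = asyncio.Queue()
    # The operator's fix, queued ahead of time to keep the demo deterministic.
    await signals.put({"valid": True})
    result = await parent_workflow("tenant-a", {"valid": False}, signals, events)
    return result, events
```

Calling `asyncio.run(demo_parent())` runs setup twice: once before the failing attempt and once more after the corrected configuration arrives. The parent never loses track of the fact that setup has to precede each child run.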

Thanks for the response. The reasons we would typically attempt a reset are:

  • configuration issues (results in an activity failure)
  • transient issue (timeout type issue)
  • code issue in activity

In some of these cases we’d like to ask the customer to fix something, or fix something on our end, and then reset - to avoid a costly re-run.

The multi-workflow approach doesn’t seem that straightforward for us. We would have launched a bunch of resources to execute the workflow, and they would remain up until someone could respond to the failure. Our use case is special failure scenarios: we would like a way to reset where some activity executes each time, regardless of whether it was already done. We can handle idempotency within the step. It’s a little bit like the ‘taint’ functionality in Terraform: by tainting something, you’re able to refresh that particular node.

Personally, I would treat a reset as an emergency response to an unexpected situation.

But, if I understand you correctly, you have activities that sometimes require manual intervention, and you want to have a process to handle that occurrence. This isn’t unexpected; it’s part of the requirements that you’re attempting to design for.

My suggestion is that you build the process you need into the workflow. Don’t reset the workflow; instead, implement the workflow to handle the failure that you expect may sometimes happen. This is what workflows are for: to execute the steps of the process you need to implement. In particular, the workflow can encode the knowledge of when the setup step needs to be run.

If an activity needs manual intervention, the workflow can let you know and then wait for a signal; when you’ve resolved the issue with the activity you can signal the workflow to run the setup and the activity again.
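
To make the control flow concrete, here is a minimal sketch in plain Python asyncio. It only models the idea and does not use the Temporal SDK: the `asyncio.Event` stands in for a signal, and `_setup_workers`, `_cleanup_workers`, and the `work` callable stand in for activities. The class and method names are hypothetical, and cleaning up the workers after a successful attempt as well as a failed one is my assumption.

```python
import asyncio


class TenantJob:
    """Models a workflow that re-runs setup before every attempt."""

    def __init__(self, work):
        self._work = work  # hypothetical stand-in for the activity that may fail
        self._retry = asyncio.Event()  # stands in for a "retry" signal
        self.log = []

    def signal_retry(self):
        # Called by an operator once the underlying issue has been fixed.
        self._retry.set()

    async def _setup_workers(self):
        self.log.append("setup")

    async def _cleanup_workers(self):
        self.log.append("cleanup")

    async def run(self):
        attempt = 0
        while True:
            attempt += 1
            await self._setup_workers()
            try:
                result = await self._work(attempt)
                self.log.append("done")
                return result
            except Exception as exc:
                self.log.append(f"failed: {exc}")
            finally:
                # Workers are torn down after each attempt, so nothing
                # sits idle while waiting for manual intervention.
                await self._cleanup_workers()
            self.log.append("waiting")
            await self._retry.wait()
            self._retry.clear()


async def flaky_work(attempt):
    # Hypothetical activity: fails on the first attempt only.
    if attempt == 1:
        raise RuntimeError("bad config")
    return f"ok on attempt {attempt}"


async def demo():
    job = TenantJob(flaky_work)
    task = asyncio.create_task(job.run())
    # Let the first attempt fail and reach the waiting state.
    while "waiting" not in job.log:
        await asyncio.sleep(0)
    job.signal_retry()
    return await task, job.log
```

Because the workers are cleaned up before the wait, no resources stay up while someone responds to the failure, and the setup step runs again as part of the normal loop rather than through a reset.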

Will think through this more. Thank you, Andrew!
