What to do when an activity cannot proceed without re-running previously completed activities?

Suppose I have a file-processing workflow with three activites A → B → C. Each activity writes their results to the disk and pass file paths to the next activity. However, suppose the worker crashes during activity C, and now the files from activity B is gone. Temporal will retry activity C, but it cannot proceed without the files. What now?

I’m considering the following options:

  1. Fail the entire workflow by throwing a non-retryable exception in activity C, then enable retry on the workflow. But this forces retries even when activities are timing out.
  2. Trigger a reset. This also restarts the workflow from the start, but the python sdk doesn’t have good support for it.

Have your workflow handle the logic of running B again if C doesn’t have the files it needs.

Resetting a workflow is for when a bug in your workflow code is preventing the workflow from running. It isn’t something you want to use when the workflow is running correctly because you lose the state of the workflow.

Having Temporal do the activity retry login for you is a convenience: you always have the option to implement the retry logic in your workflow if you wanted to.

If the retry logic you need to implement is more complicated than what Temporal handles for you, you can implement it in the workflow instead.

Have your workflow handle the logic of running B again if C doesn’t have the files it needs.

This gets more difficult as the workflow gets more complex. In actual usage, my workflows tend to look like DAGs – activity A depends on B, which depends on C and D, also C depends on E, etc. Re-running an activity means re-running all of its transitive dependencies in the correct order.

I could indeed implement all the retry logic by myself, but I was hoping Temporal could help me out somewhat.

I would run A->B->C in a retry loop.

1 Like

Well, in Temporal ideally activities would be idempotent. If activities saved their output to durable storage (S3, persistent disk, etc), then once for example “B” had run it would be done: C could run multiple times and still have the data from B each time. If you’re not able to do that you’re probably going to have to orchestrate something yourself, would be my guess.

Just a thought, perhaps each activity could return which files it was missing when it wasn’t able to run. You’d run “C” first, it would report back “I don’t have files from B”; so you’d run B next (and it would say “I don’t have files from A”, etc)… after you had run the dependencies, you’d run C again, and maybe the worker had crashed at some point and didn’t have the data, and C would report again “I need files from B”.

This way you’d handle the DAG of dependencies but it would be driven by the activities instead of having to keep track yourself which files might have been lost if a worker crashes.

1 Like