Best Technique for Restarting Entire Workflow in Python SDK?

bxbrenden · August 11, 2023, 10:41pm

Hi Temporal Community,

I’m new to Temporal, and I’m having trouble figuring out the best way to restart an entire Workflow from the beginning using the Python SDK.

TL;DR

How do I restart a whole Workflow in the Python SDK under certain circumstances that are not related to intermittence?

Background

I have a Workflow that currently contains two Activities:

1. Make an HTTP request to a REST API to kick off a long-running job. The API returns a job ID that I can use to query the status.
2. Periodically query the API for the status of the job.

In step 2, the job’s status can be one of: ["pending", "running", "successful", "failed"].

Temporal-izing It

If I were not using Temporal, I would accomplish step 2 by repeatedly polling the API for the job’s status using exponential backoff and retries via the backoff package from PyPI. But, one of the promises of Temporal is (optionally) unlimited retries in the face of intermittence.

So, instead of using my own retry logic, I’ve written the Activity so that it hits the API and raises an Exception unless the job’s status is either “successful” or “failed”, because “pending” and “running” mean the job is not done yet. This technique works, and it’s really nice! The worker will keep retrying my Activity as it transitions from “pending” to “running” to “successful”. Every failure along the way triggers a retry for the Activity as expected. Once it’s done, the Workflow is marked as completed.
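A minimal sketch of the "raise unless the job is finished" technique described above. It is kept free of SDK imports so it runs on its own: `fetch_status`, `JobNotFinished`, and the job ID are hypothetical names, and the `while` loop only stands in for the retries Temporal would schedule. In a real worker the function would be registered as an activity (the Python SDK uses the `@activity.defn` decorator) and there would be no loop at all.

```python
from typing import Callable

TERMINAL_STATUSES = {"successful", "failed"}


class JobNotFinished(Exception):
    """Raised while the remote job is still pending or running."""


def check_job_status(job_id: str, fetch_status: Callable[[str], str]) -> str:
    """Return the terminal status, or raise so the caller retries later.

    fetch_status is a hypothetical stand-in for the real HTTP call.
    """
    status = fetch_status(job_id)
    if status not in TERMINAL_STATUSES:
        # "pending" and "running" are not errors, but raising here is what
        # makes the activity's retry policy schedule another poll.
        raise JobNotFinished(f"job {job_id} is still {status}")
    return status


# Stand-in for the REST API: each call reports the next status in sequence.
responses = iter(["pending", "running", "successful"])

attempts = 0
while True:  # simulates Temporal retrying the activity
    attempts += 1
    try:
        final_status = check_job_status("job-123", lambda _id: next(responses))
        break
    except JobNotFinished:
        continue  # Temporal would wait for the backoff interval here
```

Note that "failed" is returned rather than raised, so the caller (the Workflow) gets to decide what a failed job means.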

But, my problem is the “failed” case. If the job on the remote service fails, then both Temporal Activities are done, since the job has been launched in step 1 and no more polling is needed for step 2. But, my overall Workflow should be considered a failure because the job that launched in step 1 ended up failing for reasons outside of my control. For the system in question, that means I need to try the whole Workflow over, repeating both Activities to launch a new job via the same API and check its status.

Alternatives I’ve Tried

My first thought was that I could package both steps 1 and 2 into a single Activity. After all, they’re kind of atomic in the sense that if one or the other fails, the whole thing should be retried from the start. My problem with that approach is that I would lose my automatic polling in step 2. Rather than letting Temporal repeatedly query the API until a desired state is reached, I would have to program that polling myself with some kind of exponential backoff, which defeats the purpose of using Temporal in the first place.

So, my next thought was that I could retry a whole Workflow rather than just a single Activity.

Workflow Retry Best Practices?

Based on my search of this forum and the Temporal docs for the Python SDK, there seem to be a couple of options for restarting Workflows:

Custom Retry Policies can be used to retry whole Workflows, but the docs suggest that this should be used only rarely. Plus, the Retry Options don’t seem very configurable (e.g. can’t specify from which point to start over).
This post talks about using a “Reset” function of some sort, but the answer only contains examples for the TypeScript, Go, and Java SDKs, and I can’t find anything about a Reset in the Python SDK docs.

Questions

1. Is my use of Activities idiomatic? For example, when I raise an Exception instead of using some kind of homegrown while loop for repeatedly polling an API, is that an intended use case for Temporal?
2. Should my two steps be squeezed into one Activity? If so, what happens to my nice, automatic retry logic in step 2?
3. Are there any alternative approaches you can recommend for this situation (using the Python SDK specifically)?

Thanks, and sorry for the long post!

maxim · August 12, 2023, 3:58pm

1. Yes, this is idiomatic. We plan to let an activity reschedule its next attempt without raising an exception, which will make this nicer. But for now, your approach is fine. An alternative for frequent polls is to implement the polling loop directly inside the activity. Such an activity would need to heartbeat so that it can be restarted if the worker fails.
2. No. Two steps with such different timeout and retry characteristics don’t belong in a single activity.
3. In this case, failing the whole workflow and relying on its retry options is fine. Another approach is a loop in the workflow that retries these two activities. Such a workflow would need to call continue-as-new after a few hundred iterations of the loop.
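The in-activity polling loop mentioned in the first answer can be sketched as below. This is an illustration under assumptions, not SDK code: `fetch_status`, `heartbeat`, and `sleep` are injected so the example runs standalone. Inside a real Python SDK activity, `heartbeat` would be `activity.heartbeat`, and the activity would be scheduled with a `heartbeat_timeout` so the server notices when a worker stops reporting.

```python
import time
from typing import Callable, List


def poll_until_done(
    job_id: str,
    fetch_status: Callable[[str], str],
    heartbeat: Callable[[str], None],
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status, heartbeating each pass."""
    while True:
        status = fetch_status(job_id)
        # Report liveness on every iteration; a missed heartbeat is how a
        # crashed worker would be detected.
        heartbeat(status)
        if status in ("successful", "failed"):
            return status
        sleep(interval_seconds)


# Stand-ins for the REST API and for the heartbeat call.
statuses = iter(["pending", "running", "running", "failed"])
beats: List[str] = []

result = poll_until_done(
    "job-123",
    fetch_status=lambda _id: next(statuses),
    heartbeat=beats.append,
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
```

With this shape the activity completes once per job instead of once per poll, which keeps the workflow history short when polls are frequent.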

bxbrenden · August 21, 2023, 3:53pm

Thank you for the prompt reply. I was able to create a workflow with two activities and a retry policy. Then, in the event the second activity failed in a certain way, I raised an exception that would cause the whole workflow to start over.

Then, I took that workflow and made it a child workflow of a larger, main workflow. All of this fit together nicely and solved my issue. Thanks again!
