Workflow Retry policy seems to not be getting respected

Hey,

I looked to see whether this issue has come up before, but I’m not seeing it. We are trying to model what a Workflow failure would look like (understanding that this should be a rare occurrence), and essentially have a code block within a workflow that looks like this:

random_number_between_0_and_100 = await workflow.execute_activity_method(
    ActivityClass.generate_random_number,
    start_to_close_timeout=timedelta(seconds=5),
)
if random_number_between_0_and_100 > 50:
    raise ValueError("This step failed")

When I run this workflow with

await client.execute_workflow(
    ...,
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=1),
        backoff_coefficient=1,
        maximum_attempts=0,
    ),
)

I get this kind of strange behavior in the workflow that looks like this screenshot:

This is being run in the WorkflowEnvironment.start_local environment for Python (maybe this would run differently in a different environment).

The failed workflow task retries within 0.04 seconds, not respecting the retry policy initial_interval, and doesn’t retry again.

I’ve seen some documentation that says that Workflows can take retry policies that should apply for the workflow execution, while some posts on this forum have (perhaps erroneously) wondered whether the retry policy passed in client.execute_workflow just acts as the RetryPolicy for all of the activity executions.

I would like to know if it’s possible to have a workflow retry (or not retry) from a failed state (without necessarily having the failure be a non-retryable error). It seems at least in my local usage, I can’t get the workflow to retry more than once and it does not respect my RetryPolicy.

ValueError is not a workflow failure; it is a workflow suspension (i.e., a workflow task failure that is retried continually while waiting for a code fix). To fail the workflow itself, you’ll want to raise temporalio.exceptions.ApplicationError. See this section of the README.

Hey thanks for this.

I will try this again with an ApplicationError.

In the meantime, it doesn’t appear that the workflow is rerunning from the failed (suspended) workflow task; it looks like it’s just hanging. Is there some way to get it to rerun? You also said it’s waiting for a code change. What would that imply? Do you mean that it’s just waiting for us to update the workflow and do workflow replay, or is there some other mechanism I don’t know about?

Thanks

It is meant to hang. The exception is considered a “code bug”, so the workflow waits for a deployment of code that doesn’t throw this exception, which fixes it automatically.

Simply restarting the worker with the code that doesn’t raise that exception will automatically allow it to continue on.

Great, I think I understand! Thanks! I was able to get the workflow to respect the RetryPolicy when I used ApplicationError.