Workflow Retry policy seems to not be getting respected

ccuevas · February 29, 2024, 7:28pm

Hey,

Tried looking to see if this issue has come up before but I’m not seeing it. We are trying to model what a Workflow failure would look like (understanding that this should be a rare occurrence), and essentially have a code block within a workflow that looks like

random_number_between_0_and_100 = await workflow.execute_activity_method(
    ActivityClass.generate_random_number,
    start_to_close_timeout=timedelta(seconds=5),
)
if random_number_between_0_and_100 > 50:
    raise ValueError("This step failed")

When I run this workflow with

await client.execute_workflow(
    ...,
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=1),
        backoff_coefficient=1,
        maximum_attempts=0,
    ),
)

I get this kind of strange behavior in the workflow that looks like this screenshot:

This is being run in the WorkflowEnvironment.start_local environment for Python (maybe this would run differently in a different environment).

The failed workflow task retries within 0.04 seconds, not respecting the retry policy initial_interval, and doesn’t retry again.

I’ve seen some documentation that says that Workflows can take retry policies that should apply for the workflow execution, while some posts on this forum have (perhaps erroneously) wondered whether the retry policy passed in client.execute_workflow just acts as the RetryPolicy for all of the activity executions.

I would like to know if it’s possible to have a workflow retry (or not retry) from a failed state (without necessarily having the failure be a non-retryable error). It seems at least in my local usage, I can’t get the workflow to retry more than once and it does not respect my RetryPolicy.

Chad_Retz · February 29, 2024, 8:15pm

ValueError is not a workflow failure, it is a workflow suspension (i.e. task failure that continually retries waiting for a code fix). You’ll want to use temporalio.exceptions.ApplicationError. See this section of the README.

ccuevas · February 29, 2024, 8:39pm

Hey thanks for this.

I will try this again with an ApplicationError.

In the meantime, it doesn’t appear that the workflow is rerunning from the ~~failed~~ suspended workflow task; it looks like it’s just hanging. Is there some way to get it to rerun? You also said it’s waiting for a code change. What would that imply? Do you mean that it’s just waiting for us to update the workflow and do workflow replay, or is there some other mechanism I don’t know about?

Thanks

Chad_Retz · February 29, 2024, 9:30pm

It is meant to hang because it is considered a “code bug” expecting a deployment of code that doesn’t throw this exception to automatically fix it.

Simply restarting the worker with the code that doesn’t raise that exception will automatically allow it to continue on.

ccuevas · February 29, 2024, 9:32pm

Great, I think I understand! Thanks! I was able to get the workflow to respect the RetryPolicy when I used ApplicationError.
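
For reference, once the policy from the original post is applied, the wait between workflow retries should stay at a constant 1 second, since `backoff_coefficient=1` never grows the interval. With `maximum_attempts=0`, the number of attempts is unlimited. The sketch below only illustrates that arithmetic, assuming the documented rule that the delay is the initial interval multiplied by the coefficient raised to the number of previous retries, capped at a maximum interval that defaults to 100 times the initial interval. `retry_delay` is a hypothetical helper, not part of the Temporal SDK.

```python
from datetime import timedelta


def retry_delay(attempt, initial_interval, backoff_coefficient, maximum_interval=None):
    """Delay before retry number `attempt` (1 = first retry). Hypothetical helper."""
    if maximum_interval is None:
        # Assumed default cap: 100x the initial interval.
        maximum_interval = initial_interval * 100
    delay = initial_interval * (backoff_coefficient ** (attempt - 1))
    return min(delay, maximum_interval)


# Policy from the original post: 1 second initial interval, coefficient 1.
constant = [retry_delay(n, timedelta(seconds=1), 1).total_seconds() for n in range(1, 5)]

# For contrast, a coefficient of 2 doubles the wait on each retry.
doubling = [retry_delay(n, timedelta(seconds=1), 2).total_seconds() for n in range(1, 5)]

print(constant)
print(doubling)
```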
