Retrying a workflow for a specific error scenario

I want to retry my workflow in only one specific error scenario, for which I am raising a custom error. However, the documentation seems to provide the inverse: if I provide a RetryPolicy, I need to specify the errors for which I do not want to retry the workflow, not the other way around. Is there any way around this apart from encapsulating all my errors and providing them in the NonRetryableErrorTypes list? Also, what if there are Temporal internal errors for which I do not want to retry the workflow but exit it? Is there a list of ErrorTypes that I need to add to the array?

It is a pretty philosophical question. What is the purpose of workflow retry? Workflows don’t fail due to intermittent infrastructure issues, and "intermittent" on the workflow timeline can mean days. So the main reason for workflow failures is unknown bugs in the code. And for some workflows, retrying on unknown bugs makes sense.

Thus the retry policy is designed to retry on any unknown error and to skip retries for a configured list of known errors.

I don’t know your use case, but in general letting workflows fail and retrying them is not a good idea. I know it is not something that comes naturally. We are all wired to retry on error after working with unreliable request/reply services for ages. But workflows should be written in a way that they don’t fail on any known failures. This is usually achieved through unlimited retries of individual activities.

If you really want to retry on a known error, then bake this logic into your workflow code. One approach is to wrap your main workflow function in code that calls continue-as-new if the function returned a specific error.
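
To make the wrapping idea concrete, here is a minimal plain-Java sketch. It does not use the Temporal SDK: `RateLimitedException`, `ContinueAsNew`, and `RetryOnKnownError` are hypothetical names invented for illustration, and `ContinueAsNew.restart` only stands in for whatever continue-as-new call your SDK provides. Only the one known error triggers a fresh run; everything else propagates and fails the workflow.

```java
import java.util.concurrent.Callable;

// Hypothetical custom error that should trigger a fresh run.
class RateLimitedException extends Exception {
    RateLimitedException(String message) {
        super(message);
    }
}

// Stand-in for the SDK's continue-as-new call (an assumption, not a Temporal type).
interface ContinueAsNew {
    void restart(String reason);
}

class RetryOnKnownError {
    // Runs the main workflow body and returns its result on success.
    // On RateLimitedException it requests a fresh run and returns null.
    // Any other exception propagates unchanged.
    static <T> T run(Callable<T> body, ContinueAsNew continueAsNew) throws Exception {
        try {
            return body.call();
        } catch (RateLimitedException e) {
            continueAsNew.restart(e.getMessage());
            return null;
        }
    }
}
```

In a real workflow the continue-as-new call would end the current run, so the `return null` branch exists only to keep this stand-in compilable.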

What you said totally makes sense.

Our use case revolves around triggers for updates in the database. A trigger starts the workflow, which runs an activity that makes third-party API calls and inserts the result back into the database.
We want to make sure that the workflow retries with exponential backoff only if we hit rate limits or fail to write to the database (which is very unlikely), and otherwise exits gracefully (with either a success or a failure that is recorded in the database).
I’ll check out the Continue as New option as well. Thanks for the immediate reply and for sharing your thoughts on this.

I don’t think in your use case you need workflow retries.

I would recommend attaching a retry policy to the DB update activity and retrying it until the DB call goes through.

I would also recommend running this activity on a separate task queue and using task queue rate limiting to ensure that you don’t hit the database at a rate higher than configured.

Hi Maxim,

I should have been clearer: I meant rate limits for the third-party API calls that we are using. The API returns errors that fall into two categories: errors that can be retried (e.g., unexpected error, crawl failed, request timed out) and errors that should not be retried (e.g., bad request). We’d like the workflow/activity to retry based on the error codes returned by the third-party API.

Thanks

If you use the Temporal task queue rate limiting feature, you can ensure that you don’t call the third-party API above the configured limit.

An activity can mark an error (or an Exception in the case of the Java SDK) as non-retryable by looking at the error code returned by the third party.

Thanks! That should solve the problem. I’ll check it out

@maxim Is there any example I can follow for the second part -
“An activity can mark an error (or an Exception in the case of the Java SDK) as non-retryable by looking at the error code returned by the third party.”

@humblefool In Go, return an error created through temporal.NewNonRetryableApplicationError; in Java, throw an exception created through ApplicationFailure.newNonRetryableFailure.
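
As a rough illustration of the classification step, here is a self-contained Java sketch that runs without the Temporal SDK. `NonRetryableFailure` and `ThirdPartyErrorMapper` are hypothetical names, and the status codes are assumed examples rather than anything the third party in this thread actually returns. In a real activity, the non-retryable branch is where you would throw the exception created through ApplicationFailure.newNonRetryableFailure.

```java
import java.util.Set;

// Stand-in for a failure that must not be retried (hypothetical; a real
// activity would use the exception created by the SDK instead).
class NonRetryableFailure extends RuntimeException {
    NonRetryableFailure(String message) {
        super(message);
    }
}

class ThirdPartyErrorMapper {
    // Assumed example codes: client errors that will not succeed on retry.
    private static final Set<Integer> NON_RETRYABLE = Set.of(400, 401, 403, 404, 422);

    static boolean isRetryable(int statusCode) {
        return !NON_RETRYABLE.contains(statusCode);
    }

    // Converts a failed third-party response into the exception the activity throws.
    static RuntimeException toFailure(int statusCode, String body) {
        String message = "third-party call failed with " + statusCode + ": " + body;
        if (isRetryable(statusCode)) {
            // Ordinary exception: the activity's retry policy applies
            // (rate limits, timeouts, unexpected server errors).
            return new RuntimeException(message);
        }
        return new NonRetryableFailure(message);
    }
}
```

The activity would then do something like `throw ThirdPartyErrorMapper.toFailure(status, body);` whenever the call fails, so that only the retryable category goes through the retry policy.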

Hi @maxim,

I have tried throwing “ApplicationFailure.newNonRetryableFailure” from my activity for specific exceptions but I see the activity is still retried as specified in retryOptions. Am I missing something?

I assume that you are using the Java SDK. I modified a sample to throw ApplicationFailure.newNonRetryableFailure and it wasn’t retried.

Could you create a reproduction of your problem and file an issue?

Continuing on the informative discussion on manually retrying here, what about forcing Temporal to do a retry outside previously defined retry policies?

Say I want to short-circuit the current back-off cycle and have a flow retried immediately because we fixed a known infra issue. Is there a way to do that? Or do I have to tell my business owner to wait until Temporal is ready to retry? Ideally I would’ve liked a Retry button next to the Terminate button in the server dashboard.

One way I can think of to achieve what you want is to reset the workflow to the previous completed event (through the APIs).

This is not supported, but we have this feature in our backlog. Currently, I recommend setting RetryOptions.maxRetryInterval to a reasonable value. This allows timely retries even in the case of a prolonged outage.
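
To show why the cap matters, here is a small sketch of a capped exponential backoff schedule in plain Java. It assumes the usual formula (initial interval times coefficient to the power of attempt minus one, limited by the maximum interval); `BackoffSchedule` and its parameters are hypothetical names and not part of the SDK.

```java
import java.time.Duration;

class BackoffSchedule {
    // Delay before retry number `attempt` (1-based):
    // initial * coefficient^(attempt - 1), capped at max.
    static Duration delayBeforeRetry(int attempt, Duration initial, double coefficient, Duration max) {
        double millis = initial.toMillis() * Math.pow(coefficient, attempt - 1);
        if (millis >= max.toMillis()) {
            return max;
        }
        return Duration.ofMillis((long) millis);
    }
}
```

With a 1 second initial interval and a coefficient of 2, the 20th retry would wait roughly six days without a cap; with a 60 second cap, no retry waits more than a minute, so a fixed outage is picked up quickly.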

Hi maxim, do you mean that

ResetWorkflowExecutionRequest reset = ResetWorkflowExecutionRequest.newBuilder()
        .setNamespace(nsconfig.getDefaultNamespace())
        .setWorkflowExecution(exectuionInfo.getExecution())
        .setWorkflowTaskFinishEventId(eventId)
        .setReason(reason)
        .build();

ResetWorkflowExecutionResponse response = workflowClientFactory
        .resetWorkflow(nsconfig.getDefaultNamespace(), reset);

won’t work? The eventId can be the last completed event, so that one can force a manual trigger from a specific step onwards.

@madhu I don’t understand the last question. resetWorkflow works given that eventId is correct.

Yes, my suggestion is: can’t we use resetWorkflow (with the last completed event ID) to achieve what @Benny_Bottema is looking for (i.e., continue/retry from a specific step)?

Yes, it is possible to use reset to retry a workflow from a specific point.

I don’t recommend using reset unless you are trying to work around some bug. In expected failure situations, activity retries should be used to ensure that the workflow never fails.

Hi @maxim ,

We have a workflow where we have implemented retrial policies for all the activities. We have a use case where we have to retry on customer request outside the retrial policies. What do you suggest would be the best way to accomplish this?

Would you elaborate on what “retry on customer request outside the retrial policies” means?