Is having retries for both a Workflow and its activity redundant?

Questions

  1. I have an activity that makes a network call and a workflow that contains this activity. If both the workflow and the activity have a maximum of 20 retries, does that result in a total of 20 x 20 network calls? That is, 20 activity failures constitute 1 workflow failure, so 20 workflow failures should result in 20 x 20 = 400 network calls?

  2. I have 2 activities that make 2 different network calls and a workflow that contains these two activities. Both the activities and the workflow have a maximum of 20 retries. In the scenario where the first activity is successful and the second one keeps failing, the second activity gives up after 20 retries. When the workflow then runs for the 2nd time, will the first activity, which was successful, be executed again? Or will only the failed activity be executed?

  3. In what situations should we have retries for both workflow and activity?

1, 2: On each workflow retry, the workflow is executed from the beginning, which means activities that already completed are re-executed on the new run. Generally you do not want to fail your workflows on “intermittent” errors, for example activity errors that could be fixed via a new deployment or code changes.
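To make the arithmetic from questions 1 and 2 concrete, here is a small plain-Java simulation. It does not use the Temporal SDK, and the class and method names are made up for illustration. It assumes the first activity always succeeds on its first attempt, every attempt of the second activity fails, and both limits are 20 attempts in total.

```java
// Plain-Java simulation of nested retries; this is NOT Temporal SDK code.
// All names here are hypothetical and exist only for illustration.
class NestedRetrySimulation {
    int firstActivityCalls = 0;
    int secondActivityCalls = 0;

    // Simulates one workflow run: activity 1 succeeds once, activity 2 fails
    // on every attempt. Returns false because the run fails once activity 2
    // has used up its attempts.
    boolean runWorkflowOnce(int maxActivityAttempts) {
        firstActivityCalls++; // first activity succeeds on its first attempt
        for (int attempt = 1; attempt <= maxActivityAttempts; attempt++) {
            secondActivityCalls++; // one failing network call per attempt
        }
        return false;
    }

    // Retries the whole workflow from the beginning until it succeeds
    // or the workflow attempts run out.
    void runWithWorkflowRetries(int maxWorkflowAttempts, int maxActivityAttempts) {
        for (int run = 1; run <= maxWorkflowAttempts; run++) {
            if (runWorkflowOnce(maxActivityAttempts)) {
                return;
            }
        }
    }

    public static void main(String[] args) {
        NestedRetrySimulation sim = new NestedRetrySimulation();
        sim.runWithWorkflowRetries(20, 20);
        System.out.println("first activity calls:  " + sim.firstActivityCalls);
        System.out.println("second activity calls: " + sim.secondActivityCalls);
    }
}
```

Under these assumptions the second activity makes 20 x 20 = 400 calls, and the first activity is also called again on every run, 20 times in total, even though it never failed.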

3: Generally, failing a workflow execution is not the best idea. One use case for workflow retries is when you want to retry a workflow on unknown failures. You should, however, make sure where possible that your workflows do not fail on known errors. If you do want to retry on a known failure, you should still handle that error in your workflow and keep control over how the retry happens (for example, by calling continueAsNew explicitly).

Tihomir,

I’m finding it a bit hard to understand workflow retries.

Please check the attached image.

I’m throwing an exception to fail the child workflow.

The child workflow fails and retries.

Below are my child workflow options:

ChildWorkflowOptions workflowOptions =
    ChildWorkflowOptions.newBuilder()
        .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(3).build())
        .build();
GreetingChild child = Workflow.newChildWorkflowStub(GreetingChild.class, workflowOptions);

Questions

  1. Even though I have the max attempts set to 3, the child workflow seems to be retrying more than that. I added a log statement just before I throw the exception, and this log appears more than 3 times.

  2. After the child workflow fails, based on your comments, I was expecting a new workflow with a different run ID to be created for the next retry. But it looks like nothing like that happened.

Unknown exceptions thrown in your child workflow code are converted to ApplicationFailure only if their types are set in setFailWorkflowExceptionTypes. Otherwise the workflow task fails and your workflow gets blocked: the exception is treated as a bug that can be fixed in a future deployment, which is what you are seeing in your test.

If you want to throw checked exceptions not in that list, you can use:

ApplicationFailure.newFailure
or
ApplicationFailure.newNonRetryableFailure

So in the HelloChild sample you are working from, you can do this in your child workflow:

throw ApplicationFailure.newFailure("Failing child workflow", "simulated");

(note you do not need that @SneakyThrows Lombok annotation)

and in your workflow you could do:

try {
  return child.composeGreeting("Hello", name);
  // return Async.function(child::composeGreeting, "Hello", name).get();
} catch (ChildWorkflowFailure failure) {
  Workflow.getLogger(GreetingWorkflowImpl.class)
      .error(failure.getCause().getMessage());
  ...
}

This should honor your ChildWorkflowOptions->RetryOptions and you should see the child workflow being retried 3 times before you catch the error in your workflow.

Child workflow invocations always throw ChildWorkflowFailure and you can get the original failure in the cause.

Also note that workflow retries only apply if you define RetryOptions in your WorkflowOptions.
So if, in the main method of the HelloChild sample, you define:

GreetingWorkflow workflow =
        client.newWorkflowStub(
            GreetingWorkflow.class,
            WorkflowOptions.newBuilder()
                .setRetryOptions(RetryOptions.newBuilder().setMaximumAttempts(2).build())
                .setWorkflowId(WORKFLOW_ID)
                .setTaskQueue(TASK_QUEUE)
                .build());

and don’t catch the ChildWorkflowFailure in your parent workflow code, you should see your workflow being retried (and have the ContinuedAsNew status, as you pointed out).

Got it.
Thank you, Tihomir.

Could you please share a snippet on using setFailWorkflowExceptionTypes? I want to know which object I need to call this method on.

WorkflowImplementationOptions implementationOptions =
    WorkflowImplementationOptions.newBuilder()
        .setFailWorkflowExceptionTypes(Throwable.class)
        .build();

....

worker.registerWorkflowImplementationType(implementationOptions, <WorkflowClass>.class);

This is, for example, what you would use if you want to fail the workflow on any exception for the particular WorkflowClass.

Hi Tihomir,
What do you mean when you say workflow gets blocked?

Will the retry happen infinitely?
How do I control the retries?

Will an activity failure exception produce the same effect?

I want to know which default Temporal exceptions produce this behavior.

What do you mean when you say workflow gets blocked?
Will the retry happen infinitely?
How do I control the retries?

The workflow task is retried until your defined workflow run/execution timeout is reached, at which point the workflow times out.

Will an activity failure exception produce the same effect?

Do you mean if an activity inside the child workflow fails? Yes. If it fails, then after the activity has exhausted its retries, the error is converted to an ActivityFailure. Since this error happened inside a child workflow, it can be caught in your parent workflow code as a ChildWorkflowFailure, which includes the ActivityFailure as its cause, which in turn includes the original exception (let’s say an NPE) as its cause.
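The nesting described above can be unwrapped by walking getCause(). The following is a minimal plain-Java sketch: the two Fake... classes are hypothetical stand-ins for the SDK failure types, not the real ChildWorkflowFailure and ActivityFailure, and in the real SDK the innermost element may be an ApplicationFailure that carries the original exception's type and message rather than the original exception object itself.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of unwrapping a nested failure; NOT Temporal SDK code.
class CauseChainDemo {
    // Hypothetical stand-ins for the SDK failure types, for illustration only.
    static class FakeChildWorkflowFailure extends RuntimeException {
        FakeChildWorkflowFailure(Throwable cause) {
            super("child workflow failed", cause);
        }
    }

    static class FakeActivityFailure extends RuntimeException {
        FakeActivityFailure(Throwable cause) {
            super("activity failed", cause);
        }
    }

    // Collects the simple class name of each throwable, outermost first.
    static List<String> causeChain(Throwable failure) {
        List<String> names = new ArrayList<>();
        for (Throwable cur = failure; cur != null; cur = cur.getCause()) {
            names.add(cur.getClass().getSimpleName());
        }
        return names;
    }

    // Follows getCause() until the innermost throwable is reached.
    static Throwable rootCause(Throwable failure) {
        Throwable cur = failure;
        while (cur.getCause() != null) {
            cur = cur.getCause();
        }
        return cur;
    }

    public static void main(String[] args) {
        Throwable failure =
            new FakeChildWorkflowFailure(
                new FakeActivityFailure(new NullPointerException("original error")));
        System.out.println(causeChain(failure));
        System.out.println(rootCause(failure).getMessage());
    }
}
```

In parent workflow code, the same loop could be applied to the ChildWorkflowFailure caught in the try/catch shown earlier in this thread, for example to log the innermost message.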

I want to know which default Temporal exceptions produce this behavior.

All classes that extend TemporalFailure:

TimeoutFailure
ActivityFailure
ApplicationFailure
CancelledFailure
ChildWorkflowFailure
ServerFailure
TerminatedFailure

How about frequency? I don’t want it to retry very frequently and use up resources. Is it possible to control that?

How about frequency? I don’t want it to retry very frequently and use up resources. Is it possible to control that?

Currently, the workflow task is retried once every workflow task timeout interval. We do have plans to add exponential retry options for this scenario.
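As a rough back-of-the-envelope estimate of how often that is, the sketch below divides the workflow run timeout by the workflow task timeout. The numbers are assumptions chosen for illustration: a 10-second task timeout (which I believe is the SDK default) and a hypothetical 1-hour run timeout. The class and method names are made up, and this is plain Java rather than SDK code.

```java
import java.time.Duration;

// Rough estimate only; NOT Temporal SDK code. Names are hypothetical.
class TaskRetryEstimate {
    // Approximate number of workflow task attempts if one attempt is made
    // per task timeout interval until the run timeout is reached.
    static long estimateAttempts(Duration runTimeout, Duration taskTimeout) {
        if (taskTimeout.isZero() || taskTimeout.isNegative()) {
            throw new IllegalArgumentException("taskTimeout must be positive");
        }
        return runTimeout.toMillis() / taskTimeout.toMillis();
    }

    public static void main(String[] args) {
        // Assumed values: 1-hour run timeout, 10-second workflow task timeout.
        long attempts = estimateAttempts(Duration.ofHours(1), Duration.ofSeconds(10));
        System.out.println("approximate workflow task attempts: " + attempts);
    }
}
```

With these assumed values the estimate is about 360 attempts before the workflow times out, so a longer workflow task timeout or a shorter run timeout reduces the number of retries.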
