Understanding workflow retries and failures

I’ve read https://docs.temporal.io/docs/learn-workflows#workflow-retries , but I don’t think I fully understand it yet.

Suppose a workflow is executed with all the default options. It executes an activity (which always fails) with the following context:

activityCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
  ScheduleToStartTimeout: time.Hour,
  StartToCloseTimeout:    time.Hour,
  RetryPolicy: &temporal.RetryPolicy{
    InitialInterval:        time.Minute,
    BackoffCoefficient:     1.0,
  },
})

I expect the workflow to be executed once and the activity 60 times (since there are 60 minutes in 1 hour), so the total time before permanent workflow failure would be 1 hour. Is that correct?

What if the workflow now has non-default options (same activity):

client.StartWorkflowOptions{
  RetryPolicy: &temporal.RetryPolicy{
    InitialInterval:    time.Minute,
    BackoffCoefficient: 1.0,
    MaximumAttempts:    100,
  },
}

I expect the workflow to now execute 100 times and the activity 6000 times, so the total time before permanent workflow failure would be 100 hours. Is that correct?

When a workflow is retried, does it start with a clean event history (ContinueAsNew) or does it start with the previous event history?


Let’s start with the activity retries. In Temporal, activity retries are limited by the ScheduleToClose timeout. As this timeout is not specified, it defaults to the workflow run timeout, which in turn defaults to a very large value (around 10 years at this point).

So the expected behavior in your case is the activity being retried practically forever.
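To put rough numbers on this, here is a minimal sketch of the arithmetic. It is a simplified model, not Temporal's actual scheduling logic: it assumes every attempt fails instantly, the retry interval is fixed (BackoffCoefficient 1.0), and a retry only starts if it falls strictly before the ScheduleToClose deadline. The function name countAttempts is made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// countAttempts estimates how many activity attempts fit into a
// ScheduleToClose window under a simplified model (an assumption, not
// Temporal's implementation): attempts fail instantly, the retry interval
// is constant, and an attempt starts only if it is strictly before the deadline.
func countAttempts(interval, scheduleToClose time.Duration) int {
	if interval <= 0 {
		return 0 // guard against an endless loop on a non-positive interval
	}
	attempts := 0
	for elapsed := time.Duration(0); elapsed < scheduleToClose; elapsed += interval {
		attempts++
	}
	return attempts
}

func main() {
	// A 1 hour window with a 1 minute interval.
	fmt.Println(countAttempts(time.Minute, time.Hour)) // 60

	// A 2 hour window (sum of the two timeouts from the question).
	fmt.Println(countAttempts(time.Minute, 2*time.Hour)) // 120

	// A window of roughly 10 years, the approximate default mentioned above.
	tenYears := 10 * 365 * 24 * time.Hour
	fmt.Println(countAttempts(time.Minute, tenYears)) // 5256000
}
```

With the window at roughly 10 years, the count runs into the millions, which is why the activity looks like it retries forever.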

Also, note that Temporal by default provides a RetryPolicy for an activity (but not for a workflow). The default activity retry policy is:

InitialIntervalInSeconds:1,
BackoffCoefficient:2,
MaximumIntervalInSeconds: unlimited,
MaximumAttempts: unlimited,
NonRetryableErrorTypes: none.

This actually looks broken to me, as MaximumInterval should not be unlimited but something like 100x the initial interval. We’ll take care of this ASAP.
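The effect of an uncapped interval can be sketched with a small calculation. This is an illustrative model of exponential backoff (interval = initial * coefficient^(n-1)), not the server's code; retryInterval is a hypothetical helper, and the cap of 100x the initial interval is only the value suggested above.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// retryInterval returns the wait before retry number n (1-based) under a
// simplified exponential backoff model. A maxInterval of 0 means "no cap".
// This is an illustration, not Temporal's implementation.
func retryInterval(initial time.Duration, coefficient float64, maxInterval time.Duration, n int) time.Duration {
	d := float64(initial) * math.Pow(coefficient, float64(n-1))
	if maxInterval > 0 && d > float64(maxInterval) {
		return maxInterval
	}
	if d >= float64(math.MaxInt64) {
		// Avoid overflowing time.Duration when the interval is uncapped.
		return time.Duration(math.MaxInt64)
	}
	return time.Duration(d)
}

func main() {
	initial := time.Second
	for _, n := range []int{1, 7, 8, 20} {
		uncapped := retryInterval(initial, 2, 0, n)
		capped := retryInterval(initial, 2, 100*initial, n)
		fmt.Printf("retry %2d: uncapped=%v capped=%v\n", n, uncapped, capped)
	}
}
```

Under this model the uncapped wait before the 20th retry is already more than six days, while the capped version never exceeds 100 seconds.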

Aah, I didn’t know that. I’m using 0.26.0 and for that version the comment on that field is:

		// ScheduleToCloseTimeout - The end to end timeout for the activity needed.
		// The zero value of this uses default value.
		// Optional: The default value is the sum of ScheduleToStartTimeout and StartToCloseTimeout
		ScheduleToCloseTimeout time.Duration

I thought it would then be set to 2 hours in my example because I set the other two timeouts (which still makes my calculations wrong… because I mentioned ‘1 hour’ instead of 2…).


If an activity reaches the ScheduleToClose timeout, will it permanently fail the workflow if the workflow doesn’t have a retry policy?

And if a workflow has a retry policy, what happens to any activities executed before the failing activity? Will events from the previous workflow attempt still apply to the new workflow attempt? Or will every workflow-level retry start with a clean event log (like ContinueAsNew)?

Yes, we are going to update the comments and fix the default policy values.

If an activity reaches the ScheduleToClose limit, it returns an error (or throws an exception in the Java SDK), and it is up to your code to decide what to do about it. If your code returns that error from the workflow function, the workflow fails. If the workflow has an associated retry policy, it is retried until WorkflowExecutionTimeout (or the maximum number of attempts) is reached. On each retry, the workflow is executed from the beginning, which means that all its logic is re-executed, including activities that already ran.
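A toy model may make the "executed from the beginning" part concrete. It uses only the standard library and does not call the Temporal SDK. The names stepA, stepB, workflowAttempt, and runWithRetries are hypothetical, and the retry loop stands in for a workflow retry policy that only sets a maximum number of attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// runCounts records how many times each hypothetical activity was executed.
type runCounts struct {
	stepA int // an activity that always succeeds
	stepB int // an activity that always fails
}

var errStepB = errors.New("stepB: ScheduleToClose timeout")

// workflowAttempt models a single workflow run: stepA succeeds, stepB fails,
// and the workflow code chooses to return that error, failing the run.
func workflowAttempt(c *runCounts) error {
	c.stepA++
	c.stepB++
	return errStepB
}

// runWithRetries models a workflow-level retry policy limited by maxAttempts.
// Every attempt starts from the top of the workflow function.
func runWithRetries(maxAttempts int, c *runCounts) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = workflowAttempt(c); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	c := &runCounts{}
	err := runWithRetries(3, c)
	fmt.Println(err)              // stepB: ScheduleToClose timeout
	fmt.Println(c.stepA, c.stepB) // 3 3
}
```

Note that stepA runs three times even though it never failed, so activities that may be re-executed this way should be safe to repeat.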

The general advice is to avoid failing workflows on intermittent errors by specifying very long ScheduleToClose activity timeouts. It is not obvious, but in the context of a workflow, an error that requires a new deployment or even a code fix can be treated as intermittent.
