Temporal worker node failure detection

I am trying to understand how Temporal handles failures of worker nodes that run workflows and activities.

I created an activity with an artificial delay of 500ms and executed it 30 times in a workflow. These are the ActivityOptions I used:

ActivityOptions options = ActivityOptions.newBuilder()
        .setScheduleToCloseTimeout(Duration.ofSeconds(5))
        .setTaskQueue(Constants.TASK_QUEUE)
        .build();

Since I used a timeout of 5 seconds for the activity, I thought Temporal would take 5 seconds to detect a node failure and would then automatically rerun the activity on a different node, with no retry configuration required. Please let me know if my understanding is correct.

In my test setup, I started with 2 worker nodes and executed 10 activities. In the middle of the execution, I brought down 1 worker node. I found that 4 workflows completed successfully and 6 workflows failed with the following error. I am not sure what I am missing here; as per my understanding, all the workflows should have completed successfully, since one worker node was still available.

It seems that my memory needs to be refreshed:
You need to specifically configure a retry policy, or the activity will only be tried once.

Above, setScheduleToCloseTimeout is set to 5s, which means that overall the activity will be tried for 5s only.

Try to use the start-to-close timeout instead.
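To make the point about the 5s overall budget concrete, here is a rough sketch in plain Java (no Temporal SDK involved). It counts how many attempts could start inside a schedule-to-close budget if every attempt hangs until its start-to-close timeout fires, as it would on a crashed worker. The class and method names are hypothetical, the 1-second initial interval and 2.0 backoff coefficient are assumed inputs, and the model is deliberately simplified: it ignores queueing delay and any cap on the retry interval.

```java
import java.time.Duration;

// Hypothetical, simplified model; not part of the Temporal SDK.
class AttemptBudget {

    // Counts attempts that can start before the schedule-to-close budget is used up,
    // assuming each attempt runs until its start-to-close timeout and is then retried
    // after an exponentially growing wait.
    static int attemptsThatCanStart(Duration scheduleToClose,
                                    Duration startToClose,
                                    Duration initialInterval,
                                    double backoffCoefficient) {
        long budgetMs = scheduleToClose.toMillis();
        long elapsedMs = 0;
        double waitMs = initialInterval.toMillis();
        int attempts = 0;
        while (elapsedMs < budgetMs) {
            attempts++;
            elapsedMs += startToClose.toMillis(); // the attempt times out
            elapsedMs += (long) waitMs;           // wait before the next attempt
            waitMs *= backoffCoefficient;
        }
        return attempts;
    }

    public static void main(String[] args) {
        Duration fiveSeconds = Duration.ofSeconds(5);
        Duration oneSecond = Duration.ofSeconds(1);

        // Budget equal to the per-attempt timeout: no room for a second attempt.
        System.out.println(attemptsThatCanStart(fiveSeconds, fiveSeconds, oneSecond, 2.0));

        // A larger overall budget leaves room for several attempts.
        System.out.println(attemptsThatCanStart(Duration.ofSeconds(60), fiveSeconds, oneSecond, 2.0));
    }
}
```

Under these assumptions, a 5s overall budget with a 5s per-attempt timeout leaves room for exactly one attempt, which is consistent with the failures seen after the worker was stopped.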

Thanks.

If I understood correctly, this is the configuration I should use:

ActivityOptions options = ActivityOptions.newBuilder()
        .setStartToCloseTimeout(Duration.ofSeconds(5))
        .setTaskQueue(Constants.TASK_QUEUE)
        .build();

Also, just to confirm my understanding: RetryOptions has nothing to do with node failure detection or with rescheduling activities on a different node. It only determines how many times an activity will be retried on a single worker node once that node has started executing the activity. Please let me know if this is correct.

Temporal doesn’t perform any worker failure detection at this point. All failures are detected through timeouts only. So if a worker goes down, the activity that runs on it will time out after the StartToClose timeout. What happens after the timeout, or after any other failure, is defined by the retry options. If the retry policy says that the activity has to be retried, it might be retried on any worker that polls that task queue. It can be the same worker or any other.

An activity by default gets retry options with the following parameters:

  • InitialInterval = 1 second
  • BackoffCoefficient = 2.0
  • MaximumInterval = 100 * InitialInterval
  • MaximumAttempts = No limit
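As an illustration of what these defaults imply, the following plain-Java sketch lists the wait before each retry using the parameters above. The class and method names are hypothetical, and the arithmetic (multiply by the coefficient after each retry, cap at the maximum interval) is the usual exponential backoff model rather than code taken from Temporal.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only reproduces the backoff arithmetic.
class DefaultBackoff {

    // Returns the wait before each of the first `retries` retries.
    static List<Duration> retryIntervals(Duration initialInterval,
                                         double backoffCoefficient,
                                         Duration maximumInterval,
                                         int retries) {
        List<Duration> intervals = new ArrayList<>();
        double nextMs = initialInterval.toMillis();
        long capMs = maximumInterval.toMillis();
        for (int i = 0; i < retries; i++) {
            // Never wait longer than the maximum interval.
            intervals.add(Duration.ofMillis((long) Math.min(nextMs, capMs)));
            nextMs *= backoffCoefficient;
        }
        return intervals;
    }

    public static void main(String[] args) {
        Duration initial = Duration.ofSeconds(1);
        Duration maximum = initial.multipliedBy(100); // 100 * InitialInterval
        for (Duration wait : retryIntervals(initial, 2.0, maximum, 9)) {
            System.out.println(wait.getSeconds() + "s");
        }
    }
}
```

With these inputs the waits grow as 1s, 2s, 4s, and so on, until they reach the 100s cap at the eighth retry and stay there.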

Thanks, Maxim for explaining this.

Does Temporal use a similar mechanism for workflow tasks also? If yes, what are the timeouts and retry configurations for workflow tasks? Is it something we can modify using the Temporal API?

@amlanroy1980, the overall behavior is similar for workflow tasks. The main difference is that the workflow task timeout defaults to 10 seconds and can be specified in WorkflowOptions. Also, the server retries workflow tasks indefinitely (until the workflow times out). Note that normally you don’t want long-running workflow tasks: workflow code should only be used as control code, and all heavy lifting should be done in activities.

Thanks, Vitaly.

If we consider the following configuration,

WorkflowOptions workflowOptions = WorkflowOptions.newBuilder()
        .setWorkflowExecutionTimeout(Duration.ofSeconds(600))
        .setWorkflowTaskTimeout(Duration.ofSeconds(5))
        .setTaskQueue(Constants.TASK_QUEUE)
        .build();

each workflow task has a timeout of 5 seconds and the end-to-end workflow execution takes 10 mins. Is that correct?

Does this mean if any worker node crashes, the workflow tasks running in it will be restarted only after 5 seconds?

Given that workflow tasks are generally not long-running, can we configure it to 1 second for faster recovery from node failures?

Are local activity execution times also included in the workflow task execution time?

What is setWorkflowRunTimeout() used for?

each workflow task has a timeout of 5 seconds and the end-to-end workflow execution takes 10 mins. Is that correct?

Correct

Does this mean if any worker node crashes, the workflow tasks running in it will be restarted only after 5 seconds?

Correct.

Given that workflow tasks are generally not long-running, can we configure it to 1 second for faster recovery from node failures?

Note that the workflow task timeout includes the time to deliver the task to the worker, including loading the history, which might involve multiple round trips if the history is large. If you set the timeout shorter than the time needed to load the history, you might end up in a situation where tasks time out before the workflow task handler is invoked. As the workflow task timeout is immutable once a workflow has started, this situation is not recoverable.

Are local activity execution times also included in the workflow task execution time?

Yes, they are included. But if a local activity runs longer than the workflow task timeout, the task is completed and reopened in the background. This acts as workflow task heartbeating.

What is setWorkflowRunTimeout() used for?

WorkflowRunTimeout limits the time of a single workflow run.
WorkflowExecutionTimeout limits the overall workflow execution time, which includes all continue-as-new calls and workflow retries.

For example, setting WorkflowExecutionTimeout stops cron execution after the specified timeout. And setting WorkflowRunTimeout limits a single cron invocation duration.