Workflow and Activity timeout

Hi,
my questions are:

  1. If a workflow has the default execution timeout (which is infinite), and the pod it's running on crashes, will it be retried? If yes, how and when? If not, should I set the timeout to some finite value?

  2. Regarding activities, you say that start-to-close must be set in order for a timeout to be reached, but doesn't the heartbeat do this work? Namely, if start-to-close is infinite and the heartbeat is 20s, won't that solve the issue of crashing and retrying?

  3. For an activity, if maxRetries = 1 and the pod crashes, does that mean there will be no retries?

Thanks,
Shai


If a workflow has the default execution timeout (which is infinite), and the pod it's running on crashes, will it be retried?

Workflow executions are not tied to a specific worker. If a worker crashes, your executions can be continued by a worker in a different process. I would recommend watching this video, where Maxim explains this in more detail.

Unlike activities, workflows do not have a default retry policy; you have to explicitly enable retries via WorkflowOptions. If you enable workflow retries and your workflow execution fails or times out, it will be retried up to the WorkflowExecutionTimeout (or "infinitely" if you don't specify it). If your worker process is down, it can be retried on a different one.

By default, workflows do not fail on intermittent errors; they block workflow execution while waiting on a fix. You don't need to set up workflow retries for the worker-crash case, as, again, execution can be continued on a different worker process.
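If you do decide to enable workflow retries, a minimal sketch with the Python SDK could look roughly like this (it assumes a MyWorkflow workflow defined elsewhere; the workflow id, task queue, and timeout values are illustrative, not from this thread):

from datetime import timedelta
from temporalio.client import Client
from temporalio.common import RetryPolicy

async def start_with_retries():
    client = await Client.connect("localhost:7233")
    # Workflows are not retried unless a retry policy is passed explicitly.
    await client.execute_workflow(
        MyWorkflow.run,
        id="my-workflow-id",
        task_queue="my-task-queue",
        execution_timeout=timedelta(hours=1),          # WorkflowExecutionTimeout
        retry_policy=RetryPolicy(maximum_attempts=3),  # illustrative value
    )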

Regarding activities, you say that start-to-close must be set in order for a timeout to be reached

For activities you have to set either the StartToClose or the ScheduleToClose timeout; see this video for more info. Activities have a default retry policy, so in your case, where a worker crashes, the activity would be retried on a different worker process.
Activities are retried up to the set ScheduleToClose timeout. When activity retries are exhausted, an ActivityFailure is delivered to your workflow code.
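As a rough sketch with the Python SDK (my_activity_function stands in for your activity; the timeout values are illustrative):

from datetime import timedelta
from temporalio import workflow

@workflow.defn
class PollingWorkflow:
    @workflow.run
    async def run(self) -> None:
        # Either start_to_close_timeout or schedule_to_close_timeout must be set.
        await workflow.execute_activity(
            my_activity_function,
            start_to_close_timeout=timedelta(seconds=30),    # limit per attempt
            schedule_to_close_timeout=timedelta(minutes=5),  # limit across all retries
        )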

Namely, if start-to-close is infinite and the heartbeat is 20s, won't that solve the issue of crashing and retrying?

For long-running activities (when you have a long StartToClose timeout), a heartbeat timeout can be used to detect activity failures quickly; see here for more info. Again, in the case of a worker crash, your activity can be retried on a different worker process and continue execution there.

For an activity, if maxRetries = 1 and the pod crashes, does that mean there will be no retries?

Setting maxRetries to 1 in ActivityOptions disables activity retries. In the case of a worker failure, execution can be continued on a different worker process.
We recommend limiting retries via the ActivityOptions ScheduleToClose timeout rather than via maxRetries.
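For example, with the Python SDK, where what's called maxRetries above corresponds to maximum_attempts on the retry policy, a rough sketch of the two approaches (values illustrative):

from datetime import timedelta
from temporalio.common import RetryPolicy

# Not recommended: hard-limit the number of attempts.
one_attempt_only = RetryPolicy(maximum_attempts=1)

# Preferred: keep retries enabled and bound the total time instead by
# passing schedule_to_close_timeout to execute_activity, e.g.:
#   await workflow.execute_activity(
#       my_activity_function,
#       start_to_close_timeout=timedelta(seconds=30),
#       schedule_to_close_timeout=timedelta(minutes=10),
#   )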

Thanks for your reply.

So you are saying that a pod crash is NOT considered an activity failure, but rather a case where execution continues on another pod once the crash is detected. Correct?

Thanks,
Shai

Yes, your workflow execution would be able to continue on a different worker that is polling that task queue (and has the workflow and activity registered) in this case as well.

crash is NOT considered an activity failure

Temporal relies only on timeouts to detect an activity failure. So a pod crash by itself doesn't lead to an activity failure, but exceeding the StartToClose timeout causes an activity failure and a retry on a different pod.

Hi Maxim,
Thanks for your reply.

Can you please elaborate more on what happens when a pod crashes in the middle of an activity?
How is the activity "continued" on another pod/worker? From the beginning?
I'm asking because our functions are transactional and will work consistently only if the activity is "replayed" from its beginning.

Thanks,
Shai

  1. Activity starts executing
  2. Pod crashes
  3. Activity StartToClose timeout fires
  4. Activity is rescheduled according to its exponential retry policy (see the sketch after this list)
  5. Activity starts executing from the beginning on a different pod
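
The backoff in step 4 comes from the activity's retry policy. For reference, a Python SDK sketch with values that mirror the documented defaults (unlimited attempts, exponential backoff starting at one second):

from datetime import timedelta
from temporalio.common import RetryPolicy

# Values mirroring Temporal's default activity retry policy.
default_like_retry = RetryPolicy(
    initial_interval=timedelta(seconds=1),    # wait before the first retry
    backoff_coefficient=2.0,                  # each retry waits twice as long
    maximum_interval=timedelta(seconds=100),  # cap on the wait between retries
    maximum_attempts=0,                       # 0 means unlimited attempts
)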

Thanks!
Shai

Consider the following workflow example, in pseudocode (similar to what is written in the Temporal samples repo under Frequently Polling Activity):

class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        await workflow.execute_activity(my_activity_function, heartbeat_timeout=timedelta(seconds=30))

In this scenario, the activity my_activity_function sends a heartbeat on each iteration of a loop, as shown below (again, just pseudocode):

def my_activity_function():
    while True:
        poll_something_from_database()
        activity.heartbeat()  # report liveness to the server on each iteration

Typically, poll_something_from_database() takes between 5 and 20 seconds. However, due to occasional database slowdowns, there might be instances where it takes longer, such as 60 seconds.

Now, my concern is that when the heartbeat timeout of 30 seconds is reached, it would be necessary to terminate and force-kill the activity before the Temporal cluster starts a new instance on another worker.

Does Temporal itself perform this action and terminate the activity on the current worker before transferring it to another worker? The docs don't mention anything about this scenario and only cover worker crashes.

The heartbeat call will throw when the next heartbeat fails due to the timeout. See the sample.

If 60-second latency is expected, I would increase the heartbeat timeout to be longer than 60 seconds.
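For example, adjusting the earlier sketch (a start-to-close timeout is added here for completeness; the values are illustrative):

from datetime import timedelta
from temporalio import workflow

@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        # Give the heartbeat timeout headroom above the worst-case ~60s poll latency.
        await workflow.execute_activity(
            my_activity_function,
            start_to_close_timeout=timedelta(minutes=30),
            heartbeat_timeout=timedelta(seconds=90),
        )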
