Workflow and Activity timeout

Hi,
my questions are:

  1. If a workflow has the default execution timeout (which is infinite), and the pod it's running on crashes, will it be retried? If yes, how and when? If not, should I set the timeout to some finite value?

  2. Regarding activities, you say that start-to-close must be set in order for a timeout to be reached, but doesn't the heartbeat do this work? Namely, if start-to-close is infinite and the heartbeat is 20s, won't that solve the issue of crashing and retrying?

  3. For an activity, if maxRetries = 1 and the pod crashes, does that mean there will be no retries?

Thanks,
Shai


If a workflow has the default execution timeout (which is infinite), and the pod it's running on crashes, will it be retried?

Workflow executions are not tied to a specific worker. If a worker crashes, your executions can be continued by a worker in a different process. I would recommend watching this video, where Maxim explains this in more detail.

Unlike activities, workflows do not have a default retry policy; you have to explicitly enable retries via WorkflowOptions. If you enable workflow retries and your workflow execution fails or times out, it will be retried up to the WorkflowExecutionTimeout (or "infinitely" if you don't specify it). If your worker process is down, it can be retried on a different one.

By default, workflows do not fail on intermittent errors; they block workflow execution while waiting on a fix. You don't need to set up workflow retries for the worker-crash case, as, again, execution can be continued on a different worker process.
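If you do decide to enable workflow retries, a minimal sketch with the Python SDK could look roughly like this (it assumes a MyWorkflow workflow defined elsewhere; the workflow id, task queue, and timeout values are illustrative, not from this thread):

from datetime import timedelta
from temporalio.client import Client
from temporalio.common import RetryPolicy

async def start_with_retries():
    client = await Client.connect("localhost:7233")
    # Workflows are not retried unless a retry policy is passed explicitly.
    await client.execute_workflow(
        MyWorkflow.run,
        id="my-workflow-id",
        task_queue="my-task-queue",
        execution_timeout=timedelta(hours=1),          # WorkflowExecutionTimeout
        retry_policy=RetryPolicy(maximum_attempts=3),  # illustrative value
    )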

Regarding activities, you say that start-to-close must be set in order for a timeout to be reached

For activities you have to set either the StartToClose or the ScheduleToClose timeout; see this video for more info. Activities have a default retry policy, so in your case, where a worker crashes, the activity would be retried on a different worker process.
Activities are retried up to the set ScheduleToClose timeout. When activity retries are exhausted, an ActivityFailure is delivered to your workflow code.
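As a rough sketch with the Python SDK (my_activity_function stands in for your activity; the timeout values are illustrative):

from datetime import timedelta
from temporalio import workflow

@workflow.defn
class PollingWorkflow:
    @workflow.run
    async def run(self) -> None:
        # Either start_to_close_timeout or schedule_to_close_timeout must be set.
        await workflow.execute_activity(
            my_activity_function,
            start_to_close_timeout=timedelta(seconds=30),    # limit per attempt
            schedule_to_close_timeout=timedelta(minutes=5),  # limit across all retries
        )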

Namely, if start-to-close is infinite and the heartbeat is 20s, won't that solve the issue of crashing and retrying?

For long-running activities (when you have a long StartToClose timeout), a heartbeat timeout can be used to detect activity failures quickly; see here for more info. Again, in the case of a worker crash, your activity can be retried on a different worker process and continue execution there.

For an activity, if maxRetries = 1 and the pod crashes, does that mean there will be no retries?

Setting maxRetries to 1 in ActivityOptions disables activity retries. In the case of a worker failure, execution can be continued on a different worker process.
We recommend limiting retries via the ActivityOptions ScheduleToClose timeout rather than via maxRetries.
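For example, with the Python SDK, where what's called maxRetries above corresponds to maximum_attempts on the retry policy, a rough sketch of the two approaches (values illustrative):

from datetime import timedelta
from temporalio.common import RetryPolicy

# Not recommended: hard-limit the number of attempts.
one_attempt_only = RetryPolicy(maximum_attempts=1)

# Preferred: keep retries enabled and bound the total time instead by
# passing schedule_to_close_timeout to execute_activity, e.g.:
#   await workflow.execute_activity(
#       my_activity_function,
#       start_to_close_timeout=timedelta(seconds=30),
#       schedule_to_close_timeout=timedelta(minutes=10),
#   )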

Thanks for your reply.

So you are saying that a pod crash is NOT considered an activity failure, but rather a case where execution continues on another pod once the crash is detected. Correct?

Thanks,
Shai

Yes, your workflow execution would be able to continue on a different worker that is polling that task queue (and has the workflow and activity registered) in this case as well.

crash is NOT considered an activity failure

Temporal relies only on timeouts to detect an activity failure. So a pod crash by itself doesn't lead to an activity failure, but exceeding the StartToClose timeout causes an activity failure and a retry on a different pod.

Hi Maxim,
Thanks for your reply.

Can you please elaborate more on what happens when a pod crashes in the middle of an activity?
How is the activity "continued" on another pod/worker? From the beginning?
I'm asking because our functions are transactional and will work consistently only if the activity is "replayed" from its beginning.

Thanks,
Shai

  1. Activity starts executing
  2. Pod crashes
  3. Activity StartToClose timeout fires
  4. Activity is rescheduled according to its exponential retry policy (see the sketch after this list)
  5. Activity starts executing from the beginning on a different pod
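
The backoff in step 4 comes from the activity's retry policy. For reference, a Python SDK sketch with values that mirror the documented defaults (unlimited attempts, exponential backoff starting at one second):

from datetime import timedelta
from temporalio.common import RetryPolicy

# Values mirroring Temporal's default activity retry policy.
default_like_retry = RetryPolicy(
    initial_interval=timedelta(seconds=1),    # wait before the first retry
    backoff_coefficient=2.0,                  # each retry waits twice as long
    maximum_interval=timedelta(seconds=100),  # cap on the wait between retries
    maximum_attempts=0,                       # 0 means unlimited attempts
)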

Thanks!
Shai

Consider the following workflow example, in pseudocode (similar to what is written in the Temporal samples repo under Frequently Polling Activity):

class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        await workflow.execute_activity(my_activity_function, heartbeat_timeout=timedelta(seconds=30))

In this scenario, the activity my_activity_function sends a heartbeat on each iteration of a loop, as shown below (again, just pseudocode):

def my_activity_function():
    while True:
        poll_something_from_database()
        activity.heartbeat()  # report liveness to the server on each iteration

Typically, poll_something_from_database() takes between 5 and 20 seconds. However, due to occasional database slowdowns, there might be instances where it takes longer, such as 60 seconds.

Now, my concern is that when the heartbeat timeout of 30 seconds is reached, it would be necessary to terminate and force-kill the activity before the Temporal cluster starts a new instance on another worker.

Does Temporal itself perform this action and terminate the activity on the current worker before transferring it to another worker? The docs don't mention anything about this scenario and only cover worker crashes.

The heartbeat call will throw when the next heartbeat fails due to the timeout. See the sample.

If 60-second latency is expected, I would increase the heartbeat timeout to be longer than 60 seconds.
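For example, adjusting the earlier sketch (a start-to-close timeout is added here for completeness; the values are illustrative):

from datetime import timedelta
from temporalio import workflow

@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        # Give the heartbeat timeout headroom above the worst-case ~60s poll latency.
        await workflow.execute_activity(
            my_activity_function,
            start_to_close_timeout=timedelta(minutes=30),
            heartbeat_timeout=timedelta(seconds=90),
        )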
