Workflow and Activity timeout

Hi,
my questions are:

  1. If a workflow has the default execution timeout (which is infinite) and the pod it’s running on crashes, will it be retried? If yes, how and when? If not, should I set the timeout to some finite value?

  2. Regarding activities, you say that start-to-close must be set in order to reach a timeout, but doesn’t the heartbeat do this work? Namely, if start-to-close is infinite and the heartbeat is 20s, won’t that solve the issue of crashing and retrying?

  3. For an activity, if maxRetries = 1 and the pod crashes, does that mean there will be no retries?

Thanks,
Shai

if some workflow has the default execution timeout (which is infinite), and the pod it’s running on crashes, will it be retried?

Workflow executions are not tied to a specific worker. If a worker crashes, your executions can be continued on a worker in a different process. I’d recommend watching this video, where Maxim explains this in more detail.

Unlike activities, workflows do not have a default retry policy; you have to specifically enable retries via WorkflowOptions. If you enable workflow retries and your workflow execution fails or times out, it will be retried up to the WorkflowExecutionTimeout (or indefinitely if you don’t specify one). If your worker process is down, the execution can be retried on a different one.
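As a toy illustration of that bound (plain Python, not Temporal SDK code — the function name and parameters are made up for this sketch):

```python
# Toy sketch: workflow retries, if enabled, are bounded by
# WorkflowExecutionTimeout; leaving the timeout unset means
# the execution can keep being retried indefinitely.
def may_retry(elapsed_seconds, execution_timeout_seconds=None):
    """Return True if another workflow retry is still allowed."""
    if execution_timeout_seconds is None:  # default: no expiration
        return True
    return elapsed_seconds < execution_timeout_seconds

print(may_retry(500.0))         # True  (no execution timeout set)
print(may_retry(500.0, 300.0))  # False (past the execution timeout)
```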

By default, workflows do not fail on intermittent errors; they block workflow execution waiting on a fix. You don’t need to set up workflow retries for the worker-crash case, since, again, execution can be continued on a different worker process.

regarding activity, you say that start-to-close must be set in order to reach timeout

For activities you have to set either the StartToClose or the ScheduleToClose timeout; see this video for more info. Activities have a default retry policy, so in your case, where a worker crashes, the activity would be retried on a different worker process.
Activities are retried up to the configured ScheduleToClose timeout. When activity retries are exhausted, an ActivityFailure is delivered to your workflow code.
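A minimal sketch of that requirement (illustrative Python, not the SDK’s actual validation code; the function name is hypothetical):

```python
# Illustrative only: an activity needs at least one of StartToClose or
# ScheduleToClose; the effective start-to-close bound falls back to
# ScheduleToClose when only the latter is set.
def effective_start_to_close(start_to_close=None, schedule_to_close=None):
    if start_to_close is None and schedule_to_close is None:
        raise ValueError("set StartToClose or ScheduleToClose timeout")
    return start_to_close if start_to_close is not None else schedule_to_close

print(effective_start_to_close(schedule_to_close=600))  # 600
```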

namely, if start-to-close is infinite and heartbeat is 20s, won’t it solve the issue of crashing and retrying?

For long-running activities (those with a long StartToClose timeout), a heartbeat timeout can be used to detect activity failures quickly; see here for more info. Here again, in case of a worker crash, your activity can be retried on a different worker process and continue execution.
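To see why the heartbeat helps, here is a rough timing sketch (assumed numbers, plain Python, not SDK code): with a long StartToClose and a 20 s heartbeat timeout, the server notices a dead worker roughly 20 s after the last heartbeat instead of waiting out the full StartToClose.

```python
# Rough timing sketch, not SDK code. Times are seconds from activity start.
def failure_detected_at(last_heartbeat, heartbeat_timeout, start_to_close):
    """Earliest time the server can declare this activity attempt failed."""
    if heartbeat_timeout is not None:
        return last_heartbeat + heartbeat_timeout
    return start_to_close

print(failure_detected_at(95.0, 20.0, 3600.0))  # 115.0 -> retry starts quickly
print(failure_detected_at(95.0, None, 3600.0))  # 3600.0 -> a full hour wasted
```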

for activity, if maxRetries =1 and the pod crashes, it means that there will be no retries?

Setting maxRetries to 1 in ActivityOptions disables activity retries. In case of worker failure, the workflow execution itself can still be continued on a different worker process.
We recommend limiting retries via the ActivityOptions ScheduleToClose timeout rather than via maxRetries.
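For intuition, a small sketch (stdlib Python, assuming default-like backoff values of a 1 s initial interval and coefficient 2.0 — not SDK code) of how a ScheduleToClose window naturally bounds the number of attempts:

```python
# Sketch: count how many attempts fit inside a ScheduleToClose window
# under exponential backoff, instead of hard-capping with maxRetries.
def attempts_within(schedule_to_close, initial_interval=1.0,
                    backoff=2.0, max_interval=100.0):
    elapsed, interval, attempts = 0.0, initial_interval, 1
    while elapsed + interval <= schedule_to_close:
        elapsed += interval                       # wait out the backoff
        attempts += 1                             # next retry starts
        interval = min(interval * backoff, max_interval)
    return attempts

print(attempts_within(60))  # 6: retries after 1+2+4+8+16 s fit in 60 s
```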

thanks for your reply.

so you are saying that a pod crash is NOT considered an activity failure, but a case in which execution continues on another pod once the crash is detected. Correct?

Thanks,
Shai

Yes, your workflow execution can be continued on a different worker that is polling on that task queue (and has the workflow and activity registered) in this case as well.

crash is NOT considered as an activity failure

Temporal relies only on timeouts to detect an activity failure, so a pod crash by itself doesn’t lead to one. Exceeding the StartToClose timeout, however, causes an activity failure and a retry on a different pod.

hi Maxim,
Thanks for your reply.

Can you please elaborate on what happens when a pod crashes in the middle of an activity?
How is the activity “continued” on another pod/worker? From the beginning?
I’m asking because our functions are transactional, and they will work consistently only if the activity is “replayed” from the beginning!

Thanks,
Shai

  1. Activity starts executing
  2. Pod crashes
  3. Activity StartToClose timeout fires
  4. Activity is rescheduled according to its exponential retry policy
  5. Activity starts executing from the beginning on a different pod
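The sequence above can be sketched as a toy simulation (plain Python, not Temporal code) — note that the second attempt re-runs the activity body from the top, which is why transactional/idempotent activity code matters:

```python
# Toy simulation of steps 1-5: a crash on attempt 1, a StartToClose-driven
# retry, and the activity body restarting from the beginning on attempt 2.
def run_with_retries(activity, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return activity(attempt)
        except RuntimeError:      # stand-in for "StartToClose timeout fired"
            continue              # reschedule per the retry policy
    raise Exception("retries exhausted")

log = []
crashes_left = [1]

def transactional_activity(attempt):
    log.append(f"attempt {attempt}: begin transaction")  # always from the top
    if crashes_left[0] > 0:                              # simulate pod crash
        crashes_left[0] -= 1
        raise RuntimeError("pod crashed mid-activity")
    log.append(f"attempt {attempt}: commit")
    return "done"

print(run_with_retries(transactional_activity))  # done
print(log)  # attempt 2 restarted from "begin transaction"
```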

thanks!
Shai