Activity Recovery: Worker behaviour in case of crash

Hi All,

I’ll try to explain the scenario we encountered, what I expected and what actually happened:

Scenario:

  1. Started 2 identical workers
  2. Started a workflow with a single activity that takes 30 seconds to complete (no real logic, just a simple delay). The activity was configured with ScheduleToCloseTimeout = 5 minutes.
  3. Forcibly shut down the worker that had picked up the activity, before the activity completed.

Expected:
The Temporal engine will detect that the first worker has crashed and that the workflow/activity hasn’t completed, and will let the second worker take over what’s left of the workflow execution.

Actual:
Workflow failed with a ScheduleToCloseTimeout exception after the timeout expired. No attempt was made to retry the execution.

Does Temporal.io provide any resilience in case of a worker crashing that I’ve missed?

I’m using the alpha release of the dotnet-sdk.

Thanks.

Temporal doesn’t directly detect worker failures. It relies on individual activity timeouts for this. The timeout for a single activity attempt is StartToCloseTimeout; when it expires, the activity is retried. You specified only ScheduleToCloseTimeout, so StartToCloseTimeout defaulted to the same 5-minute value. The first attempt therefore timed out after 5 minutes, but by then ScheduleToCloseTimeout, which limits the total duration including retries, had also been reached, so a failure was reported to the workflow instead of a retry being scheduled.

If you cannot specify a shorter StartToCloseTimeout because the activity duration is unpredictable, you have to specify HeartbeatTimeout. The activity implementation should then record a heartbeat at least once per heartbeat timeout interval. In this case, the failure will be detected once the HeartbeatTimeout elapses without a heartbeat.
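To make the heartbeat case concrete, here is a rough back-of-the-envelope sketch in plain Python. It does not use any Temporal SDK. The function name `crash_detected_at` and the numbers in the usage lines are made up for illustration, and the model assumes heartbeats arrive exactly on schedule and that the heartbeat timer starts when the attempt starts.

```python
def crash_detected_at(
    crash_at: float,
    heartbeat_interval: float,
    heartbeat_timeout: float,
    start_to_close: float,
) -> float:
    """Seconds after attempt start at which the attempt is declared timed out.

    Toy model (an assumption, not SDK code): the activity heartbeats every
    `heartbeat_interval` seconds until its worker dies at `crash_at`.
    """
    # Last heartbeat recorded before the crash (0 means none was recorded,
    # so the timer is counted from the start of the attempt).
    last_heartbeat = (crash_at // heartbeat_interval) * heartbeat_interval

    # The attempt fails when no heartbeat has arrived for `heartbeat_timeout`.
    heartbeat_deadline = last_heartbeat + heartbeat_timeout

    # StartToCloseTimeout still caps the attempt regardless of heartbeats.
    return min(heartbeat_deadline, start_to_close)


if __name__ == "__main__":
    # Worker dies 12 s in, heartbeating every 5 s, HeartbeatTimeout = 10 s,
    # StartToCloseTimeout = 300 s.
    print(crash_detected_at(12, 5, 10, 300))  # 20
```

In this model the crash is noticed 20 seconds into the attempt rather than at the 300-second StartToCloseTimeout, which is why a heartbeat suits long or unpredictable activities.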

See the activity documentation, which describes all types of timeout in more detail.


I’m late to this; please refer to Maxim’s answer first.

Does Temporal.io provide any resilience in case of a worker crashing that I’ve missed?

Absolutely. Temporal does not detect worker crashes directly. Instead, it relies on a number of timeout configurations to drive retry behaviour. See:

Note that the Temporal Server doesn’t detect Worker process failures directly. It relies on this timeout to detect that an Activity didn’t complete on time.

https://www.javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/activity/ActivityOptions.Builder.html#setStartToCloseTimeout(java.time.Duration)

There are two activity timeout settings that tell Temporal when to perform retries: HeartbeatTimeout and StartToCloseTimeout. Note that the one used in your example, ScheduleToCloseTimeout, tells Temporal when to stop retrying.

To make your example work, you can set StartToCloseTimeout to, e.g., 40 seconds. Or use HeartbeatTimeout if you prefer more timely retries for long-running activities.
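As a sanity check on the numbers in this thread, here is a small plain-Python model of the timeline. It is only a sketch, not Temporal SDK code: the function name `simulate_first_worker_crash` is made up, and the 1-second `retry_delay` is an assumption standing in for whatever the retry policy dictates.

```python
from typing import Optional, Tuple


def simulate_first_worker_crash(
    schedule_to_close: float,
    activity_duration: float,
    start_to_close: Optional[float] = None,
    retry_delay: float = 1.0,  # assumed delay before the retry starts
) -> Tuple[str, float]:
    """Toy model: attempt 1 is lost to a crash, attempt 2 runs on a healthy worker.

    Returns (status, seconds since the activity was scheduled).
    """
    # If only ScheduleToCloseTimeout is set, StartToCloseTimeout takes its value.
    if start_to_close is None:
        start_to_close = schedule_to_close

    # Attempt 1: the worker died silently, so nothing happens until the
    # per-attempt timeout (capped by the overall timeout) fires.
    detected_at = min(start_to_close, schedule_to_close)

    # Attempt 2 would run on the surviving worker.
    retry_started_at = detected_at + retry_delay
    finished_at = retry_started_at + activity_duration

    fits_attempt = activity_duration <= start_to_close
    fits_overall = finished_at <= schedule_to_close
    if detected_at < schedule_to_close and fits_attempt and fits_overall:
        return ("completed", finished_at)
    return ("timed_out", schedule_to_close)


if __name__ == "__main__":
    # Original setup: only ScheduleToCloseTimeout = 5 minutes, 30 s activity.
    print(simulate_first_worker_crash(300, 30))  # ('timed_out', 300)
    # Same setup with StartToCloseTimeout = 40 s added.
    print(simulate_first_worker_crash(300, 30, start_to_close=40))  # ('completed', 71.0)
```

Under these assumptions the original configuration leaves no room for a second attempt, while a 40-second StartToCloseTimeout lets the second worker finish at roughly 71 seconds.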

Thanks!

This information, along with @taonic’s addition, seems to have resolved my issue.