I’ll try to explain the scenario we encountered, what I expected and what actually happened:
Scenario:
Started 2 identical workers
Started workflow with a single activity that takes 30 seconds to complete (no real logic, just a simple delay). Activity set with ScheduleToCloseTimeout=5 minutes
Forcibly shut down the worker that started work on the workflow with the activity before the activity completed.
Expected:
The Temporal engine will detect that the first worker has crashed and that the workflow/activity hasn’t completed, and will let the second worker take over what’s left of the workflow execution.
Actual:
Workflow failed with a ScheduleToCloseTimeout exception after the timeout expired. No attempt was made to retry the execution.
Does Temporal.io provide any resilience in case of a worker crashing that I’ve missed?
Temporal doesn’t directly detect worker failures. It relies on individual activity timeouts for this. The timeout for a single activity attempt is StartToCloseTimeout. When this timeout expires, the activity is retried. You specified only ScheduleToCloseTimeout, so StartToCloseTimeout defaulted to the same 5-minute value. The first attempt therefore timed out after 5 minutes, but by then ScheduleToCloseTimeout, which limits the total duration including retries, had also been exhausted, so a failure was reported to the workflow instead of a retry being scheduled.
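To make the arithmetic concrete, here is a rough Python model of the timeline described above. It is only a sketch, not Temporal code: the names `Outcome` and `simulate_first_worker_crash` are made up for illustration, and it assumes the first attempt starts immediately, the worker crashes during that attempt, and retry backoff is negligible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    status: str        # "completed" or "timed_out"
    at_seconds: float  # seconds since the activity was scheduled


def simulate_first_worker_crash(
    activity_seconds: float,
    schedule_to_close: float,
    start_to_close: Optional[float] = None,
) -> Outcome:
    """Toy model: the worker running attempt 1 dies, a healthy worker retries."""
    # With no explicit per-attempt timeout, it falls back to the overall one.
    if start_to_close is None:
        start_to_close = schedule_to_close
    if activity_seconds > start_to_close:
        raise ValueError("a healthy attempt would not fit in start_to_close")

    # The crashed worker never reports back, so the lost attempt is only
    # noticed when the per-attempt timeout fires.
    detected_at = min(start_to_close, schedule_to_close)

    # A retry on the surviving worker must still finish inside the
    # overall ScheduleToClose budget.
    finish_at = detected_at + activity_seconds
    if finish_at <= schedule_to_close:
        return Outcome("completed", finish_at)
    return Outcome("timed_out", schedule_to_close)
```

Under these assumptions, `simulate_first_worker_crash(30, 300)` reproduces the reported behaviour (timed out at 300 seconds), while passing an illustrative 40-second per-attempt timeout, `simulate_first_worker_crash(30, 300, 40)`, lets the retry complete at around 70 seconds.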
If you cannot specify a shorter StartToCloseTimeout because the activity duration is unpredictable, specify a HeartbeatTimeout instead. The activity implementation should then heartbeat at least once per HeartbeatTimeout interval. In this case, the failure is detected once HeartbeatTimeout elapses without a heartbeat.
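As an illustration of "heartbeat at least once per heartbeat timeout", the sketch below splits the work into slices of half the timeout and heartbeats after each slice. It deliberately does not use the Temporal SDK: `run_with_heartbeats` is a hypothetical helper, the `heartbeat` callback stands in for whatever heartbeat call your SDK provides, and the half-timeout slice size is just one conservative choice.

```python
import time
from typing import Callable


def run_with_heartbeats(
    total_seconds: float,
    heartbeat_timeout: float,
    heartbeat: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Do `total_seconds` of work in slices, heartbeating after each slice.

    Returns the number of heartbeats sent.
    """
    if heartbeat_timeout <= 0:
        raise ValueError("heartbeat_timeout must be positive")

    # Heartbeat twice per timeout window to leave headroom for delays.
    slice_seconds = heartbeat_timeout / 2
    remaining = total_seconds
    beats = 0
    while remaining > 0:
        step = min(slice_seconds, remaining)
        sleep(step)        # stand-in for one slice of real work
        remaining -= step
        heartbeat()        # stand-in for the SDK's heartbeat call
        beats += 1
    return beats
```

If the worker dies mid-loop, the heartbeats stop, and the attempt can be declared failed roughly one HeartbeatTimeout after the last heartbeat rather than after the full StartToCloseTimeout. The `sleep` parameter is injectable only so the loop can be exercised without actually waiting.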
Does Temporal.io provide any resilience in case of a worker crashing that I’ve missed?
Absolutely. Temporal does not detect worker crashes directly. Instead, it relies on a number of timeout configurations to drive retry behaviour. See:
Note that the Temporal Server doesn’t detect Worker process failures directly. It relies on this timeout to detect that an Activity didn’t complete on time.
To make your example work, you can set StartToCloseTimeout to, for example, 40 seconds. Or use HeartbeatTimeout if you prefer more timely retries for long-running activities.