Activity Recovery: Worker behaviour in case of crash

Liad_Davidson · May 21, 2023, 7:30am

Hi All,

I’ll try to explain the scenario we encountered, what I expected and what actually happened:

Scenario:

1. Started 2 identical workers.
2. Started a workflow with a single activity that takes 30 seconds to complete (no real logic, just a simple delay). The activity was configured with ScheduleToCloseTimeout = 5 minutes.
3. Forcibly shut down the worker that had picked up the activity, before the activity completed.

Expected:
The Temporal engine will detect that the first worker has crashed and that the workflow/activity hasn’t completed, and will let the second worker take over what’s left of the workflow execution.

Actual:
Workflow failed with a ScheduleToCloseTimeout exception after the timeout expired. No attempt was made to retry the execution.

Does Temporal.io provide any resilience in case of a worker crashing that I’ve missed?

I’m using the alpha release of the dotnet-sdk.

Thanks.

maxim · May 21, 2023, 11:28pm

Temporal doesn’t directly detect worker failures. It relies on individual activity timeouts for this. The timeout for a single activity attempt is StartToCloseTimeout. When this timeout expires, the activity is retried. You specified only ScheduleToCloseTimeout, so StartToCloseTimeout defaulted to the same 5-minute value. The first attempt therefore timed out after 5 minutes, but by then it had already used up ScheduleToCloseTimeout, which limits the total duration including retries, so a failure was reported to the workflow instead of a retry being scheduled.
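To make the arithmetic concrete, here is a rough timeline sketch in plain Python. It does not call any Temporal SDK; the function name `outcome_after_crash` and its parameters are made up for illustration, and it assumes the worker dies during the first attempt, the retry starts immediately on the healthy worker (no queue time or retry backoff), and the activity takes 30 seconds as in the question.

```python
from typing import Optional, Tuple


def outcome_after_crash(
    schedule_to_close_s: float,
    start_to_close_s: Optional[float] = None,
    activity_duration_s: float = 30.0,
) -> Tuple[str, float]:
    """Return (status, seconds since scheduling) when the first worker crashes.

    Hypothetical model for illustration only, not Temporal's implementation.
    """
    # As explained above, an unset StartToCloseTimeout falls back to
    # ScheduleToCloseTimeout.
    attempt_timeout_s = (
        schedule_to_close_s if start_to_close_s is None else start_to_close_s
    )
    # The crash is noticed only when the first attempt's timeout fires.
    detected_at_s = min(attempt_timeout_s, schedule_to_close_s)
    # Simplifying assumption: the retry starts right away on the second worker.
    retry_done_at_s = detected_at_s + activity_duration_s
    fits_attempt = activity_duration_s <= attempt_timeout_s
    fits_budget = retry_done_at_s <= schedule_to_close_s
    if fits_attempt and fits_budget:
        return ("completed", retry_done_at_s)
    # No room left for a successful retry inside ScheduleToCloseTimeout.
    return ("failed", schedule_to_close_s)


print(outcome_after_crash(300))      # only ScheduleToCloseTimeout set
print(outcome_after_crash(300, 40))  # StartToCloseTimeout of 40 seconds
```

Under these assumptions the first call reports a failure at the 300-second mark, which matches the behaviour in the question, while a 40-second StartToCloseTimeout lets the retry finish at roughly 70 seconds.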

If you cannot specify a shorter StartToCloseTimeout because the activity duration is unpredictable, you have to specify HeartbeatTimeout. The activity implementation should then record a heartbeat at least once per heartbeat timeout interval. In this case, the failure will be detected once HeartbeatTimeout elapses without a heartbeat.
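As a rough illustration of why heartbeats shorten detection time, the sketch below estimates when a crash would be noticed. It is plain Python rather than SDK code; `crash_detected_at` and its parameters are hypothetical, and it assumes the activity heartbeats at a fixed interval and that every heartbeat reaches the server immediately (real SDKs may throttle heartbeats).

```python
def crash_detected_at(
    crash_at_s: float,
    heartbeat_every_s: float,
    heartbeat_timeout_s: float,
    start_to_close_s: float,
) -> float:
    """Seconds after attempt start at which the server would notice the crash.

    Hypothetical model for illustration only.
    """
    # Time of the last heartbeat sent before the crash. A value of 0 means no
    # heartbeat was sent yet, so the timer is counted from the attempt start.
    last_heartbeat_s = (crash_at_s // heartbeat_every_s) * heartbeat_every_s
    # Whichever timeout fires first ends the attempt.
    return min(last_heartbeat_s + heartbeat_timeout_s, start_to_close_s)


# Crash 12 s in, heartbeat every 5 s, HeartbeatTimeout 10 s,
# StartToCloseTimeout 300 s.
print(crash_detected_at(12, 5, 10, 300))
```

In this model the crash is noticed about 20 seconds into the attempt instead of after the full 300-second StartToCloseTimeout.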

See the activity documentation, which describes all the timeout types in more detail.

taonic · May 21, 2023, 11:44pm

I’m late to the thread; please refer to Maxim’s answer first.

Does Temporal.io provide any resilience in case of a worker crashing that I’ve missed?

Absolutely. Temporal does not detect worker crashes directly. Instead, it relies on a number of timeout configurations to drive retry behaviour. See:

Note that the Temporal Server doesn’t detect Worker process failures directly. It relies on this timeout to detect that an Activity that didn’t complete on time.

https://www.javadoc.io/doc/io.temporal/temporal-sdk/latest/io/temporal/activity/ActivityOptions.Builder.html#setStartToCloseTimeout(java.time.Duration)

There are two activity timeout settings that tell Temporal when to perform retries: HeartbeatTimeout and StartToCloseTimeout. Note that the one used in your example, ScheduleToCloseTimeout, tells Temporal when to stop retrying.

To make your example work, you can set StartToCloseTimeout to, e.g., 40 seconds. Or use HeartbeatTimeout if you prefer more timely retries for long-running activities.

Liad_Davidson · May 23, 2023, 12:32pm

Thanks!

This information, along with @taonic’s addition, seems to have resolved my issue.
