Activity Retry after Worker restart

cg1972 · July 2, 2021, 6:28am

I had a situation today where an activity running in a workflow caused a memory spike and forced our worker service in Kubernetes to restart. After the worker service started back up, the activity was not reprocessed and eventually timed out based on the scheduleToCloseTimeout setting.

Would this behaviour be based on the retry settings for that Activity, or are there other settings that could cause the activity not to be retried? When the worker becomes unavailable like this, how does Temporal know and reschedule the activity to a new worker?

Also, is there a better / quicker way to know when the activity has died and needs to be retried? I was thinking that some kind of timeout based on an activity heartbeat would be a lot quicker than having to wait for the scheduleToCloseTimeout to trigger.

maxim · July 2, 2021, 4:01pm

Temporal doesn’t detect a worker failure directly. It relies on the activity StartToClose timeout. My guess is that you are not setting the StartToClose timeout or it is set to the same value as the ScheduleToClose timeout.
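
To illustrate the guess above, here is a minimal sketch of why equal StartToClose and ScheduleToClose values leave no room for a retry. It is a simplified model, not Temporal's actual scheduling code: the function name `retry_fits`, the one-second backoff, and the assumption that the attempt starts the moment it is scheduled are all made up for the example.

```python
from datetime import timedelta


def retry_fits(start_to_close: timedelta,
               schedule_to_close: timedelta,
               retry_backoff: timedelta = timedelta(seconds=1)) -> bool:
    """Simplified model: can a second attempt start after the first
    attempt is lost to a worker crash?

    Assumes the attempt starts as soon as it is scheduled and that the
    crash is only noticed when StartToClose expires.
    """
    # The lost attempt is declared failed when StartToClose fires.
    detected_at = start_to_close
    # A retry has to begin before the overall ScheduleToClose deadline.
    next_attempt_at = detected_at + retry_backoff
    return next_attempt_at < schedule_to_close


# Both timeouts equal: the overall deadline is reached before a retry can start.
same = retry_fits(timedelta(minutes=10), timedelta(minutes=10))
# ScheduleToClose larger than StartToClose: a retry fits.
roomy = retry_fits(timedelta(minutes=10), timedelta(minutes=30))
print(same, roomy)  # False True
```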

See this blog post and the associated video, which explain the timeouts.

cg1972 · July 2, 2021, 10:08pm

Thanks Maxim for the information and link - this was exactly what I was looking for.

You were correct: in our case startToClose and scheduleToClose were both set to the same value, so I can see why the Activity did not retry. Is there a general ‘rule of thumb’ for setting these timeouts, e.g. if startToClose = x, should scheduleToClose be 3 × startToClose to allow for worker failures, etc.?

Using a heartbeat timeout would allow the startToClose and scheduleToClose times to be larger values and still trigger the retries in a timely manner.

maxim · July 2, 2021, 10:30pm

The StartToClose timeout has to be equal to the longest possible activity execution time. If that is long, use HeartbeatTimeout with heartbeating to fail fast. I recommend not setting the ScheduleToClose timeout at all. This way an activity is retried until the underlying issue is fixed.
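
A rough worked example of the "fail fast" point, again as a simplified model rather than Temporal's implementation. The function `crash_detected_after` and the numbers (a one-hour StartToClose, a 30-second HeartbeatTimeout, a crash five minutes in) are assumptions chosen for illustration; all times are offsets from the start of the attempt.

```python
from datetime import timedelta
from typing import Optional


def crash_detected_after(crash_at: timedelta,
                         last_heartbeat_at: timedelta,
                         start_to_close: timedelta,
                         heartbeat_timeout: Optional[timedelta] = None) -> timedelta:
    """Return how long after the crash the attempt is declared timed out."""
    # Without heartbeats, only StartToClose (measured from attempt start) can fire.
    deadline = start_to_close
    if heartbeat_timeout is not None:
        # With heartbeats, the attempt also times out once no heartbeat
        # has arrived within heartbeat_timeout of the last one received.
        deadline = min(deadline, last_heartbeat_at + heartbeat_timeout)
    return deadline - crash_at


crash = timedelta(minutes=5)
last_beat = timedelta(minutes=4, seconds=50)

without_hb = crash_detected_after(crash, last_beat, timedelta(hours=1))
with_hb = crash_detected_after(crash, last_beat, timedelta(hours=1),
                               heartbeat_timeout=timedelta(seconds=30))
print(without_hb, with_hb)  # 0:55:00 0:00:20
```

In this model the retry can be scheduled 20 seconds after the crash instead of 55 minutes, while StartToClose stays large enough for the longest legitimate run.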

cg1972 · July 2, 2021, 11:08pm

Thanks Maxim, that makes sense.
