I ran into a situation today where an activity running in a workflow caused a memory spike and forced our worker service in Kubernetes to restart. After the worker service came back up, the activity was not reprocessed and eventually timed out based on the scheduleToCloseTimeout setting.
Is this behaviour driven by the retry settings for that activity, or are there other settings that could cause the activity not to be retried? And in a case like this, where the worker becomes unavailable, how does Temporal know this and reschedule the activity on a new worker?
Also, is there a better/quicker way to know when an activity has died and needs to be retried? I was thinking that some kind of timeout based on an activity heartbeat would be a lot quicker than having to wait for the scheduleToCloseTimeout to trigger.
Temporal doesn’t detect a worker failure directly; it relies on the activity StartToClose timeout. My guess is that you are either not setting the StartToClose timeout or setting it to the same value as the ScheduleToClose timeout.
See this blog post and associated video that explain the timeouts.
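As a rough sketch of how these timeouts interact (Go SDK; the durations and the ProcessBatch activity name are just illustrative assumptions): keeping StartToClose bound to a single attempt leaves room for the server to retry on another worker, whereas setting it equal to ScheduleToClose lets the first dead attempt consume the entire retry window.

```go
package sample

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func MyWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		// Maximum duration of a single attempt. When a worker dies mid-attempt,
		// this is the timeout that fires and lets the server schedule a retry.
		StartToCloseTimeout: 5 * time.Minute,
		// Overall budget across all attempts. If this equals StartToCloseTimeout,
		// there is no time left for a retry after the first attempt times out.
		ScheduleToCloseTimeout: 30 * time.Minute,
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	// "ProcessBatch" is a hypothetical activity registered on the worker.
	return workflow.ExecuteActivity(ctx, "ProcessBatch").Get(ctx, nil)
}
```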
Thanks, Maxim, for the information and link - this was exactly what I was looking for.
You were correct: in our case the startToClose and scheduleToClose were both set to the same value, so I can see why the activity did not retry. Is there a general ‘rule of thumb’ for setting these timeouts, e.g. if startToClose = x, should scheduleToClose be 3x startToClose to take worker failures into account?
Using a heartbeat timeout would also allow the startToClose and scheduleToClose values to be larger while still triggering retries in a timely manner.
The StartToClose timeout has to cover the longest possible activity execution time. If it is long, then use a HeartbeatTimeout with heartbeating to fail fast. I recommend not setting the ScheduleToClose timeout at all; this way an activity is retried until the underlying issue is fixed.
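A minimal sketch of that recommendation in the Go SDK, assuming a hypothetical long-running ProcessBatch activity and illustrative durations:

```go
package sample

import (
	"context"
	"time"

	"go.temporal.io/sdk/activity"
	"go.temporal.io/sdk/workflow"
)

func MyWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		// Longest possible duration of a single attempt.
		StartToCloseTimeout: 2 * time.Hour,
		// Missed heartbeats fail the attempt within roughly this window,
		// so a dead worker is detected quickly and the activity is retried.
		HeartbeatTimeout: 30 * time.Second,
		// ScheduleToCloseTimeout is intentionally left unset so retries
		// continue until the underlying issue is fixed.
	}
	ctx = workflow.WithActivityOptions(ctx, ao)
	return workflow.ExecuteActivity(ctx, ProcessBatch).Get(ctx, nil)
}

// ProcessBatch is a hypothetical long-running activity that heartbeats
// as it makes progress.
func ProcessBatch(ctx context.Context) error {
	for i := 0; i < 1000; i++ {
		// ... do a unit of work ...
		activity.RecordHeartbeat(ctx, i) // progress details, also usable on resume
	}
	return nil
}
```

With no ScheduleToClose set, the default activity retry policy keeps retrying each failed or timed-out attempt with exponential backoff.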