Workflow activity getting killed when workers scale down

vinaya · February 21, 2024, 5:35pm

We are testing a scaling strategy for Temporal workers, using metrics such as temporal_activity_schedule_to_start_latency and workflow_task_schedule_to_start_latency as the scaling criteria. We see that scaling up and down does happen based on these metrics.

However, when scaling down, I see that some activities are sometimes killed if they do not finish within the cooldown period. Does Temporal automatically detect that those activities failed and schedule them to run on other worker nodes?

Thank you

maxim · February 21, 2024, 5:43pm

Temporal relies on timeouts to detect activity crashes. An activity is considered timed out once its StartToClose timeout (or its Heartbeat timeout, if one is specified) expires. After the timeout, the activity is retried according to its retry options.
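To make the timing concrete, here is a minimal Python sketch of when a killed worker's activity attempt would be declared timed out. It is not Temporal SDK code. It assumes a simplified model in which the StartToClose deadline counts from the start of the attempt and the Heartbeat deadline counts from the last heartbeat received (or from the start, if none arrived). The function name `detection_time`, its parameters, and the numbers used are all hypothetical.

```python
from typing import Optional


def detection_time(
    started_at: float,
    killed_at: float,
    start_to_close: float,
    heartbeat_timeout: Optional[float] = None,
    heartbeat_every: Optional[float] = None,
) -> float:
    """Return the time (in seconds) at which the attempt is timed out.

    Simplified, assumed model: the worker heartbeats every
    `heartbeat_every` seconds until it is killed at `killed_at`.
    """
    # Without heartbeats, only the StartToClose deadline can fire.
    deadline = started_at + start_to_close
    if heartbeat_timeout is None or heartbeat_every is None:
        return deadline

    # Time of the last heartbeat sent before the worker was killed.
    beats = int((killed_at - started_at) // heartbeat_every)
    last_beat = started_at + beats * heartbeat_every

    # Whichever deadline comes first ends the attempt.
    return min(deadline, last_beat + heartbeat_timeout)


if __name__ == "__main__":
    # Worker killed 95 s into an activity with a 600 s StartToClose timeout.
    print(detection_time(0, 95, 600))
    # Same activity, heartbeating every 10 s with a 30 s Heartbeat timeout.
    print(detection_time(0, 95, 600, heartbeat_timeout=30, heartbeat_every=10))
```

Under these assumptions, the activity without heartbeats is only retried once the full 600 seconds have passed, while the heartbeating one is detected at 120 seconds. This is why a short Heartbeat timeout is usually suggested for long-running activities on workers that may be scaled down.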

Here are the default activity retry options:

[Screenshot: default activity retry options]
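The retry options determine how long a timed-out activity waits before its next attempt. The sketch below computes that wait schedule in plain Python. The default values in it (initial interval of 1 second, backoff coefficient of 2.0, maximum interval of 100 times the initial interval, unlimited attempts) are taken from Temporal's documented defaults as I understand them, not read from this post, so treat them as an assumption and check them against your SDK version. The function name `retry_delays` is hypothetical.

```python
from itertools import islice
from typing import Iterator, Optional


def retry_delays(
    initial_interval: float = 1.0,
    backoff_coefficient: float = 2.0,
    maximum_interval: Optional[float] = None,
) -> Iterator[float]:
    """Yield the wait (in seconds) before each successive retry.

    Assumed defaults: the maximum interval falls back to
    100 x initial_interval, and attempts are unlimited.
    """
    if maximum_interval is None:
        maximum_interval = 100 * initial_interval

    delay = min(initial_interval, maximum_interval)
    while True:
        yield delay
        # Exponential backoff, capped at the maximum interval.
        delay = min(delay * backoff_coefficient, maximum_interval)


if __name__ == "__main__":
    # Waits before the first nine retries under the assumed defaults.
    print(list(islice(retry_delays(), 9)))
```

With these assumed defaults, the first retry follows the timeout by about one second, so the delay in recovering from a scaled-down worker is dominated by the timeout itself rather than by the backoff.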

vinaya · February 22, 2024, 1:35am

That’s great. Thank you!
