Workflow activity getting killed when workers scale down

We are testing a scaling strategy for Temporal workers, using metrics such as temporal_activity_schedule_to_start_latency and workflow_task_schedule_to_start_latency as scaling criteria. Scaling up and down happens as expected based on these metrics.

However, when scaling down, I see that some activities occasionally get killed if they do not finish within the cooldown period. Does Temporal automatically detect that those activities failed and reschedule them on other worker nodes?

Thank you

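Temporal relies on timeouts to detect activity crashes. An activity attempt is considered timed out once its StartToClose timeout (or its Heartbeat timeout, if one is specified) elapses. After the timeout, the activity is retried according to its retry options, and the retry can be picked up by any worker polling the same task queue.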
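To make the timing concrete, here is a rough sketch of how long it could take before a killed activity attempt is noticed. This is a simplified model in plain Python, not Temporal SDK code: the function `seconds_until_timeout` and its parameters are hypothetical names, and it assumes the StartToClose timeout counts from the start of the attempt while the Heartbeat timeout counts from the last heartbeat received.

```python
from typing import Optional


def seconds_until_timeout(
    start_to_close: float,
    elapsed_at_kill: float,
    heartbeat_timeout: Optional[float] = None,
    since_last_heartbeat: float = 0.0,
) -> float:
    """Seconds between the worker being killed and the attempt timing out.

    Simplified model (assumption): all arguments are in seconds.
    """
    # StartToClose is measured from the start of the attempt.
    remaining = max(start_to_close - elapsed_at_kill, 0.0)
    if heartbeat_timeout is None:
        return remaining
    # The Heartbeat timeout is measured from the last heartbeat received.
    heartbeat_remaining = max(heartbeat_timeout - since_last_heartbeat, 0.0)
    # Whichever timeout fires first ends the attempt.
    return min(remaining, heartbeat_remaining)


# 10-minute StartToClose, worker killed 2 minutes in, no heartbeats.
print(seconds_until_timeout(600.0, 120.0))  # 480.0

# Same activity with a 30-second Heartbeat timeout, last heartbeat 10 s before the kill.
print(seconds_until_timeout(600.0, 120.0, heartbeat_timeout=30.0, since_last_heartbeat=10.0))  # 20.0
```

Under this model, a long-running activity with no Heartbeat timeout may sit unnoticed for most of its StartToClose window after a scale-down, which is why a Heartbeat timeout tends to matter for long activities.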

Here are the default activity retry options:

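Initial Interval: 1 second
Backoff Coefficient: 2.0
Maximum Interval: 100 × Initial Interval
Maximum Attempts: unlimited
Non-Retryable Errors: none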
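As an illustration of what those retry options mean in practice, the sketch below computes the wait before each retry. It is plain Python rather than Temporal SDK code, `retry_delays` is a hypothetical helper, and it assumes the documented defaults of a 1-second initial interval, a backoff coefficient of 2.0, and a maximum interval of 100 times the initial interval.

```python
from itertools import islice
from typing import Iterator, Optional


def retry_delays(
    initial_interval: float = 1.0,
    backoff_coefficient: float = 2.0,
    maximum_interval: Optional[float] = None,
) -> Iterator[float]:
    """Yield the wait (in seconds) before each successive retry, without end."""
    if maximum_interval is None:
        # Assumed default: cap at 100x the initial interval.
        maximum_interval = 100 * initial_interval
    delay = min(initial_interval, maximum_interval)
    while True:
        yield delay
        # Grow exponentially, but never beyond the cap.
        delay = min(delay * backoff_coefficient, maximum_interval)


# Waits before the first nine retries under the assumed defaults.
print(list(islice(retry_delays(), 9)))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 100.0, 100.0]
```

The generator never stops on its own, which mirrors an unlimited number of attempts; in a real setup the activity's overall ScheduleToClose timeout, if set, would bound the total time spent retrying.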

That’s great. Thank you!