Autoscaling nightmares with Temporal workers, suggestions needed

I’m running self-hosted Temporal on Kubernetes and exploring autoscaling strategies for workers.

So far, I’ve tried:

1. KEDA (backlog-based scaling)
It scales based on task queue backlog. The issue is that during spikes, it scales up aggressively even though the workload could actually be handled by fewer workers.

2. Slot-based scaling
I’m still trying to fully understand what each “slot” represents and how it translates to actual concurrent execution per worker.

3. Start-latency-based scaling
This doesn’t work well for my use case, since it requires waiting for latency to increase before scaling happens.

In all approaches, I’m facing the same issue:

  • Workers scale up.

  • Activities are distributed across all pods.

  • The autoscaler detects underutilization and scales down.

  • Pods are terminated while running activities.

  • Activities retry on new pods.

  • Those pods may also get terminated (they are mostly part of the pool that just got scaled up because of load).

  • Retry limits get exhausted.

  • Workflows eventually fail.

Is there a recommended pattern for graceful scale-down of Temporal workers on k8s?

Ideally, when a pod is about to terminate, it should:

  • Stop polling for new tasks.

  • Finish in-flight activities.

  • Exit cleanly.

I would prefer to use a preStop hook to trigger graceful shutdown and let Temporal drain and exit cleanly on scale-down, rather than letting Kubernetes kill the pod.
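On the Kubernetes side, whatever triggers the drain, the grace period has to outlast it, or the kubelet SIGKILLs the pod mid-drain. A hedged sketch (names illustrative); note that SIGTERM is delivered after the preStop hook completes, so a short sleep there just lets the pod drop out of rotation before the worker starts shutting down:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-worker   # illustrative name
spec:
  template:
    spec:
      # Must exceed the worker's graceful shutdown window plus a buffer,
      # otherwise in-flight activities are killed mid-drain.
      terminationGracePeriodSeconds: 300
      containers:
        - name: worker
          image: my-registry/temporal-worker:latest  # illustrative image
          lifecycle:
            preStop:
              exec:
                # Optional delay before SIGTERM so the pod leaves rotation first.
                command: ["sleep", "5"]
```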

How are others handling this? Or am I doing/understanding something wrong here?

Using the Temporal Python SDK here.