Our temporal-worker deployment is configured to run with 3 replicas.
When reviewing CPU usage I can see that 1 pod is running at 100% CPU while the other 2 are running at 5%.
When looking at the logs of the 3 pods, the one that is working hard is printing “deleted history garbage” all the time, while the other 2 are not printing anything at all.
worker_task_slots_available is around 800
What's your use case? It sounds like it might be a small number of workflow executions that are long-running and/or have a large number of updates, but please confirm.
The server tries to dispatch workflow tasks for an already-running execution to the same worker that has been processing it so far (the sticky task queue).
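For reference, here is a minimal Go SDK sketch of the two worker-side knobs that influence that stickiness; the task queue name and values are placeholders, not a recommendation:

```go
package workersetup

import (
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// newWorker shows the two worker-side settings that affect sticky execution.
func newWorker(c client.Client) worker.Worker {
	// The sticky workflow cache is per worker process: whichever pod has an
	// execution cached keeps receiving its workflow tasks, which is how load
	// can pile up on one pod. Must be set before any worker in the process starts.
	worker.SetStickyWorkflowCacheSize(2048) // illustrative value

	return worker.New(c, "your-task-queue", worker.Options{
		// How long the server waits for the sticky (cached) worker before
		// putting the task back on the normal queue so another pod can take it.
		StickyScheduleToStartTimeout: 5 * time.Second, // illustrative value
	})
}
```

Whether changing either of these makes sense depends on the use-case question above.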
Do you have SDK (worker) metrics configured? If you do, can you compare sticky_cache_size across the worker pods, as well as check GC times for each?
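If metrics aren't wired up yet, here is a rough sketch of one way to expose them with the Go SDK's tally Prometheus handler (following the pattern in the Go samples repo); the listen address is a placeholder, and the Go runtime collector is registered so GC pause times land on the same scrape endpoint:

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope exposes SDK worker metrics (sticky_cache_size,
// worker_task_slots_available, ...) on listenAddress, and registers the Go
// runtime collector so go_gc_duration_seconds is scraped from the same place.
func newPrometheusScope(listenAddress string) tally.Scope {
	registry := prom.NewRegistry()
	registry.MustRegister(collectors.NewGoCollector()) // GC pause times, heap stats

	cfg := prometheus.Configuration{
		ListenAddress: listenAddress,
		TimerType:     "histogram",
	}
	reporter, err := cfg.NewReporter(prometheus.ConfigurationOptions{
		Registry: registry,
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("unable to create prometheus reporter:", err)
	}

	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	// Workers created from this client report SDK metrics; scrape each pod
	// on :9090/metrics and compare the sticky_cache_size gauge across pods.
	c, err := client.Dial(client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope("0.0.0.0:9090")),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()
	// ... create and run your workers from c as usual ...
}
```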