I’m using WorkerTuner for one of the task queues in a worker that listens to two queues. This queue processes long-running activities. Over time, I observe that workers stop polling tasks from this queue, and they disappear from the server UI’s worker list for that queue.
To rule out WorkerTuner misconfiguration, I’ve set both targetMemoryUsage and targetCPUUsage in ResourceBasedControllerOptions to 1.0 (per the documentation, this should disable the CPU/RAM-based scaling influence). At the same time, minimumSlots in ResourceBasedSlotOptions is set to 2 and maximumSlots to 16.
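For reference, my tuner setup looks roughly like the sketch below. The class names are from the `io.temporal.worker.tuning` package; the exact builder method names may differ slightly depending on your SDK version, so treat this as an approximation of my configuration rather than a verbatim copy:

```java
import io.temporal.worker.WorkerOptions;
import io.temporal.worker.tuning.ResourceBasedControllerOptions;
import io.temporal.worker.tuning.ResourceBasedSlotOptions;
import io.temporal.worker.tuning.ResourceBasedTuner;

// Target utilization of 1.0 for both memory and CPU, which per the docs
// should effectively disable resource-based throttling.
ResourceBasedControllerOptions controllerOptions =
    ResourceBasedControllerOptions.newBuilder(1.0, 1.0).build();

// Activity slots bounded between 2 and 16.
ResourceBasedSlotOptions activitySlotOptions =
    ResourceBasedSlotOptions.newBuilder()
        .setMinimumSlots(2)
        .setMaximumSlots(16)
        .build();

WorkerOptions workerOptions =
    WorkerOptions.newBuilder()
        .setWorkerTuner(
            ResourceBasedTuner.newBuilder()
                .setControllerOptions(controllerOptions)
                .setActivitySlotOptions(activitySlotOptions)
                .build())
        .build();
```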
The other queue processed by the same worker application (but without WorkerTuner, using the same WorkflowClient) does not exhibit this problem: that worker keeps polling tasks from its queue even while the WorkerTuner-enabled worker is stuck.
In the worker metrics, when polling stops, I see temporal_worker_task_slots_used for the affected queue gradually drop to 0 and never rise again, which tells me the activity code itself is not hanging. Moreover, during this period (while the slot count stays at 0), the server allows already-running activities from this worker to complete; they send heartbeats and finish successfully. The worker still does not pick up new tasks, though.
Unfortunately, I haven’t found other useful metrics when using WorkerTuner (e.g., temporal_worker_task_slots_available is not exposed for such queues), and I’m not sure where else to look to understand what’s happening.
I turned on TRACE logging for the entire io.temporal package; the last log line related to this queue when the worker stops polling new tasks from it is the following: