HELP - Workers not polling for tasks. Task queue slots keep decreasing in workers until 0

We are facing the same issue discussed in this topic: Workers not polling for tasks

We are trying to identify the root cause of the issue in our case, but we haven’t found it yet.

We are using Python SDK 1.18.1 with all the default values provided by the SDK here: https://github.com/temporalio/sdk-python/blob/1.18.0/temporalio/worker/_worker.py
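For reference, the slot and poller limits can be pinned explicitly when constructing the worker, which makes it easier to compare a stuck worker against known limits. A minimal sketch, assuming the parameter names and default values from recent 1.x releases of `temporalio` (verify the exact values against the `_worker.py` source linked above; the queue name and activity are placeholders):

```python
from temporalio import activity
from temporalio.client import Client
from temporalio.worker import Worker


# Placeholder activity so the worker has something registered.
@activity.defn
async def ping() -> str:
    return "pong"


async def run_worker() -> None:
    # Address is a placeholder for your Temporal frontend.
    client = await Client.connect("localhost:7233")

    worker = Worker(
        client,
        task_queue="my-task-queue",  # hypothetical queue name
        activities=[ping],
        # Pinning what the SDK uses by default (assumed values):
        max_concurrent_workflow_tasks=100,
        max_concurrent_activities=100,
        max_concurrent_workflow_task_polls=5,
        max_concurrent_activity_task_polls=5,
    )
    await worker.run()
```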

“The number of task slots available keeps reducing continuously without any increase in load, and the number of pollers on the task queue is also lower than the number of pods running. When we restart the k8s deployment where the workers are running, the slots become available again.” We can’t see all the workers running in k8s, and after a while they start to disappear. After a couple of hours all the workers (around 40) have disappeared and we need to manually restart all the pods to bring them back online. The pods are bigger than usual: 6 GB of memory and 4 CPUs each.

Can you offer any guidance to help identify the root cause? We don’t have blocking code in our activities or workflows.

Typically a description like the one you gave, especially along with

When we restart the k8s deployment where workers are running, the slots become available.

indicates that something in the activity code is blocking (most of the time this turns out to be related to a connection or waiting for a response from a downstream service), which causes activity timeouts and retries.
It can also be high worker CPU that slows down compute, again leading to activity timeouts.
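One common way activity code "blocks" in the Python SDK without any obviously blocking logic: calling a synchronous API (a blocking HTTP client, a blocking DB driver) inside an `async def` activity. That freezes the worker's entire event loop, so every slot on that worker stalls at once. A small self-contained illustration using plain asyncio (no Temporal dependency; the "activity"/"heartbeat" names are just for the demo):

```python
import asyncio
import time


async def blocking_style(results):
    # Simulates a blocking call (e.g. a sync HTTP request) inside an
    # async activity: it freezes the event loop for its whole duration.
    time.sleep(0.2)
    results.append("blocking done")


async def nonblocking_style(results):
    # Simulates a properly awaited call: it yields control while waiting.
    await asyncio.sleep(0.2)
    results.append("nonblocking done")


async def heartbeat(results):
    # Stands in for any other task on the same loop (other activities,
    # poller I/O, heartbeats) that needs the loop to make progress.
    results.append("heartbeat")


def run(style):
    results = []

    async def main():
        task = asyncio.create_task(style(results))
        await asyncio.sleep(0.05)  # the other task wants to run soon
        await heartbeat(results)
        await task

    asyncio.run(main())
    return results


if __name__ == "__main__":
    print(run(blocking_style))     # blocking call starves the other task
    print(run(nonblocking_style))  # awaiting lets the other task run first
```

With the blocking variant the heartbeat cannot run until the sleep finishes; with the awaited variant it runs on schedule. For genuinely blocking library calls, a sync (`def`) activity with a thread/process executor, or `asyncio.to_thread`, avoids starving the loop.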

Do you have worker (SDK) metrics? Do you have resource utilization metrics for your worker pods?
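If SDK metrics aren't wired up yet, the Python SDK can expose them on a Prometheus endpoint via a custom runtime passed to the client. A hedged configuration sketch (the bind address is a placeholder, and the metric names mentioned in the comment should be verified against your SDK version):

```python
from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig


async def connect_with_metrics() -> Client:
    # Expose SDK metrics at http://<pod>:9464/metrics for Prometheus to scrape.
    runtime = Runtime(
        telemetry=TelemetryConfig(
            metrics=PrometheusConfig(bind_address="0.0.0.0:9464")
        )
    )
    # Workers created from this client share the runtime and report
    # metrics such as temporal_worker_task_slots_available and
    # temporal_num_pollers, which should show the decay you describe.
    return await Client.connect("localhost:7233", runtime=runtime)
```

Watching the slot and poller metrics per pod, alongside pod CPU/memory, usually narrows down whether slots are being held by long-running/stuck activities or the worker process itself is starved.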