Workers not polling for tasks

We are using Go SDK version v1.25.1.
The number of available task slots keeps decreasing continuously without any increase in load, and the number of pollers on the task queue is also lower than the number of pods running.


When we restart the k8s deployment where the workers are running, the slots become available again.
Is there a way to find out why task pollers become unavailable even when the pods are healthy? Is there a way to monitor this internally from the worker pod and restart the pod when it is not polling the task queue?

Is there a way to find out why task pollers become unavailable even when the pods are healthy?

Your worker can only poll for tasks when it has capacity (available activity task slots). So if something in your activity code blocks (such as calls to downstream APIs) and the activity does not release its task slot, the slots can fill up and the worker will stop polling for more activity tasks until at least one activity completes.

So once the available task slots reach 0, your worker stops polling for activity tasks.
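For context, the activity task slot count is just the worker's concurrency limit. A minimal sketch of where it is configured in the Go SDK; the host/port, task queue name, and limit of 100 are placeholders, not values from this thread:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Placeholder connection settings.
	c, err := client.Dial(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// MaxConcurrentActivityExecutionSize is the pool of activity task slots
	// that temporal_worker_task_slots_available reports on. Every blocked
	// activity holds one slot until it completes, fails, or times out.
	w := worker.New(c, "example-task-queue", worker.Options{
		MaxConcurrentActivityExecutionSize: 100, // placeholder limit
	})

	// Register workflows/activities here, then run the worker.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```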

Is there a way to monitor this internally

I think you should debug why your activity code is blocking. If you create connections to any downstream services or databases, consider adding a timeout on those connections that is shorter than the activity's StartToClose timeout. If the connection then times out, fail the activity so it can be retried.
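A minimal sketch of that idea, assuming an HTTP downstream and a 10-second cap (both placeholders), with the activity's StartToCloseTimeout set to something longer (e.g. 30 seconds):

```go
package activities

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// CallDownstream bounds the downstream call so it always finishes (or fails)
// well before the activity's StartToClose timeout, releasing the task slot.
func CallDownstream(ctx context.Context, url string) error {
	// Placeholder: 10s cap on the downstream call, assuming a
	// StartToCloseTimeout of e.g. 30s on the activity options.
	callCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(callCtx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Returning the error fails this attempt; the server retries the
		// activity per its retry policy instead of the call hanging forever.
		return fmt.Errorf("downstream call failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 400 {
		return fmt.Errorf("downstream returned status %d", resp.StatusCode)
	}
	return nil
}
```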

Is there a way to monitor this internally

Server metrics can help you view activity StartToClose timeouts at the namespace level:

sum(rate(start_to_close_timeout{operation="TimerActiveTaskActivityTimeout"}[5m])) by(namespace,operation)

You should also alert on the SDK worker metric temporal_worker_task_slots_available dropping to zero.
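For completeness, a sketch of one way to export the Go SDK worker metrics (including temporal_worker_task_slots_available) to Prometheus via the tally contrib package; the listen address, report interval, and host/port below are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/uber-go/tally/v4"
	prom "github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope exposes SDK metrics on a /metrics endpoint so
// temporal_worker_task_slots_available can be scraped and alerted on.
func newPrometheusScope() tally.Scope {
	cfg := prom.Configuration{
		ListenAddress: "0.0.0.0:9090", // placeholder scrape address
		TimerType:     "histogram",
	}
	reporter, err := cfg.NewReporter(prom.ConfigurationOptions{
		OnError: func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("error creating prometheus reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prom.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	c, err := client.Dial(client.Options{
		HostPort:       "localhost:7233", // placeholder
		MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope()),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()
	// ... create and run workers with this client as usual.
}
```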

Thanks @tihomir, it makes sense that the internal DB calls might be blocking and causing this. However, when an activity blocks, can it still be marked as completed, which is what we see on the Temporal dashboard?
Basically, we see fewer workflows in the running state than the total number of task slots consumed.

However, when an activity blocks, can it still be marked as completed, which is what we see on the Temporal dashboard?

That would depend on your activity retry policy. An activity timeout would cause a retry. Can you share your activity options?
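For reference, activity options in the Go SDK look roughly like this; every value below is a placeholder rather than the poster's actual configuration, and "CallDownstream" is just the hypothetical activity from the earlier sketch:

```go
package workflows

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func ExampleWorkflow(ctx workflow.Context) error {
	// Placeholder activity options; an activity that exceeds
	// StartToCloseTimeout (or misses heartbeats) is timed out by the
	// server and retried per the RetryPolicy.
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		HeartbeatTimeout:    10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	return workflow.ExecuteActivity(ctx, "CallDownstream", "https://example.com").Get(ctx, nil)
}
```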

Basically, we see fewer workflows in the running state than the total number of task slots consumed.

If you are talking about your activity task slots as in the initial question, then I think there is no 1:1 relationship here. Your activities that keep timing out without releasing their task slots are being retried, causing more activity tasks to get stuck, and you are seeing the executor slot numbers deplete.

The reason we found was: the activity being run here was a long-running activity, and at a certain step, due to a race condition, it was getting stuck on a blocking fs call. Since the activity heartbeat timed out, it got rescheduled on another pod, where it completed successfully, but the original pod where the race condition happened never released its task slot. Over time the task slots kept decreasing, because this race condition was not rare and happened frequently.
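One pattern that keeps a blocking call like that from pinning a task slot forever is to run it in a goroutine and give up when the activity context is cancelled (which the SDK does once heartbeating fails after the timeout). A sketch, with os.ReadFile standing in for the real blocking fs call and a placeholder heartbeat interval:

```go
package activities

import (
	"context"
	"os"
	"time"

	"go.temporal.io/sdk/activity"
)

// ProcessFile heartbeats while a potentially blocking fs call runs in a
// goroutine. If the server cancels the activity (e.g. after a heartbeat
// timeout), the function returns and the task slot is released, even though
// the stuck goroutine may linger in the background.
func ProcessFile(ctx context.Context, path string) error {
	done := make(chan error, 1)
	go func() {
		// Stand-in for the blocking fs call that hit the race condition.
		_, err := os.ReadFile(path)
		done <- err
	}()

	heartbeat := time.NewTicker(5 * time.Second) // placeholder interval
	defer heartbeat.Stop()

	for {
		select {
		case err := <-done:
			return err
		case <-heartbeat.C:
			activity.RecordHeartbeat(ctx, "still processing "+path)
		case <-ctx.Done():
			// Heartbeat timeout / cancellation: give the slot back instead of
			// blocking forever on this pod; the retry runs elsewhere.
			return ctx.Err()
		}
	}
}
```

The stuck goroutine may still leak until the fs call returns, but the worker's task slot is freed, so polling continues.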
