Workers not polling for tasks

We are using Go SDK version v1.25.1.
The number of available task slots keeps decreasing continuously without any increase in load, and the number of pollers on the task queue is also lower than the number of pods running.


When we restart the k8s deployment where the workers are running, the slots become available again.
Is there a way to find out why task pollers become unavailable even when the pods are healthy? Is there a way to monitor this internally from the worker pod and restart the pod when it is not polling the task queue?

Is there a way to find out why task pollers become unavailable even when the pods are healthy?

Your worker can only poll for tasks when it has capacity (available activity task slots). So if something in your activity code blocks (such as calls to downstream APIs) and the activity does not release its task slot, the slots can fill up and the worker will stop polling for more activity tasks until at least one activity completes.

So once the available task slots reach 0, your worker stops polling for activity tasks.
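For context, the activity task slot count is just the worker's concurrency limit. A minimal sketch of where it is configured in the Go SDK; the host/port, task queue name, and limit of 100 are placeholders, not values from this thread:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Placeholder connection settings.
	c, err := client.Dial(client.Options{HostPort: "localhost:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// MaxConcurrentActivityExecutionSize is the pool of activity task slots
	// that temporal_worker_task_slots_available reports on. Every blocked
	// activity holds one slot until it completes, fails, or times out.
	w := worker.New(c, "example-task-queue", worker.Options{
		MaxConcurrentActivityExecutionSize: 100, // placeholder limit
	})

	// Register workflows/activities here, then run the worker.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```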

Is there a way to monitor this internally

I think you should debug why your activity code is blocking. If you create connections to any downstream services or databases, consider adding a timeout on those connections that is shorter than the activity's StartToClose timeout. If the connection then times out, fail the activity so it can be retried.
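A minimal sketch of that idea, assuming an HTTP downstream and a 10-second cap (both placeholders), with the activity's StartToCloseTimeout set to something longer (e.g. 30 seconds):

```go
package activities

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// CallDownstream bounds the downstream call so it always finishes (or fails)
// well before the activity's StartToClose timeout, releasing the task slot.
func CallDownstream(ctx context.Context, url string) error {
	// Placeholder: 10s cap on the downstream call, assuming a
	// StartToCloseTimeout of e.g. 30s on the activity options.
	callCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(callCtx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Returning the error fails this attempt; the server retries the
		// activity per its retry policy instead of the call hanging forever.
		return fmt.Errorf("downstream call failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 400 {
		return fmt.Errorf("downstream returned status %d", resp.StatusCode)
	}
	return nil
}
```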

Is there a way to monitor this internally

Server metrics can help you view activity StartToClose timeouts at the namespace level:

sum(rate(start_to_close_timeout{operation="TimerActiveTaskActivityTimeout"}[5m])) by(namespace,operation)

You should also alert on the SDK worker metric temporal_worker_task_slots_available dropping to zero.
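For completeness, a sketch of one way to export the Go SDK worker metrics (including temporal_worker_task_slots_available) to Prometheus via the tally contrib package; the listen address, report interval, and host/port below are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/uber-go/tally/v4"
	prom "github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

// newPrometheusScope exposes SDK metrics on a /metrics endpoint so
// temporal_worker_task_slots_available can be scraped and alerted on.
func newPrometheusScope() tally.Scope {
	cfg := prom.Configuration{
		ListenAddress: "0.0.0.0:9090", // placeholder scrape address
		TimerType:     "histogram",
	}
	reporter, err := cfg.NewReporter(prom.ConfigurationOptions{
		OnError: func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("error creating prometheus reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prom.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)
	return sdktally.NewPrometheusNamingScope(scope)
}

func main() {
	c, err := client.Dial(client.Options{
		HostPort:       "localhost:7233", // placeholder
		MetricsHandler: sdktally.NewMetricsHandler(newPrometheusScope()),
	})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()
	// ... create and run workers with this client as usual.
}
```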

Thanks @tihomir, it makes sense that the internal DB calls might be blocking and causing this. However, when an activity blocks, can it still be marked as completed, which is what we see on the Temporal dashboard?
Basically, we see fewer workflows in the running state than the total number of task slots consumed.

However, when an activity blocks, can it still be marked as completed, which is what we see on the Temporal dashboard?

That would depend on your activity retry policy. An activity timeout would cause a retry. Can you share your activity options?
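For reference, activity options in the Go SDK look roughly like this; every value below is a placeholder rather than the poster's actual configuration, and "CallDownstream" is just the hypothetical activity from the earlier sketch:

```go
package workflows

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func ExampleWorkflow(ctx workflow.Context) error {
	// Placeholder activity options; an activity that exceeds
	// StartToCloseTimeout (or misses heartbeats) is timed out by the
	// server and retried per the RetryPolicy.
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		HeartbeatTimeout:    10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5,
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	return workflow.ExecuteActivity(ctx, "CallDownstream", "https://example.com").Get(ctx, nil)
}
```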

Basically, we see fewer workflows in the running state than the total number of task slots consumed.

If you are talking about your activity task slots as in the initial question, then I think there is no 1:1 relationship here. Your activities that keep timing out without releasing their task slots are being retried, causing more activity tasks to get stuck, and you are seeing the executor slot numbers deplete.

The reason we found was: the activity being run here was a long-running activity, and at a certain step, due to a race condition, it was getting stuck on a blocking fs call. Since the activity heartbeat timed out, it got rescheduled on another pod, where it completed successfully, but the original pod where the race condition happened never released its task slot. Over time the task slots kept decreasing, because this race condition was not rare and happened frequently.
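One pattern that keeps a blocking call like that from pinning a task slot forever is to run it in a goroutine and give up when the activity context is cancelled (which the SDK does once heartbeating fails after the timeout). A sketch, with os.ReadFile standing in for the real blocking fs call and a placeholder heartbeat interval:

```go
package activities

import (
	"context"
	"os"
	"time"

	"go.temporal.io/sdk/activity"
)

// ProcessFile heartbeats while a potentially blocking fs call runs in a
// goroutine. If the server cancels the activity (e.g. after a heartbeat
// timeout), the function returns and the task slot is released, even though
// the stuck goroutine may linger in the background.
func ProcessFile(ctx context.Context, path string) error {
	done := make(chan error, 1)
	go func() {
		// Stand-in for the blocking fs call that hit the race condition.
		_, err := os.ReadFile(path)
		done <- err
	}()

	heartbeat := time.NewTicker(5 * time.Second) // placeholder interval
	defer heartbeat.Stop()

	for {
		select {
		case err := <-done:
			return err
		case <-heartbeat.C:
			activity.RecordHeartbeat(ctx, "still processing "+path)
		case <-ctx.Done():
			// Heartbeat timeout / cancellation: give the slot back instead of
			// blocking forever on this pod; the retry runs elsewhere.
			return ctx.Err()
		}
	}
}
```

The stuck goroutine may still leak until the fs call returns, but the worker's task slot is freed, so polling continues.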
