We have configured the SDK metrics. Could you please tell us which metric and what criteria we can use to check whether the worker is running and ready to serve requests?
There's not a single metric, because "running and ready to serve requests" is not a single state for worker polling. But specific to request failures, you can use the temporal_long_request_failure metric to see a count of client errors that occur when the worker tries to poll for work. We also log these errors (some only after several retry attempts).
Please suggest what metrics we should use to declare workers as healthy and ready to serve.
Our main objective is to check whether the worker is ready to serve requests and, if it is not, to mark the microservice as unhealthy.
When the worker hangs, we cannot detect it: the service reports healthy, but the worker is not accepting any requests.
If worker.run() is running, the worker is "ready", but it may not be serving requests because it has hit its max concurrent limits. If temporal_long_request_failure is increasing, it may not be able to communicate with the server. You can also check the logs for errors. We have opened an issue for a more detailed programmatic worker status.
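For reference, here is a minimal sketch (assuming the Python SDK, temporalio) of exposing these SDK metrics on a Prometheus scrape endpoint alongside worker.run(); the bind address, server address, task queue, and the HealthProbeWorkflow are placeholders, not part of the original setup:

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig
from temporalio.worker import Worker


@workflow.defn
class HealthProbeWorkflow:
    """Placeholder workflow so the worker has something registered."""

    @workflow.run
    async def run(self) -> str:
        return "ok"


async def main() -> None:
    # Expose SDK metrics (temporal_long_request_failure, temporal_worker_slots_available, ...)
    # on a local Prometheus endpoint so they can be scraped and alerted on.
    runtime = Runtime(
        telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9090"))
    )
    client = await Client.connect("localhost:7233", runtime=runtime)

    worker = Worker(client, task_queue="my-task-queue", workflows=[HealthProbeWorkflow])

    # While this coroutine is running and has not raised, the worker is "ready",
    # but it may still fail to get work (connectivity errors or no free slots).
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```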
As we know, the default polling interval is 1 minute. Can we check whether temporal_long_request_failure has increased by 5 in the last 5 minutes, and treat that as meaning the worker is not able to serve?
That's the default timeout for successful polls that don't receive any work; failures can occur sooner than that. If the request failure metric increases at all, there is a problem at that moment (and if it stops increasing for, say, a minute, and/or the long request success count increases, and worker.run() has not raised an error, then the worker has likely regained connectivity).
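To illustrate that rule, here is a hedged sketch of a poller check that scrapes the metrics endpoint from the sketch above and compares counter values between two samples. The endpoint URL is an assumption, the success counter is assumed to be named temporal_long_request, and label handling is deliberately simplified:

```python
import time
import urllib.request

METRICS_URL = "http://localhost:9090/metrics"  # assumed endpoint from the sketch above


def read_counter(name: str) -> float:
    """Sum all samples of a metric by name, ignoring labels.

    Depending on exporter conventions the counter may be exported with a
    "_total" suffix, so both spellings are accepted.
    """
    total = 0.0
    with urllib.request.urlopen(METRICS_URL) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("#") or not line.strip():
                continue
            metric = line.split("{", 1)[0].split(" ", 1)[0]
            if metric in (name, name + "_total"):
                try:
                    total += float(line.rsplit(" ", 1)[-1])
                except ValueError:
                    pass
    return total


def poller_looks_healthy(window_seconds: float = 60.0) -> bool:
    """Any increase in long-poll failures during the window is treated as a problem."""
    failures_before = read_counter("temporal_long_request_failure")
    successes_before = read_counter("temporal_long_request")
    time.sleep(window_seconds)
    failures_after = read_counter("temporal_long_request_failure")
    successes_after = read_counter("temporal_long_request")

    if failures_after > failures_before:
        return False  # polls are failing right now
    # No new failures in the window and successful polls still ticking over.
    return successes_after >= successes_before
```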
Do you mean that even with temporal_long_request_failure we can't determine whether the Temporal worker can serve requests or not?
What is the solution then?
You can use that metric to determine whether the worker cannot get work because its requests are failing. Usually the only other reason a worker may not get work is that it is "full", i.e. there are no slots available because it has reached its max concurrent limits. You can use temporal_worker_slots_available to see whether it reaches 0, which means you have run out of available slots and the worker will stop asking for work. See the worker performance guide.
Those are basically the two ways a worker stops taking work (it can't reach the server, or it has reached its limits).
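To make the "full" case concrete, here is a hedged sketch that sets explicit concurrency limits on the worker and checks the slots gauge with the read_counter helper from the earlier sketch; the limit values are illustrative, and summing the gauge across slot types is a simplification:

```python
from temporalio.worker import Worker


def build_worker(client, my_workflows, my_activities) -> Worker:
    # When every slot is in use, temporal_worker_slots_available drops to 0
    # and the worker stops polling for new work until a slot frees up.
    return Worker(
        client,
        task_queue="my-task-queue",          # placeholder task queue
        workflows=my_workflows,
        activities=my_activities,
        max_concurrent_workflow_tasks=100,   # illustrative values; tune them per
        max_concurrent_activities=100,       # the worker performance guide
    )


def worker_is_full() -> bool:
    # Reuses read_counter() from the earlier sketch; temporal_worker_slots_available
    # is a gauge per slot type, so a summed value of 0 means every type is exhausted.
    return read_counter("temporal_worker_slots_available") == 0
```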
I am talking about this bug: [Bug] Worker.finalize_shutdown seems to hang when poll never succeeded due to server permission failure · Issue #667 · temporalio/sdk-core · GitHub. If this occurs again, how can I tell that the microservice is down? What parameters should I check?
Yes, temporal_long_request_failure will increase when polls fail due to a permission failure like the one in that issue (that bug specifically concerns how shutdown is handled in the face of repeated failures of that kind, which is unrelated to how the failures are reported; reporting happens via the metric and in logs after enough retries).
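Putting that together for the scenario in that issue, here is a hedged sketch of an external readiness signal: keep worker.run() in a background task and combine it with the failure-counter check from the earlier sketch. mark_healthy/mark_unhealthy are placeholders for your service's own health reporting, not a Temporal API:

```python
import asyncio


async def run_with_readiness(worker) -> None:
    # The SDK has no programmatic worker-status API yet, so readiness is tracked
    # from the outside: the run() task itself plus the long-poll failure counter.
    # mark_unhealthy / mark_healthy: your service's health reporting hooks (placeholders).
    worker_task = asyncio.create_task(worker.run())
    last_failures = read_counter("temporal_long_request_failure")

    while True:
        await asyncio.sleep(60)

        if worker_task.done():
            # worker.run() returned or raised: the worker is definitely not serving.
            mark_unhealthy(f"worker stopped: {worker_task.exception()}")
            return

        failures = read_counter("temporal_long_request_failure")
        if failures > last_failures:
            # Polls failed in the last minute (e.g. the permission errors in issue #667).
            mark_unhealthy("long poll requests are failing")
        else:
            mark_healthy()
        last_failures = failures
```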