Checking whether the worker is running

We have configured the SDK metrics. Could you please tell us which metric and what criteria we can use to check whether the worker is running and ready to serve requests?

There’s not a single metric, because “running and ready to serve requests” is not a single state for worker polling. But specific to request failures, you can use the temporal_long_request_failure metric to see a count of client errors that occur while the worker is trying to poll for work. We also log these errors (some only after several attempts).
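For reference, here is a minimal sketch (Python SDK assumed; the metrics bind address and server address are illustrative, not prescribed) of how the SDK metrics are typically exposed to Prometheus so counters such as temporal_long_request_failure can be scraped:

```python
import asyncio

from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig


async def main() -> None:
    # Create a runtime that exports SDK metrics at http://<host>:9000/metrics.
    runtime = Runtime(
        telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9000"))
    )
    # Any client (and the workers built from it) created with this runtime
    # reports its metrics through that endpoint.
    client = await Client.connect("localhost:7233", runtime=runtime)
    ...


if __name__ == "__main__":
    asyncio.run(main())
```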

Please suggest what metrics we should use to declare workers as healthy and ready to serve.

Our main objective is to check whether the worker is ready to serve requests and, if it is not, to mark the microservice as unhealthy.
When the worker hangs, we are unable to detect it: the service reports healthy, but the worker is not accepting any work.

If worker.run() is running, the worker is “ready”, but it may still not serve requests if it has hit its max concurrent limits. If temporal_long_request_failure is increasing, the worker may be unable to communicate with the server. You can also check the logs for errors. We have opened an issue to provide a more detailed programmatic worker status.
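As a rough sketch of that first signal (Python SDK assumed; the task queue name and PlaceholderWorkflow below are placeholders), you can treat “worker.run() has not returned or raised” as the basic running check:

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class PlaceholderWorkflow:
    # Placeholder so the sketch is self-contained; register your real workflows.
    @workflow.run
    async def run(self) -> None:
        return None


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(client, task_queue="my-task-queue", workflows=[PlaceholderWorkflow])

    run_task = asyncio.create_task(worker.run())

    def worker_is_running() -> bool:
        # True while worker.run() is still executing; False once it has
        # returned or raised (e.g. a fatal error during polling or shutdown).
        return not run_task.done()

    # Expose worker_is_running() from your service's health endpoint and
    # combine it with the metric checks discussed in this thread.
    await run_task


if __name__ == "__main__":
    asyncio.run(main())
```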

As we understand it, the default polling interval is 1 minute. Can we check whether temporal_long_request_failure has increased by 5 in the last 5 minutes? Would that mean the worker is not able to serve?

That’s the default timeout for successful polls that don’t receive work; failures can come back sooner than that. If the request-failure metric increases at all, there is a problem at that time (and if it stops increasing for, say, a minute, and/or long-request successes increase, and worker.run() has not raised an error, then the worker may have regained connectivity).
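One way to act on that (a sketch only, assuming the Prometheus endpoint configured in the earlier example; the metric name may carry a suffix such as _total depending on your exporter settings) is to flag the worker as soon as the counter grows between two checks:

```python
import urllib.request
from typing import Optional

METRICS_URL = "http://localhost:9000/metrics"  # assumption: matches your PrometheusConfig
FAILURE_METRIC = "temporal_long_request_failure"

_last_value: Optional[float] = None


def read_failure_count() -> float:
    """Sum all samples of the failure counter exposed on the metrics page."""
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    total = 0.0
    for line in body.splitlines():
        if line.startswith(FAILURE_METRIC):
            # Prometheus text format: "<name>{labels} <value>" (no timestamp assumed)
            total += float(line.rsplit(" ", 1)[1])
    return total


def failures_increased() -> bool:
    """True if the counter grew since the previous call, i.e. polls are failing right now."""
    global _last_value
    current = read_failure_count()
    increased = _last_value is not None and current > _last_value
    _last_value = current
    return increased
```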

Do you mean that even with temporal_long_request_failure we can’t determine whether the Temporal worker can serve requests or not?

What is the solution then?

You can use that metric to determine whether the worker cannot get work because its requests are failing. Usually the only other reason a worker may not get work is that it is “full”, i.e. there are no slots available because it has reached its max concurrent limits. You can use temporal_worker_slots_available to see whether it reaches 0, which means you have run out of available slots and the worker will stop asking for work. See the worker performance guide.

Those are basically the two ways a worker stops taking work (it can’t reach the server, or it has reached its limits).
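As an illustration of the “full” case (Python SDK assumed; the task queue name, placeholder workflow, and limit values are illustrative), the available-slot counts correspond to the worker’s concurrency options:

```python
from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker


@workflow.defn
class PlaceholderWorkflow:
    # Placeholder so the sketch is self-contained; register your real workflows.
    @workflow.run
    async def run(self) -> None:
        return None


async def build_worker(client: Client) -> Worker:
    # temporal_worker_slots_available reflects these limits: when it reaches 0,
    # the worker has no free slots and stops polling for that kind of work.
    return Worker(
        client,
        task_queue="my-task-queue",          # placeholder task queue name
        workflows=[PlaceholderWorkflow],
        max_concurrent_workflow_tasks=100,   # workflow task slots (illustrative)
        max_concurrent_activities=100,       # activity slots (illustrative)
    )
```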

I am talking about this bug: [Bug] Worker.finalize_shutdown seems to hang when poll never succeeded due to server permission failure · Issue #667 · temporalio/sdk-core · GitHub. If this occurs again, how can I tell that the microservice is down? What parameters should I check?

Yes, temporal_long_request_failure will increase when polls fail due to a permission failure like the one in that issue (that bug is specifically about how shutdown is handled in the face of repeated failures of this kind; it is unrelated to the reporting of the failures, which is done via the metric and in logs after enough retries).