We have configured the SDK metrics. Could you please tell us which metric and what criteria we can use to check whether the worker is running and ready to serve requests?
There's not a single metric, because "running and ready to serve requests" is not a single state for worker polling. But specific to request failures, you can use the temporal_long_request_failure metric to see a count of client errors that occur when the worker tries to poll for work. We also log these errors (some only after several retry attempts).
Please suggest what metrics we should use to declare workers as healthy and ready to serve.
Our main objective is to check whether the worker is ready to serve requests and, if it is not, to mark the microservice as unhealthy.
When the worker hangs, we cannot detect it: the service reports healthy, but the worker is not accepting any requests.
If worker.run() is running, the worker is "ready", but it may not be serving requests because it has hit its max concurrent limits. If temporal_long_request_failure is increasing, it may not be able to communicate with the server. You can also check the logs for errors. We have opened an issue for a more detailed programmatic worker status.
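For reference, here is a minimal sketch (assuming the Python SDK, temporalio) of exposing these SDK metrics on a Prometheus scrape endpoint alongside worker.run(); the bind address, server address, task queue, and the HealthProbeWorkflow are placeholders, not part of the original setup:

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig
from temporalio.worker import Worker


@workflow.defn
class HealthProbeWorkflow:
    """Placeholder workflow so the worker has something registered."""

    @workflow.run
    async def run(self) -> str:
        return "ok"


async def main() -> None:
    # Expose SDK metrics (temporal_long_request_failure, temporal_worker_slots_available, ...)
    # on a local Prometheus endpoint so they can be scraped and alerted on.
    runtime = Runtime(
        telemetry=TelemetryConfig(metrics=PrometheusConfig(bind_address="0.0.0.0:9090"))
    )
    client = await Client.connect("localhost:7233", runtime=runtime)

    worker = Worker(client, task_queue="my-task-queue", workflows=[HealthProbeWorkflow])

    # While this coroutine is running and has not raised, the worker is "ready",
    # but it may still fail to get work (connectivity errors or no free slots).
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```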
As we know, the default polling interval is 1 minute. Can we check whether temporal_long_request_failure has increased by 5 in the last 5 minutes, and treat that as meaning the worker is not able to serve?
That's the default timeout for successful polls that don't receive any work; failures can occur sooner than that. If the request failure metric increases at all, there is a problem at that moment (and if it stops increasing for, say, a minute, and/or the long request success count increases, and worker.run() has not raised an error, then the worker has likely regained connectivity).
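To illustrate that rule, here is a hedged sketch of a poller check that scrapes the metrics endpoint from the sketch above and compares counter values between two samples. The endpoint URL is an assumption, the success counter is assumed to be named temporal_long_request, and label handling is deliberately simplified:

```python
import time
import urllib.request

METRICS_URL = "http://localhost:9090/metrics"  # assumed endpoint from the sketch above


def read_counter(name: str) -> float:
    """Sum all samples of a metric by name, ignoring labels.

    Depending on exporter conventions the counter may be exported with a
    "_total" suffix, so both spellings are accepted.
    """
    total = 0.0
    with urllib.request.urlopen(METRICS_URL) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("#") or not line.strip():
                continue
            metric = line.split("{", 1)[0].split(" ", 1)[0]
            if metric in (name, name + "_total"):
                try:
                    total += float(line.rsplit(" ", 1)[-1])
                except ValueError:
                    pass
    return total


def poller_looks_healthy(window_seconds: float = 60.0) -> bool:
    """Any increase in long-poll failures during the window is treated as a problem."""
    failures_before = read_counter("temporal_long_request_failure")
    successes_before = read_counter("temporal_long_request")
    time.sleep(window_seconds)
    failures_after = read_counter("temporal_long_request_failure")
    successes_after = read_counter("temporal_long_request")

    if failures_after > failures_before:
        return False  # polls are failing right now
    # No new failures in the window and successful polls still ticking over.
    return successes_after >= successes_before
```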
Do you mean that even with temporal_long_request_failure we can't determine whether the Temporal worker can serve requests or not?
What is the solution then?
You can use that metric to determine whether the worker cannot get work because its requests are failing. Usually the only other reason a worker may not get work is that it is "full", i.e. there are no slots available because it has reached its max concurrent limits. You can use temporal_worker_slots_available to see whether it reaches 0, which means you have run out of available slots and the worker will stop asking for work. See the worker performance guide.
Those are basically the two ways a worker stops taking work (it can't reach the server, or it has reached its limits).
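To make the "full" case concrete, here is a hedged sketch that sets explicit concurrency limits on the worker and checks the slots gauge with the read_counter helper from the earlier sketch; the limit values are illustrative, and summing the gauge across slot types is a simplification:

```python
from temporalio.worker import Worker


def build_worker(client, my_workflows, my_activities) -> Worker:
    # When every slot is in use, temporal_worker_slots_available drops to 0
    # and the worker stops polling for new work until a slot frees up.
    return Worker(
        client,
        task_queue="my-task-queue",          # placeholder task queue
        workflows=my_workflows,
        activities=my_activities,
        max_concurrent_workflow_tasks=100,   # illustrative values; tune them per
        max_concurrent_activities=100,       # the worker performance guide
    )


def worker_is_full() -> bool:
    # Reuses read_counter() from the earlier sketch; temporal_worker_slots_available
    # is a gauge per slot type, so a summed value of 0 means every type is exhausted.
    return read_counter("temporal_worker_slots_available") == 0
```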
I am talking about this bug: [Bug] Worker.finalize_shutdown seems to hang when poll never succeeded due to server permission failure · Issue #667 · temporalio/sdk-core · GitHub. If this occurs again, how can I tell that the microservice is down? What parameters should I check?
Yes, temporal_long_request_failure will increase when polls fail due to a permission failure like the one in that issue (that bug specifically concerns how shutdown is handled in the face of repeated failures of that kind, which is unrelated to how the failures are reported; reporting happens via the metric and in logs after enough retries).
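Putting that together for the scenario in that issue, here is a hedged sketch of an external readiness signal: keep worker.run() in a background task and combine it with the failure-counter check from the earlier sketch. mark_healthy/mark_unhealthy are placeholders for your service's own health reporting, not a Temporal API:

```python
import asyncio


async def run_with_readiness(worker) -> None:
    # The SDK has no programmatic worker-status API yet, so readiness is tracked
    # from the outside: the run() task itself plus the long-poll failure counter.
    # mark_unhealthy / mark_healthy: your service's health reporting hooks (placeholders).
    worker_task = asyncio.create_task(worker.run())
    last_failures = read_counter("temporal_long_request_failure")

    while True:
        await asyncio.sleep(60)

        if worker_task.done():
            # worker.run() returned or raised: the worker is definitely not serving.
            mark_unhealthy(f"worker stopped: {worker_task.exception()}")
            return

        failures = read_counter("temporal_long_request_failure")
        if failures > last_failures:
            # Polls failed in the last minute (e.g. the permission errors in issue #667).
            mark_unhealthy("long poll requests are failing")
        else:
            mark_healthy()
        last_failures = failures
```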