Temporal Server Health Check

andrei · January 23, 2021, 2:16pm

Hello,

We have been running Temporal in our staging environment for some time now. The load is typically small and the server has been very stable (very few restarts). We are running it in Kubernetes and we have written the deployment configuration ourselves (taking inspiration from the provided Helm charts).

We are now planning to deploy Temporal in our production environments soon, and we wanted to validate some operational details with you guys.

Our main concern at the moment is the server health check. We have noticed that you have set the gRPC health checks for the front-end, matching and history servers, but not the worker server. The questions would be:

Is the gRPC health check enough? Meaning, could it happen that the rpc communication works, but the servers are still in a broken state?
Why is there no health check for the worker? We are currently not using any features of the worker (single cluster setup, no archival), but wouldn’t a health check be necessary if we did use it? There is no livenessProbe set for this server at the moment in our deployment.
We noticed that the CLI tool only checks the health of the temporal front-end server. Is it actually checking that all the components are working properly as well?

On another note, we are also starting an evaluation of the actual workflow execution reliability (activity/workflow retries, handling unexpected/expected failures). Do you have any guidelines on that?

Thanks,
Andrei

alex · January 26, 2021, 6:05am

The gRPC health check is a basic health check that shows the server is up and running and can at least accept requests. There is no logic behind it: it doesn’t check database status, for instance, or any other internal components. The worker doesn’t have one because it has no gRPC handler at all. We thought about adding a handler there just for health check purposes, but that is still in the backlog. The CLI only checks the frontend health check because that is the only thing (port 7233) that is supposed to be exposed.

To get a comprehensive picture of cluster health, you need to use metrics and set up dashboards and alerts.

andrei · January 27, 2021, 8:54am

Thanks for getting back to me @alex!

We are currently setting up Grafana dashboards and alerts, so that part is covered.

What we were worried about was the scenario where the front-end service is up and running (gRPC health check OK), but its dependencies are not (e.g. the matching service is down, or the DB is temporarily down). In this case, Temporal would still accept requests from its clients, but would not be able to honor them.

Do you have any insights into what would happen in this scenario and how we can handle it? Will the front-end service fail the request and let the workers retry, for example?

alex · January 28, 2021, 1:22am

If key components such as the DB or history are down, frontend APIs will return errors and workers just won’t be able to make any progress. Errors are never swallowed: if something goes wrong, the error is propagated up to the worker and you will see it in Grafana and in the logs. Workers and the frontend will keep retrying, and as soon as the component is back they will proceed.
We have pipelines that emulate this scenario (killing random nodes), and the system survives it. Of course, you need to have several nodes of everything (Cassandra, frontend, history, matching).
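
The "keep retrying until the component is back" behavior described above can be pictured with a small sketch. This is only an illustration of retry with exponential backoff, not Temporal's actual retry implementation: the function name `retry_until_available`, its parameters, and the use of `ConnectionError` to stand in for a dependency being down are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_available(
    operation: Callable[[], T],
    max_attempts: int = 10,
    initial_delay: float = 0.5,
    backoff: float = 2.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is exhausted.

    Hypothetical helper: ConnectionError stands in for "a dependency
    (DB, history, matching) is temporarily down".
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            # The error is not swallowed: once attempts run out it is
            # re-raised so the caller (and its logs/metrics) can see it.
            if attempt == max_attempts:
                raise
            sleep(delay)
            # Grow the wait between attempts, capped at max_delay.
            delay = min(delay * backoff, max_delay)
```

The `sleep` parameter is injectable only so the sketch can be exercised without real waiting. In a real deployment the retry policy would come from the Temporal SDK and server configuration rather than from hand-written code like this.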

max · April 18, 2022, 1:10am

Apologies for the off-topic reply. I couldn’t find documentation for the gRPC health checks. For anyone trying to set them up, the service name you need is defined in tctl/clusterCommands.go at commit 79ca2a900aae7d2b15688f536fa05edef969c2a7 of the temporalio/tctl repository on GitHub.

tihomir · April 18, 2022, 3:59pm

Thanks @max

Just wanted to add that this is the name for the frontend service check only, which you can also run via:

tctl cluster health

The health check service names for the other services are:

Matching Service: temporal.api.workflowservice.v1.MatchingService
History Service: temporal.api.workflowservice.v1.HistoryService
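
For anyone wiring these names into a probe, here is a minimal sketch that builds the argument list for the `grpc_health_probe` tool (its `-addr` and `-service` flags) for each component. The matching and history service names are the ones quoted in this thread. Everything else is an assumption to verify against your own deployment: the frontend service name (check it against the file max linked), the history and matching ports, the binary name as installed, and the helper names `PROBE_TARGETS` and `probe_command`.

```python
from typing import Dict, List, NamedTuple


class ProbeTarget(NamedTuple):
    port: int
    health_service: str


# Matching/history service names are quoted from this thread; port 7233 for
# the frontend is mentioned in the thread as well.
# ASSUMPTIONS: the frontend service name and the matching/history ports are
# not stated in the thread; adjust them to your configuration.
PROBE_TARGETS: Dict[str, ProbeTarget] = {
    "frontend": ProbeTarget(7233, "temporal.api.workflowservice.v1.WorkflowService"),
    "matching": ProbeTarget(7235, "temporal.api.workflowservice.v1.MatchingService"),
    "history": ProbeTarget(7234, "temporal.api.workflowservice.v1.HistoryService"),
}


def probe_command(component: str, host: str = "127.0.0.1") -> List[str]:
    """Return an argv list for a gRPC health probe of one Temporal service."""
    if component == "worker":
        # Per this thread, the worker has no gRPC handler to probe.
        raise ValueError("worker has no gRPC health check handler")
    target = PROBE_TARGETS[component]
    return [
        "grpc_health_probe",  # assumed binary name on PATH
        f"-addr={host}:{target.port}",
        f"-service={target.health_service}",
    ]
```

The returned list could be used as the `command` of a Kubernetes exec probe or passed to `subprocess.run`. As noted earlier in the thread, a passing probe only shows that the service accepts requests, not that its dependencies are healthy.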

f1bonacc1 · January 13, 2024, 12:40am

In case of a DB password change, is there a way to make the worker and other temporal services restart and load the new value from the DB password secret, using the liveness check?
