We have been running Temporal in our staging environment for some time now. The load is typically small and the server has been very stable (very few restarts). We are running it in Kubernetes, with deployment configuration we wrote ourselves (taking inspiration from the provided Helm charts).
We are now planning to deploy Temporal to our production environments soon, and we wanted to validate some operations-related details with you.
Our main concern at the moment is the server health check. We noticed that gRPC health checks are set for the front-end, matching, and history services, but not for the worker service. Our questions are:
Is the gRPC health check enough? That is, could the RPC communication work while the services are still in a broken state?
Why is there no health check for the worker? We are currently not using any of its features (single-cluster setup, no archival), but wouldn't a health check be necessary if we did? There is no livenessProbe set for this service in our deployment at the moment.
We noticed that the CLI tool only checks the health of the Temporal front-end service. Does that actually verify that all the other components are working properly as well?
On another note, we are also starting to evaluate workflow execution reliability (activity/workflow retries, handling expected and unexpected failures). Do you have any guidelines on that?
The gRPC health check is a basic check which shows that the server is up and running and can, at least, accept requests. There is no logic behind it: it does not check the database status or any other internal component. The worker does not have one because it has no gRPC handler at all. We have thought about adding a handler there just for health-check purposes, but it is still in the backlog. The CLI only checks the frontend health check because the frontend (port 7233) is the only thing supposed to be exposed.
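For reference, this is roughly what that basic check amounts to. Below is a minimal Go sketch that calls the standard gRPC health service on the frontend; the address and the registered service name are assumptions based on the default frontend port and what the Helm chart probes use, so adjust them to your own deployment:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Assumed frontend address; replace with your service's DNS name.
	conn, err := grpc.DialContext(ctx, "temporal-frontend:7233",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial frontend: %v", err)
	}
	defer conn.Close()

	// Assumed service name (taken from the Helm chart probe); an empty
	// string would check the overall server health instead.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{
		Service: "temporal.api.workflowservice.v1.WorkflowService",
	})
	if err != nil {
		log.Fatalf("health check failed: %v", err)
	}

	// SERVING only means the gRPC endpoint answered; it says nothing
	// about the database or the other Temporal services.
	fmt.Println("frontend health:", resp.GetStatus())
}
```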
To get a comprehensive picture of cluster health, you need to use metrics and set up dashboards and alerts.
We are currently setting up Grafana dashboards and alerts, so that part is covered.
What we were worried about was the scenario where the front-end service is up and running (gRPC health check OK), but its dependencies are not (e.g. the matching service is down, or the DB is temporarily down). In this case Temporal would still accept requests from its clients, but would not be able to honor them.
Do you have any insights into what would happen in this scenario and how we can handle it? Will the front-end service fail the request and let the workers retry, for example?
If key components such as the DB or the history service are down, the frontend APIs will return errors and the workers just won't be able to make any progress. Errors are never swallowed: if something goes wrong, the error is propagated up to the worker and you will see it in Grafana and in the logs. The workers and the frontend will keep retrying, and as soon as the component is back they will proceed.
We have pipelines which emulate this scenario (killing random nodes) and the system survives it. Of course, you need to have several nodes of everything (Cassandra, frontend, history, matching).
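On the retry/reliability question: activity and workflow retries are configured through the SDK retry policies, and transient failures like the ones above are exactly what they cover. Here is a minimal Go SDK sketch (the activity name ProcessOrder and the error type InvalidOrderError are hypothetical placeholders) showing where the backoff and the non-retryable error types are set:

```go
package app

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// OrderWorkflow sketches how activity-level retries are configured.
// Transient failures (e.g. the DB or a Temporal service being briefly
// unavailable) are retried with exponential backoff; known business
// errors are marked non-retryable so they fail fast.
func OrderWorkflow(ctx workflow.Context, orderID string) (string, error) {
	ao := workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumInterval:    time.Minute,
			MaximumAttempts:    5,
			// Hypothetical application error type that should not be retried.
			NonRetryableErrorTypes: []string{"InvalidOrderError"},
		},
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var result string
	// "ProcessOrder" is a hypothetical activity registered on a worker.
	err := workflow.ExecuteActivity(ctx, "ProcessOrder", orderID).Get(ctx, &result)
	return result, err
}
```

With a policy like this, a flapping dependency just shows up as extra attempts in the workflow history, while known business errors fail the activity immediately.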
In case of a DB password change, is there a way to make the worker and the other Temporal services restart and load the new value from the DB password secret, using the liveness check?