We have been running Temporal in our staging environment for some time now. The load is typically small and the server has been very stable (very few restarts). We are running it in Kubernetes and wrote the deployment configuration ourselves, taking inspiration from the provided helm charts.
We are now planning to deploy Temporal in our production environments soon, and we wanted to validate some operational details with you.
Our main concern at the moment is the server health check. We have noticed that you have set the gRPC health checks for the history servers, but not the worker server. The questions would be:
- Is the gRPC health check enough? That is, could the RPC communication work while the servers are still in a broken state?
- Why is there no health check for the worker? We are currently not using any features of the worker (single cluster setup, no archival), but wouldn’t a health check be necessary if we did use it? There is no livenessProbe set for this server at the moment in our deployment.
- We noticed that the CLI tool only checks the health of the Temporal frontend server. Does that check also verify that all the other components are working properly?
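To make the last question concrete, here is a minimal sketch of how we imagine probing each component separately instead of only the frontend. It uses the standard gRPC health checking protocol through the `grpcio` and `grpcio-health-checking` packages. The host names, ports, and health service names in `TARGETS` are assumptions based on our reading of the helm charts, not confirmed values, and `check_all` is just a hypothetical helper of ours.

```python
from typing import Callable, Dict, Tuple

# ASSUMPTION: addresses and health service names are guesses modeled on the
# helm charts; adjust them to whatever your deployment actually exposes.
TARGETS: Dict[str, Tuple[str, str]] = {
    "frontend": ("temporal-frontend:7233",
                 "temporal.api.workflowservice.v1.WorkflowService"),
    "history": ("temporal-history:7234",
                "temporal.api.workflowservice.v1.HistoryService"),
}


def grpc_probe(address: str, service: str, timeout: float = 2.0) -> bool:
    """Return True if the standard gRPC health service reports SERVING."""
    # Imported lazily so the rest of the module works without grpc installed.
    import grpc
    from grpc_health.v1 import health_pb2, health_pb2_grpc

    try:
        with grpc.insecure_channel(address) as channel:
            stub = health_pb2_grpc.HealthStub(channel)
            request = health_pb2.HealthCheckRequest(service=service)
            response = stub.Check(request, timeout=timeout)
            return response.status == health_pb2.HealthCheckResponse.SERVING
    except grpc.RpcError:
        # Unreachable, timed out, or unknown service: treat as unhealthy.
        return False


def check_all(
    targets: Dict[str, Tuple[str, str]],
    probe: Callable[[str, str], bool] = grpc_probe,
) -> Dict[str, bool]:
    """Probe every component and report its health individually."""
    return {name: probe(addr, svc) for name, (addr, svc) in targets.items()}
```

Even with something like this in place, a `SERVING` response only tells us that the health endpoint answers, which is exactly why we are asking whether it is a sufficient signal.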
On another note, we are also starting an evaluation of the reliability of workflow execution itself (activity and workflow retries, handling of expected and unexpected failures). Do you have any guidelines on that?