I am trying to set up observability for our Temporal cluster, and I am not sure whether there is a best practice for monitoring that the server is running well and that the applications using it are running well.
I have defined a few metrics to describe the situation, and I would appreciate pointers to documentation on the following questions:
- How can I observe the metrics below, put them on a single dashboard, and set alerts on them?
- Do these metrics make sense, or do I need other metrics as well?
Whether tasks are piling up in task queues:
- How many queues there are, and the workers, signals, and queries related to them
- The backlog of unhandled tasks in each queue
- The processing time of activity events
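For the backlog question, schedule-to-start latency is the usual proxy: if tasks wait longer before a worker picks them up, the queue is backing up. Below is a minimal sketch that builds a PromQL expression for the 95th percentile per task queue. The metric name `temporal_activity_schedule_to_start_latency_bucket` and the `task_queue` label are assumptions based on typical SDK metric naming; the exact name (for example a `_seconds` suffix) depends on the SDK and its Prometheus configuration, so check what your workers actually export.

```python
def schedule_to_start_p95(
    metric: str = "temporal_activity_schedule_to_start_latency_bucket",  # assumed name
    window: str = "5m",
) -> str:
    """Build a PromQL query for p95 schedule-to-start latency per task queue."""
    # rate() turns the cumulative histogram buckets into per-second increases,
    # sum by (task_queue, le) merges all workers polling the same queue,
    # and histogram_quantile() estimates the quantile from the buckets.
    return (
        "histogram_quantile(0.95, "
        f"sum by (task_queue, le) (rate({metric}[{window}])))"
    )


if __name__ == "__main__":
    print(schedule_to_start_p95())
```

The resulting expression can be pasted into a dashboard panel or used as the condition of an alert rule with a threshold that suits your workload.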
Server-side gRPC method metrics:
- Total QPS and successful QPS
- Latency
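To make "total QPS" and "successful QPS" concrete, here is a small sketch of the arithmetic, assuming the server exposes one cumulative counter for requests and one for errors (on Temporal these are typically `service_requests` and `service_errors`, which is an assumption to verify against your server version). The `CounterSample` type is hypothetical and only for illustration. In practice Prometheus `rate()` does this for you and also handles counter resets, which this sketch does not.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One scrape of two cumulative counters (hypothetical helper type)."""
    timestamp: float  # seconds
    requests: float   # cumulative request count
    errors: float     # cumulative error count


def qps(prev: CounterSample, curr: CounterSample) -> tuple[float, float]:
    """Return (total_qps, successful_qps) between two scrapes."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    total = (curr.requests - prev.requests) / elapsed
    failed = (curr.errors - prev.errors) / elapsed
    return total, total - failed


if __name__ == "__main__":
    before = CounterSample(timestamp=0.0, requests=1000, errors=10)
    after = CounterSample(timestamp=60.0, requests=1600, errors=40)
    print(qps(before, after))
```

Grouping the same calculation by gRPC method gives a per-method view, which is usually more useful for alerting than one cluster-wide number.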
Failures and risks in workflow executions and activities:
- Activity failure: if one activity fails (probably along with all activities of the same type) under a tolerant retry policy, how can we find it quickly?
- Activity latency: if an activity's processing time grows to an abnormal level, how can we observe it?
- Workflow executions: the same failure and latency questions apply.
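With a tolerant retry policy, a failing activity may never fail its workflow, so one option is to look at failed attempts per activity type rather than at workflow outcomes. The sketch below assumes you can obtain per-type attempt and failure counts for a time window (for example from an SDK counter such as `temporal_activity_execution_failed` labeled by `activity_type`, which is an assumption to check against your SDK). The function name, thresholds, and activity names are hypothetical.

```python
def failing_activity_types(
    attempts: dict[str, int],
    failures: dict[str, int],
    max_ratio: float = 0.05,  # illustrative threshold
    min_attempts: int = 20,   # ignore types with too little traffic
) -> dict[str, float]:
    """Return activity types whose failure ratio exceeds max_ratio."""
    flagged = {}
    for activity_type, total in attempts.items():
        if total < min_attempts:
            continue  # too few samples for a meaningful ratio
        ratio = failures.get(activity_type, 0) / total
        if ratio > max_ratio:
            flagged[activity_type] = ratio
    return flagged


if __name__ == "__main__":
    attempts = {"ChargeCard": 200, "SendEmail": 500, "Rare": 5}
    failures = {"ChargeCard": 30, "SendEmail": 5, "Rare": 5}
    print(failing_activity_types(attempts, failures))
```

The same shape works for latency: compare each type's current percentile against a baseline and flag the types that exceed it by a chosen factor.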
If you want to get started quickly, take a look at this docker compose repo. It has Prometheus built in with scrape points already configured, as well as SDK and Server dashboards that you can use out of the box to get started in the right direction. Hope this helps.