This is a lengthy topic, and we plan to provide more information on operationalizing Temporal clusters in the future. We recently started a dashboard repo with some basic dashboards that give visibility into the Temporal service. It is a work in progress, so please use it only as a reference at this point. In the future we plan to have a fully supported version of Grafana dashboards using PromQL.
All the metrics emitted by the server are listed in defs.go. If you see that something is missing from the dashboards, you can use defs.go as a reference.
Some of the most useful metrics are the following:
For each request handled by a service handler, we emit service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags. This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces, or even operations.
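As a rough illustration of slicing these metrics by tag, the sketch below builds PromQL query strings in Python. It assumes the metrics reach Prometheus under the same names listed above, that service_latency is exported as a histogram with a `_bucket` series, and that a 5m rate window is appropriate. The exported names and suffixes in your setup may differ, so check them against your own metrics endpoint. The helper names are hypothetical.

```python
def rate_by(metric, labels, window="5m"):
    """PromQL for the per-second rate of a counter, grouped by the given labels."""
    return f"sum by ({', '.join(labels)}) (rate({metric}[{window}]))"


def error_ratio_by(prefix, labels, window="5m"):
    """PromQL for errors / requests for a metric family such as 'service'."""
    errors = rate_by(f"{prefix}_errors", labels, window)
    requests = rate_by(f"{prefix}_requests", labels, window)
    return f"{errors} / {requests}"


def latency_quantile_by(prefix, labels, quantile=0.95, window="5m"):
    """PromQL for a latency quantile; assumes a histogram with a _bucket series."""
    group = ", ".join(["le", *labels])
    return (
        f"histogram_quantile({quantile}, "
        f"sum by ({group}) (rate({prefix}_latency_bucket[{window}])))"
    )


if __name__ == "__main__":
    print(rate_by("service_requests", ["operation"]))
    print(error_ratio_by("service", ["namespace"]))
    print(latency_quantile_by("service", ["type", "operation"]))
```

The resulting strings can be pasted into a Grafana panel or the Prometheus expression browser. Grouping by a different tag only requires changing the label list.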
Temporal emits persistence_requests, persistence_errors, and persistence_latency metrics for each persistence operation. These metrics are tagged with an operation tag so you can get request rates, error rates, or latencies per operation. They are very useful for identifying issues caused by database problems.
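To make the per-operation breakdown concrete, here is a small Python sketch that computes error ratios from two snapshots of the persistence_requests and persistence_errors counters, keyed by the operation tag. How you obtain the snapshots (scraping, or querying your metrics backend) is left out, and the operation names and numbers are made up for illustration.

```python
def error_ratio_per_operation(requests_before, requests_after,
                              errors_before, errors_after):
    """Return {operation: errors / requests} over the interval between snapshots.

    Each argument maps an operation tag value to a cumulative counter value.
    Operations with no new requests in the interval are skipped.
    """
    ratios = {}
    for operation, after in requests_after.items():
        new_requests = after - requests_before.get(operation, 0)
        if new_requests <= 0:
            continue  # no traffic, or the counter reset (e.g. process restart)
        new_errors = errors_after.get(operation, 0) - errors_before.get(operation, 0)
        ratios[operation] = max(new_errors, 0) / new_requests
    return ratios


if __name__ == "__main__":
    # Hypothetical operation names and counter values, for illustration only.
    ratios = error_ratio_per_operation(
        requests_before={"OpA": 100, "OpB": 50},
        requests_after={"OpA": 300, "OpB": 50},
        errors_before={"OpA": 1},
        errors_after={"OpA": 11},
    )
    print(ratios)
```

An operation whose ratio jumps while the others stay flat is a reasonable first place to look when you suspect the database.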
Temporal also emits counters on completion of workflows. These are useful for getting overall stats about workflow completion. Use the workflow_success, workflow_failed, workflow_timeout, workflow_terminate, and workflow_cancel counters for each type of workflow completion. They are also tagged with a namespace tag.
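One way to turn these counters into an overall picture is to compute the share of each completion type for a namespace. The Python sketch below does that for a plain dictionary of counter values. The function name and the sample numbers are hypothetical, and fetching the values from your metrics backend is not shown.

```python
# Counter names as listed in the post above.
COMPLETION_COUNTERS = (
    "workflow_success",
    "workflow_failed",
    "workflow_timeout",
    "workflow_terminate",
    "workflow_cancel",
)


def completion_breakdown(counts):
    """Return the fraction of completions per counter for one namespace.

    `counts` maps counter name to the number of completions observed in some
    interval. Missing counters are treated as zero. Returns an empty dict
    when nothing completed.
    """
    total = sum(counts.get(name, 0) for name in COMPLETION_COUNTERS)
    if total == 0:
        return {}
    return {name: counts.get(name, 0) / total for name in COMPLETION_COUNTERS}


if __name__ == "__main__":
    # Made-up numbers for a single namespace.
    sample = {"workflow_success": 90, "workflow_failed": 10}
    for name, share in completion_breakdown(sample).items():
        print(f"{name}: {share:.1%}")
```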
I would start with these three basic categories of metrics before digging deeper into metrics that give insight into other internals of the system. Apart from the metrics emitted by the server, you also want to monitor infrastructure-specific metrics such as CPU, memory, and network for all hosts running Temporal roles.
Hope this gives you a starting point. Feel free to ask if you have questions about a specific metric or have a specific use case in mind.