Hey @rfwagner,
This is a lengthy topic, and we plan to provide more information on operationalizing Temporal clusters in the future. We recently started a dashboards repo with some basic dashboards that give visibility into the Temporal service. It is a work in progress, so please treat it only as a reference at this point. In the future we plan to ship a fully supported set of Grafana dashboards built on PromQL.
All the metrics emitted by the server are listed in defs.go, so if something is missing from the dashboards you can use defs.go as a reference.
Some of the most useful metrics are the following:
- **Service metrics:** For each request, the service handler emits the `service_requests`, `service_errors`, and `service_latency` metrics with `type`, `operation`, and `namespace` tags. These give you basic visibility into service usage and let you look at request rates across services, namespaces, or even individual operations.
- **Persistence metrics:** Temporal emits the `persistence_requests`, `persistence_errors`, and `persistence_latency` metrics for each persistence operation. They carry an `operation` tag so you can get request rates, error rates, or latencies per operation, which makes them very useful for identifying issues caused by database problems.
- **Workflow stats:** Temporal also emits counters on workflow completion, which are very useful for overall stats on how workflows finish. Use the `workflow_success`, `workflow_failed`, `workflow_timeout`, `workflow_terminate`, and `workflow_cancel` counters for each type of workflow completion. They are also tagged with a `namespace` tag.
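To make the three categories above concrete, here are some PromQL queries along the lines of what you might put on a dashboard. This is only a sketch: it assumes you scrape these metrics with Prometheus and that your reporter exports them under these exact names with these label names (naming and sanitization can differ by reporter configuration, and latency may not be exported as a histogram in your setup — check defs.go and your metrics config first).

```promql
# Service request rate per operation and namespace
sum(rate(service_requests[1m])) by (operation, namespace)

# Service error ratio per operation
sum(rate(service_errors[1m])) by (operation)
  / sum(rate(service_requests[1m])) by (operation)

# p95 persistence latency per operation
# (assumes latency is exported as a Prometheus histogram with _bucket series)
histogram_quantile(0.95,
  sum(rate(persistence_latency_bucket[1m])) by (operation, le))

# Workflow completion rates per namespace
sum(rate(workflow_success[5m])) by (namespace)
sum(rate(workflow_failed[5m])) by (namespace)
```

The same rate/ratio pattern applies to the other completion counters (`workflow_timeout`, `workflow_terminate`, `workflow_cancel`).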
I would start with these three basic categories before digging into metrics that give insight into other internals of the system. Beyond the metrics emitted by the server, you also want to monitor infrastructure metrics like CPU, memory, and network for all hosts running Temporal roles.
Hope this gives you a starting point. Feel free to ask if you have questions about a specific metric or have a specific use case in mind.