Guidance on creating and interpreting Grafana dashboards

Can you offer some guidance on using Prometheus + Grafana to monitor Temporal? Some topics that would be useful:

  1. Which top-tier metrics must be monitored
  2. A brief, high-level description of what each top-tier metric measures
  3. The units being displayed
  4. Real-world experience correlating user-visible issues with the metrics that can help diagnose them

Any help on this topic would be much appreciated!

Hey @rfwagner,
This is a lengthy topic, and we plan to provide more information on operationalizing Temporal clusters in the future. We recently started a dashboard repo with some basic dashboards that give visibility into the Temporal service. It is a work in progress, so please use it only as a reference at this point. In the future we plan to have a fully supported version of the Grafana dashboards using PromQL.

All the metrics emitted by the server are listed in defs.go. If you see that something is missing from the dashboards, you can use defs.go as a reference.

Some of the most useful metrics are the following:

  1. Service Metrics:
    For each request handled by a service handler, we emit the service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags. This gives you basic visibility into service usage and lets you look at request rates across services, namespaces, or even operations.

  2. Persistence Metrics:
    Temporal emits the persistence_requests, persistence_errors, and persistence_latency metrics for each persistence operation. These metrics carry an operation tag, so you can get request rates, error rates, or latencies per operation. They are very useful for identifying issues caused by database problems.

  3. Workflow Stats:
    Temporal also emits a counter each time a workflow completes. These are very useful for getting overall stats about workflow completion. There is one counter per completion type: workflow_success, workflow_failed, workflow_timeout, workflow_terminate, and workflow_cancel. They are also tagged with the namespace tag.
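
As a rough sketch of how the service and persistence metrics above might translate into dashboard queries, the Python helpers below build PromQL strings for a per-second request rate, an error ratio, and a latency quantile. The helper names are made up for illustration, and several things are assumptions rather than facts about the server: that the metrics appear in Prometheus under exactly the names above, that the tags appear as labels named operation and namespace, and that the latency metrics are exported as Prometheus histograms with a _bucket series. Check the names in your own Prometheus before relying on any of this.

```python
from typing import List


def request_rate(metric: str, group_by: List[str], window: str = "5m") -> str:
    """Per-second rate of a counter, summed by the given labels."""
    by = ", ".join(group_by)
    return f"sum by ({by}) (rate({metric}[{window}]))"


def error_ratio(errors: str, requests: str, group_by: List[str],
                window: str = "5m") -> str:
    """Fraction of requests that failed (unitless, between 0 and 1)."""
    num = request_rate(errors, group_by, window)
    den = request_rate(requests, group_by, window)
    return f"({num}) / ({den})"


def latency_quantile(metric: str, quantile: float, group_by: List[str],
                     window: str = "5m") -> str:
    """Estimated latency quantile.

    ASSUMPTION: the metric is a Prometheus histogram, so its buckets are
    exposed as `<metric>_bucket` with an `le` label.
    """
    by = ", ".join(["le", *group_by])
    return (f"histogram_quantile({quantile}, "
            f"sum by ({by}) (rate({metric}_bucket[{window}])))")


if __name__ == "__main__":
    # Label names below (namespace, operation) are assumed, not verified.
    print(request_rate("service_requests", ["namespace", "operation"]))
    print(error_ratio("persistence_errors", "persistence_requests", ["operation"]))
    print(latency_quantile("service_latency", 0.95, ["operation"]))
```

For example, request_rate("service_requests", ["namespace", "operation"]) produces `sum by (namespace, operation) (rate(service_requests[5m]))`. Because rate() returns a per-second average, the request-rate panels read as requests per second and the error ratio is unitless. The unit of the latency quantile is whatever unit the histogram buckets were exported in.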

I would start with these three basic categories of metrics before digging deeper into metrics that give insight into other internals of the system. Apart from the metrics emitted by the server, you also want to monitor infrastructure-specific metrics such as CPU, memory, and network for all hosts running Temporal roles.
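
To make the workflow completion counters a bit more concrete: counters are cumulative, so what you usually want is the change over a time window, not the raw value. Below is a minimal sketch, assuming you have already read the five counters at two points in time (how you fetch them is up to you). The function name and the sample numbers are hypothetical, and the reset handling is deliberately crude.

```python
from typing import Dict


def outcome_shares(start: Dict[str, float], end: Dict[str, float]) -> Dict[str, float]:
    """Share of each completion type between two counter snapshots.

    Counters only go up, so the number of completions in the window is
    end - start. A negative difference means the counter was reset (for
    example by a restart); this sketch simply clamps it to zero.
    """
    deltas = {name: max(value - start.get(name, 0.0), 0.0)
              for name, value in end.items()}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in deltas}
    return {name: delta / total for name, delta in deltas.items()}


if __name__ == "__main__":
    # Hypothetical snapshots taken some minutes apart.
    before = {"workflow_success": 100, "workflow_failed": 10, "workflow_timeout": 0}
    after = {"workflow_success": 190, "workflow_failed": 20, "workflow_timeout": 0}
    for name, share in outcome_shares(before, after).items():
        print(f"{name}: {share:.1%}")
```

In practice Prometheus functions such as increase() or rate() do this arithmetic for you, including proper handling of counter resets. The sketch is only meant to show what the resulting numbers mean.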

Hope this gives you a starting point. Feel free to ask if you have questions about a specific metric or have a specific use case in mind.


I asked this question in another thread. Would you mind answering it here? I always see poll requests running very high, and Maxim mentioned that this should not be the case: Worker Setup Recommendations