Guidance on creating and interpreting Grafana dashboards

rfwagner · August 19, 2020, 7:59pm

Can you offer some guidance on using Prometheus + Grafana to monitor Temporal? Some topics that would be useful:

what are the top-tier metrics that must be monitored
a brief high level description of what each top-tier metric is measuring
what are the units being displayed
real-world experience on correlating user-visible issues to metrics that can help diagnose those issues

Any help on this topic would be much appreciated!

samar · August 20, 2020, 6:34am

Hey @rfwagner,
This is a lengthy topic, and we plan to provide more information on operationalizing Temporal clusters in the future. We recently started a dashboard repo with some basic dashboards that give visibility into the Temporal service. It is a work in progress, so please use it only as a reference at this point. In the future we plan to have a fully supported version of the Grafana dashboards using PromQL.

All the metrics emitted by the server are listed in defs.go. If you see that something is missing from the dashboards, you can use defs.go as a reference.

Some of the most useful metrics are the following:

Service Metrics:
For each request handled by a service handler, we emit the service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags. This gives you basic visibility into service usage and lets you look at request rates across services, namespaces, or even operations.
Persistence Metrics:
Temporal emits the persistence_requests, persistence_errors, and persistence_latency metrics for each persistence operation. These metrics carry an operation tag, so you can get request rates, error rates, or latencies per operation. They are super useful for identifying issues caused by database problems.
Workflow Stats:
Temporal also emits counters when workflows complete. These are super useful for getting overall stats about workflow completion. Use the workflow_success, workflow_failed, workflow_timeout, workflow_terminate, and workflow_cancel counters, one for each type of workflow completion. They are also tagged with the namespace tag.
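As a rough illustration of how these three categories could be turned into Grafana panel queries, here is a minimal Python sketch that builds PromQL expressions from the metric and tag names listed above. Several things in it are assumptions rather than anything confirmed in this thread: that the metrics appear in Prometheus under exactly these names (some reporter configurations rename them or add suffixes), that the latency metrics are exported as Prometheus histograms with a `_bucket` series, and that a 5-minute rate window suits your scrape interval. The helper names (`selector`, `rate_by`, `error_ratio`, `latency_quantile`, `starter_queries`) are hypothetical and exist only for this example.

```python
from typing import Dict, Mapping, Optional, Sequence


def selector(metric: str, matchers: Optional[Mapping[str, str]] = None) -> str:
    """Render a metric name with optional exact-match label matchers."""
    if not matchers:
        return metric
    inner = ",".join(f'{key}="{value}"' for key, value in sorted(matchers.items()))
    return f"{metric}{{{inner}}}"


def rate_by(metric: str, by: Sequence[str], window: str = "5m",
            matchers: Optional[Mapping[str, str]] = None) -> str:
    """Per-second rate of a counter, summed over the given labels."""
    labels = ", ".join(by)
    return f"sum by ({labels}) (rate({selector(metric, matchers)}[{window}]))"


def error_ratio(errors: str, requests: str, by: Sequence[str], window: str = "5m",
                matchers: Optional[Mapping[str, str]] = None) -> str:
    """Fraction of requests that failed, per label combination."""
    return (f"{rate_by(errors, by, window, matchers)} / "
            f"{rate_by(requests, by, window, matchers)}")


def latency_quantile(q: float, latency_metric: str, by: Sequence[str],
                     window: str = "5m",
                     matchers: Optional[Mapping[str, str]] = None) -> str:
    """Estimated latency quantile.

    ASSUMPTION: the latency metric is a Prometheus histogram, so a
    `<name>_bucket` series with an `le` label exists.
    """
    bucket = selector(latency_metric + "_bucket", matchers)
    labels = ", ".join(["le", *by])
    return f"histogram_quantile({q}, sum by ({labels}) (rate({bucket}[{window}])))"


# Workflow completion counters named in the post above.
WORKFLOW_OUTCOMES = [
    "workflow_success", "workflow_failed", "workflow_timeout",
    "workflow_terminate", "workflow_cancel",
]


def starter_queries(namespace: Optional[str] = None) -> Dict[str, str]:
    """Panel title -> PromQL expression for the three metric categories."""
    ns = {"namespace": namespace} if namespace else None
    queries = {
        "service request rate": rate_by("service_requests", ["type", "operation"], matchers=ns),
        "service error ratio": error_ratio("service_errors", "service_requests",
                                           ["type", "operation"], matchers=ns),
        "service p95 latency": latency_quantile(0.95, "service_latency", ["operation"], matchers=ns),
        # Persistence metrics are described as tagged by operation only.
        "persistence error ratio": error_ratio("persistence_errors", "persistence_requests",
                                               ["operation"]),
        "persistence p95 latency": latency_quantile(0.95, "persistence_latency", ["operation"]),
    }
    for counter in WORKFLOW_OUTCOMES:
        queries[f"{counter} rate"] = rate_by(counter, ["namespace"], matchers=ns)
    return queries


if __name__ == "__main__":
    for title, expr in starter_queries("default").items():
        print(f"{title}: {expr}")
```

On units: `rate()` always yields a per-second value, so the request and completion panels read as events per second, the error ratio is a unitless fraction between 0 and 1, and `histogram_quantile()` returns a value in whatever unit the histogram buckets use. Check the exported metric names and label names in your own Prometheus before pasting any of these into a dashboard.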

I would start with these three basic categories of metrics before digging deeper into metrics that give insight into other internals of the system. Apart from the metrics emitted by the server, you also want to monitor infrastructure-specific metrics like CPU, memory, and network for all hosts running Temporal roles.

Hope this gives you a starting point. Feel free to ask if you have questions about a specific metric or have a specific use case in mind.

Steven_Sun · August 27, 2020, 6:38pm

@samar
I asked this question in here. Would you mind answering it here? I always see poll requests running very high, and Maxim mentioned that should not be the case: Worker Setup Recommendations - #3 by maxim

Nicolas_Gaviria · June 28, 2024, 3:47pm

Thanks @samar for the clarifications and the suggestions.

Just one question:
Would it be possible to filter all these metrics by a workflow_id? It would be great to see the impact of some improvements over time on a single workflow_id.

thanks in advance.
