While comparing the Grafana dashboard metrics with the values shown in the Temporal UI, we are noticing significant discrepancies.
For example, for a given time range (24 Nov, 11:00 AM – 12:30 PM):
Temporal UI shows:
X Completed
Y Running
Z Continued As New
etc.
But the official Grafana dashboard panels (which use metrics like temporal_cloud_v0_workflow_success_count, temporal_cloud_v0_workflow_failed_count, temporal_cloud_v0_workflow_terminate_count, temporal_cloud_v0_workflow_continued_as_new_count, etc.) show very different numbers.
Even after aligning the time range, the counts from Prometheus do not match what the UI reports.
Based on our investigation, it appears that:
The Temporal UI is showing the number of workflow executions currently in each status (Visibility data).
The Grafana dashboard is showing the number of state-transition events, where a counter can increment multiple times for a single workflow (e.g., retries, continue-as-new, or failures before an eventual success), as sketched below.
Because of this, a single workflow may generate multiple increments in the Prometheus counters, so the values cannot match the UI 1:1.
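For instance (a hypothetical execution chain, not data from our namespace), a workflow that continues-as-new twice and then completes would, on this reading, increment the counters as follows:

```
temporal_cloud_v0_workflow_continued_as_new_count  +2   # one increment per continue-as-new
temporal_cloud_v0_workflow_success_count           +1   # one increment for the final completion
```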
My questions:
Is this interpretation correct?
Is the mismatch expected for the official Grafana dashboard?
Is there any recommended way to get UI-equivalent workflow counts (based on Visibility state) exposed as Prometheus metrics?
We want to verify whether our dashboards are behaving correctly, or whether something needs to be configured differently.
The data you look at in the UI is Visibility data. This data is bounded by your namespace retention period, so, for example, the number of completed executions shows executions that completed within the last X days, where X is your namespace retention period, e.g. set to 10 days (IIRC the max is 90 days on Cloud).
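If you want UI-equivalent numbers, the most direct route is to query Visibility itself with a List Filter over the same close-time window, e.g. via the CLI. A sketch, where the namespace and timestamps are placeholders:

```
temporal workflow count \
  --namespace my-ns.my-account \
  --query "ExecutionStatus = 'Completed' AND CloseTime >= '2024-11-24T11:00:00Z' AND CloseTime <= '2024-11-24T12:30:00Z'"
```

As far as I know there is no built-in metric that mirrors these Visibility counts; if you need them in Prometheus, one option is a small exporter that periodically runs such a count (the CountWorkflowExecutions API) and publishes the result as a gauge.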
The Grafana dashboard is showing the number of state-transition events, where a counter can increment multiple times for a single workflow
Not sure this is correct, as these are server metrics. Your SDK (worker) metrics could be larger, because the SDK worker requests, for example, completion of a workflow execution, which the server can under some conditions reject. But what you are showing are server metrics, and the server would record, for example, a failed execution just once (it has the final say).
My best guess is that the time range of your metrics queries differs from the namespace retention window, or that you are looking at accumulation of the counters rather than their increase over the window. What metrics queries are you using for these counters?
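To make the accumulation point concrete: these are monotonically increasing counters, so plotting the raw series shows the running total since the series began, not activity inside your 90-minute window. A sketch of the two query shapes, with a placeholder namespace:

```
# Raw counter: cumulative total since the series started; will not line up with any UI time range.
temporal_cloud_v0_workflow_success_count{temporal_namespace="my-ns.my-account"}

# Window-aligned: completions within the selected 90 minutes, the shape to compare against the UI.
sum(increase(temporal_cloud_v0_workflow_success_count{temporal_namespace="my-ns.my-account"}[90m]))
```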