Understanding Temporal Grafana Metrics

ramyamagham · August 20, 2021, 7:06pm

Hi, many of the metrics are straightforward to understand, but I have a few questions about the ones below. The restarts graph shows a number greater than 0, whereas k8s shows 0 restarts. Can you please explain whether this is a concern and what these counts represent?

In our test with two task queues, each taking 400 new workflows/sec, errored workflows are about 0.3% of the total over a period of 8 hours. However, this graph shows a large gap (780 - 550 = 230). What does this gap represent? The failed count is 0, and we are not seeing that many errors in our test.

What are these entity-not-found errors on the history/frontend services, and how do we fix them? Again, these errors are not showing up in our test, where the error rate is only 0.3% over 8 hours.

ramyamagham · August 23, 2021, 6:41pm

Hi, can one of you please help with this info?

tihomir · August 23, 2021, 9:42pm

Hi @ramyamagham, I am not an expert in server metrics but will try to help.

restarts - a counter (one for each of the Temporal server services) that indicates how many times the service/pod has restarted so far. The metric has no time window associated with it, so the count can cover a long period of time and include multiple restarts/redeployments.
workflow tasks - a) scheduled is recorded in the matching service; b) started/completed/failed are recorded in the history service. A workflow can have multiple workflow tasks, depending on things like how many activities/child workflows it executes. I think of a workflow task as a “unit of progress”; it generally corresponds to a db write (on the Temporal server) during workflow execution. Your graph shows 0 failed workflow tasks, which I think is in line with what you said about your tests.
entity not found - this is a grouping (counter) that can include a number of different operations, such as GetWorkflowExecutionHistory, PollMutableState, PollActivityTaskQueue, etc. You can create a dashboard panel that breaks these errors down by their associated operation, using for example this query:

sum by (operation) (rate(service_errors{cluster="$cluster",temporal_service_type=~"$Service"}[5m]))

This should give you more detail and will probably make more sense when you look at it.
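If you want to look at the same per-operation breakdown outside Grafana, here is a minimal Python sketch that sends the query above to the Prometheus HTTP API (the /api/v1/query endpoint) and prints one rate per operation. Outside Grafana the $cluster and $Service dashboard variables are not expanded, so the sketch fills them in itself. The function names, the base URL, and the example cluster/service values are assumptions for illustration only, not something from your setup.

```python
import json
import urllib.parse
import urllib.request

# The query from this post, with the Grafana variables ($cluster, $Service)
# turned into Python format fields. Doubled braces are literal PromQL braces.
QUERY_TEMPLATE = (
    'sum by (operation) (rate(service_errors{{cluster="{cluster}",'
    'temporal_service_type=~"{service}"}}[5m]))'
)


def build_query(cluster, service):
    """Return the PromQL string with the dashboard variables filled in."""
    return QUERY_TEMPLATE.format(cluster=cluster, service=service)


def parse_instant_vector(payload):
    """Map each `operation` label to its rate from an /api/v1/query response."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload}")
    rates = {}
    for series in payload["data"]["result"]:
        operation = series["metric"].get("operation", "<none>")
        _timestamp, value = series["value"]  # the value arrives as a string
        rates[operation] = float(value)
    return rates


def errors_by_operation(base_url, cluster, service):
    """Run the instant query against Prometheus (hypothetical helper)."""
    params = urllib.parse.urlencode({"query": build_query(cluster, service)})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}") as resp:
        return parse_instant_vector(json.load(resp))


# Example usage (assumed URL and label values, adjust to your environment):
# rates = errors_by_operation("http://localhost:9090", "my-cluster", "history|frontend")
# for operation, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
#     print(f"{operation}: {rate:.3f} errors/sec")
```

Sorting by rate puts the noisiest operations at the top, which makes it easier to see whether the count is dominated by poll-type operations or by something your workflows call directly.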

Hope this helps some.
