Understanding Temporal Grafana Metrics

Hi, many of the metrics are pretty straightforward to understand, but I have a few questions about the ones below. The restart graph shows a number greater than 0, whereas k8s shows 0 restarts. Can you please explain whether this is a concern and what these values represent?

In our test with two task queues, each taking 400 new workflows/sec, errored workflows are about 0.3% of the total over a period of 8 hours. However, this graph shows a large gap (780 - 550 = 230). What does this gap represent? Failed is 0, and we are not seeing that many errors in our test.

What are these entity-not-found errors on the history/frontend servers, and how do we fix them? Again, these errors are not showing up in our test, where the error rate is only 0.3% over 8 hours.

Hi, can one of you please help with this info?

Hi @ramyamagham I am not an expert in server metrics but will try to help.

  1. restarts - a counter (one for each of the Temporal server services) that indicates how many times the service / pod has restarted so far. There is no time window associated with the metric, so it can show counts accumulated over a long period of time, across multiple restarts/redeployments.

  2. workflow tasks - a) scheduled is recorded in the matching service; b) started/completed/failed are recorded in the history service. A workflow can have multiple workflow tasks, depending on things like how many activities/child workflows it executes. I think of a workflow task as a “unit of progress”; it generally corresponds to a db write (on the Temporal server) during workflow execution. Your graph shows 0 failed workflow tasks, which I think is in line with what you said about your tests.

  3. entity not found - this is a grouping (counter) that can include a number of different operations, such as GetWorkflowExecutionHistory, PollMutableState, PollActivityTaskQueue, etc. You can create a dashboard panel that breaks these errors down by operation using, for example, this query:

sum by (operation) (rate(service_errors{cluster="$cluster",temporal_service_type=~"$Service"}[5m]))

This should give you more details and will probably make more sense when you look at it.
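For the restarts counter in item 1, a windowed query may make it easier to compare against what k8s reports, since it shows only the restarts recorded during the selected window instead of the accumulated total. This is a rough sketch, not a query from the stock dashboard: it assumes the restarts metric carries the same cluster and temporal_service_type labels as the query above, and the 1h window is an arbitrary choice.

```promql
# Sketch: restarts recorded per service over the last hour.
# Assumes "restarts" has the cluster / temporal_service_type labels.
sum by (temporal_service_type) (
  increase(restarts{cluster="$cluster",temporal_service_type=~"$Service"}[1h])
)
```

If this stays at 0 while the plain counter is above 0, the counted restarts most likely happened earlier, for example during a previous redeployment.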

Hope this helps some.
