Looking for certain metrics to alarm on

I am looking for the following metrics to alarm/monitor on for my workflows:

  1. Workflow timeouts
  2. Activity timeouts
  3. Current task queue size (to alarm/autoscale when the backlog is large)

I was not able to find these among the client metrics. I am guessing they are available as server metrics, hopefully broken down by namespace. Could anyone point me to the correct metrics and to any documentation available for server-emitted metrics?

Hey Nithin,

Here are some of our Prometheus queries on metrics that cover your scenario:

Workflow Timeout (server-side):
sum(rate(workflow_timeout{cluster="$cluster",temporal_service_type="history",operation="CompletionStats"}[5m]))

Task Queue backlog (client-side):
The ‘temporal_activity_schedule_to_start_latency’ metric can be used to infer when tasks are piling up. If that number is increasing, scheduled tasks are taking longer to start being processed, which indicates that a backlog is developing.

Activity Timeout (server-side):
sum(rate(schedule_to_close_timeout{temporal_service_type="history",operation="TimerActiveTaskActivityTimeout"}[5m]))

(Note that there are similar metrics for start_to_close_timeout, heartbeat_timeout and schedule_to_start_timeout)

Thanks Manu. Is there any documentation for all the server-side metrics?

temporal_activity_schedule_to_start_latency is useful but it is emitted only when the task is picked up for execution and may not correctly portray the backlog size (only how much the current task was backlogged).

I do agree this solves the problem for now, but I would love it if the Temporal service could emit a backlog size metric similar to the SQS / Google Pub/Sub queue size metrics.
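
To make the limitation concrete, here is a toy model in Python. It is only a sketch and not Temporal code: the `Task` class and the `latency_samples` helper are hypothetical names made up for illustration. It assumes a latency sample is recorded only at the moment a task is picked up, as described above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    scheduled_at: float  # time the task was scheduled, in seconds


def latency_samples(tasks, pickup_times):
    """Return schedule-to-start latencies for tasks that were picked up.

    pickup_times[i] is the pickup time of tasks[i], or None if the task
    is still waiting. Waiting tasks produce no sample at all.
    """
    return [
        picked - task.scheduled_at
        for task, picked in zip(tasks, pickup_times)
        if picked is not None
    ]


# Ten tasks scheduled one second apart.
tasks = [Task(scheduled_at=float(i)) for i in range(10)]

# Hypothetical scenario: workers pick up the first two tasks, then stall.
pickups = [0.5, 1.5] + [None] * 8

samples = latency_samples(tasks, pickups)
backlog = sum(picked is None for picked in pickups)

print(samples)  # only the two picked-up tasks contribute samples
print(backlog)  # tasks still waiting, invisible to the latency metric
```

In this scenario the latency metric shows two small samples and then goes quiet, while eight tasks are still waiting in the queue.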

We are looking into emitting such a metric. The problem is that it is not as simple as the SQS / Google Pub/Sub use case, as those systems don't have a potentially different ScheduleToStart timeout for each message. So while Temporal knows the number of messages that were put in a queue, that is not a number we can report as the backlog, because the queue might contain any number of messages that have already timed out.