I am looking for the following metrics to alarm on and monitor for my workflows:
Workflow timeouts
Activity timeouts
Current Task queue size (to alarm/autoscale when backlog is large)
I was not able to find these among the client metrics. I am guessing they are available as server metrics, hopefully broken down by namespace. Could anyone point me to the correct metrics and to any documentation available for server-emitted metrics?
Task Queue backlog (client-side):
The ‘temporal_activity_schedule_to_start_latency’ metric can be used to infer when tasks are piling up. If that number is increasing, scheduled tasks are taking longer to start being processed, which indicates that a backlog is developing.
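As a rough illustration of alarming on that trend, here is a minimal sketch that compares the most recent latency samples against the preceding ones. The function name `backlog_suspected`, the window size, and the growth factor are all hypothetical choices, not part of any Temporal SDK, and the sketch assumes you have already pulled the latency samples (in seconds) out of your metrics backend.

```python
from statistics import mean


def backlog_suspected(latencies_s, window=3, growth_factor=2.0):
    """Return True when the mean of the last `window` samples is at least
    `growth_factor` times the mean of the `window` samples before them.

    `latencies_s` is a chronological list of schedule-to-start latency
    samples in seconds (hypothetical input, e.g. scraped from your
    metrics backend).
    """
    if len(latencies_s) < 2 * window:
        return False  # not enough history to compare two windows
    previous = mean(latencies_s[-2 * window:-window])
    recent = mean(latencies_s[-window:])
    if previous == 0:
        return recent > 0  # any latency at all is growth from zero
    return recent >= growth_factor * previous


# Latency climbing from ~0.1 s to ~1 s suggests a developing backlog.
print(backlog_suspected([0.1, 0.1, 0.1, 0.5, 0.9, 1.4]))  # True
# Flat latency does not.
print(backlog_suspected([0.1, 0.1, 0.1, 0.1, 0.1, 0.1]))  # False
```

In practice the same comparison is usually expressed as an alert rule in the monitoring system itself; the thresholds above are placeholders to tune for your workload.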
Thanks Manu. Is there any documentation for all the server-side metrics?
temporal_activity_schedule_to_start_latency is useful, but it is emitted only when a task is picked up for execution, so it may not correctly portray the backlog size (it shows only how long that particular task was backlogged).
I do agree this temporarily solves the problem, but I would love it if the Temporal service could emit a backlog size metric similar to the SQS/Google Pub/Sub queue size metric.
We are looking into emitting such a metric. The problem is that it is not as simple as the SQS/Google Pub/Sub use case, because those systems don’t have a potentially different ScheduleToStart timeout for each message. So while Temporal knows the number of messages that were put in a queue, that number is not something we can report as the backlog, because the queue might contain any number of messages that have already timed out.
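To make the difficulty concrete, here is a toy model of a queue in which every task carries its own schedule-to-start timeout. It is only an illustration of the counting problem described above, not how the Temporal server stores or tracks tasks; `QueuedTask` and `raw_and_live_counts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QueuedTask:
    scheduled_at: float               # time the task was enqueued, in seconds
    schedule_to_start_timeout: float  # per-task timeout, in seconds


def raw_and_live_counts(tasks, now):
    """Return (raw, live): how many tasks were enqueued, and how many of
    them have not yet exceeded their own schedule-to-start timeout."""
    raw = len(tasks)
    live = sum(
        1 for t in tasks
        if now - t.scheduled_at < t.schedule_to_start_timeout
    )
    return raw, live


tasks = [
    QueuedTask(scheduled_at=0.0, schedule_to_start_timeout=5.0),   # expired at t=5
    QueuedTask(scheduled_at=0.0, schedule_to_start_timeout=60.0),  # still live
    QueuedTask(scheduled_at=8.0, schedule_to_start_timeout=5.0),   # still live
]
print(raw_and_live_counts(tasks, now=10.0))  # (3, 2)
```

The raw count is cheap to maintain, but the live count changes as time passes even when nothing is enqueued or dequeued, which is why a simple counter does not give an accurate backlog size.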