Looking for certain metrics to alarm on

nithin · October 30, 2020, 5:40pm

I am looking for the following metrics to alarm/monitor on for my workflows:

Workflow timeout
Activity timeout
Current task queue size (to alarm/autoscale when the backlog is large)

I was not able to find these among the client metrics. I am guessing they are available as server metrics, hopefully broken down by namespace. Could anyone point me to the correct metrics and to any documentation available for server-emitted metrics?

manu · October 30, 2020, 8:23pm

Hey Nithin,

Here are some of our Prometheus queries on metrics that cover your scenario:

Workflow Timeout (server-side):
sum(rate(workflow_timeout{cluster="$cluster",temporal_service_type="history",operation="CompletionStats"}[5m]))

Task Queue backlog (client-side):
The ‘temporal_activity_schedule_to_start_latency’ metric can be used to infer when tasks are piling up. If that number is increasing, it's taking longer for scheduled tasks to start getting processed, which indicates that a backlog is developing.
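
As a rough illustration of "that number increasing", here is a minimal Python sketch that compares two adjacent windows of latency samples. The function name, the window size, the growth factor and the 1-second floor are all hypothetical choices for illustration, not anything Temporal or Prometheus provides; in practice the samples would come from whatever your monitoring system reports for the latency metric.

```python
from statistics import mean


def backlog_is_growing(latencies_s, window=3, factor=1.5, floor_s=1.0):
    """Heuristic: compare the newest `window` samples with the `window` before.

    All thresholds here are illustrative assumptions, not recommended values.
    """
    if len(latencies_s) < 2 * window:
        return False  # not enough data to compare two windows
    previous = mean(latencies_s[-2 * window:-window])
    recent = mean(latencies_s[-window:])
    # Require both relative growth and an absolute floor to ignore noise.
    return recent > factor * previous and recent > floor_s


steady = [0.2, 0.3, 0.2, 0.3, 0.2, 0.3]
rising = [0.2, 0.3, 0.2, 1.0, 2.5, 4.0]
print(backlog_is_growing(steady), backlog_is_growing(rising))  # prints: False True
```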

Activity Timeout (server-side):
sum(rate(schedule_to_close_timeout{temporal_service_type="history",operation="TimerActiveTaskActivityTimeout"}[5m]))

(Note that there are similar metrics for start_to_close_timeout, heartbeat_timeout and schedule_to_start_timeout)
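
If you want one alert per timeout type, the four queries can be generated rather than hand-written. This is only a sketch: the metric names are the ones mentioned above, but it assumes the other three metrics carry the same operation label as schedule_to_close_timeout, which you should verify against your own server's metrics. The helper name is hypothetical.

```python
# Metric names as mentioned in this thread.
ACTIVITY_TIMEOUT_METRICS = [
    "schedule_to_close_timeout",
    "start_to_close_timeout",
    "heartbeat_timeout",
    "schedule_to_start_timeout",
]


def activity_timeout_query(metric, window="5m"):
    """Build a PromQL rate query for one activity timeout metric."""
    # Assumption: every timeout metric uses the same label values as the
    # schedule_to_close_timeout query shown above.
    selector = (
        'temporal_service_type="history",'
        'operation="TimerActiveTaskActivityTimeout"'
    )
    return f"sum(rate({metric}{{{selector}}}[{window}]))"


queries = [activity_timeout_query(m) for m in ACTIVITY_TIMEOUT_METRICS]
for query in queries:
    print(query)
```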

nithin · October 30, 2020, 9:02pm

Thanks Manu. Is there any documentation for all the server-side metrics?

temporal_activity_schedule_to_start_latency is useful, but it is emitted only when a task is picked up for execution, so it may not accurately portray the backlog size (only how long that particular task was backlogged).

I do agree this solves the problem for now, but I would love it if the Temporal service could emit a backlog size metric similar to the SQS / Google Pub/Sub queue size metrics.

maxim · October 31, 2020, 12:10am

We are looking into emitting such a metric. The problem is that it is not as simple as the SQS/Google Pub/Sub use case, as these systems don’t have a potentially different ScheduleToStart timeout for each message. So while Temporal knows the number of messages that were put in a queue, this number is not what we can report, as the queue might contain any number of already timed-out messages.
