Temporal Server Metrics

Prita_Roy · September 12, 2023, 10:49pm

I came across several answers relating to Temporal server metrics. I am new to Temporal and would like to understand the following:

1. When are the service_latency, service_error_with_type, persistence_errors and persistence_latency metrics useful?
2. What does task_schedule_to_start_latency do?
3. What is the difference between a sync match and an async match? What is async match latency?
4. What is the difference between workflow lock contention and shard lock contention? What does cache_latency_bucket do?

Are there any other important metrics that should be monitored at the cluster level? Please mention if there is any document that explains what each server metric does.

tihomir · September 19, 2023, 5:05am

service_latency is typically the starting point for investigating which operation(s) have latency issues
service_error_with_type (since server version 1.17) gives you an indication of what service errors you might be running into
persistence_errors is for checking any persistence-related issues. Note that it does not include resource exhausted issues; for those, check persistence_errors_resource_exhausted
persistence_latency can give you an indication of whether your db has enough capacity to handle requests
task_schedule_to_start_latency gives you the schedule (when a workflow/activity task is placed on the matching task queue) to start (when the task is dispatched to your worker poller) latency. It can be very useful for detecting worker capacity issues and can help optimize worker performance. It’s the service-side equivalent of the sdk metrics workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency
workflow lock contention - any updates to a workflow execution are done under a per-workflow lock. If you have a very large number of updates to a single execution, for example if you start a very large number of activities in parallel, each of these updates (scheduling of activities) tries to obtain the workflow lock, and only one can hold it at a time
cache_latency measures the history cache get operation latency, including obtaining a lock. If it is high, requests are waiting on the per-workflow lock, which is a good indication that you should restructure your workflow impl so it does not start too many activities/child workflows in parallel. The same can be the case for a large number of activities that all heartbeat
sync match means the matching service can dispatch a task (workflow/activity task) from the task queue to one of your worker pollers without having to persist it (meaning your workers have enough capacity, available executor slots and pollers to handle new tasks very quickly). async match means the matching service had to persist tasks to the db and then dispatch them to your worker pollers when they become available.
sync match rate is the ratio of sync and async match polls and gives a pretty good indication of your worker capacity, i.e. whether your workers can process all tasks for your use case as fast as possible.
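
To make the sync match rate concrete, here is a minimal Python sketch. It assumes the rate is taken as the share of sync matches among all matched polls over some time window. The function name and its two inputs are hypothetical: they stand for counts you would read from your own metrics backend and are not part of any Temporal API.

```python
def sync_match_rate(sync_matches: float, async_matches: float) -> float:
    """Return the fraction of matched polls that were sync matches.

    Both arguments are hypothetical counts taken over the same time window.
    Assumption: the rate is defined as sync / (sync + async).
    """
    total = sync_matches + async_matches
    if total <= 0:
        # No polls were matched in the window, so no rate can be derived.
        return 0.0
    return sync_matches / total


if __name__ == "__main__":
    # 990 tasks were handed straight to a waiting poller, 10 had to be persisted first.
    print(sync_match_rate(990, 10))
```

A value close to 1.0 would suggest workers are picking tasks up as they arrive, while a falling value would suggest tasks are increasingly being persisted before dispatch.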

Prita_Roy · September 19, 2023, 12:47pm

Thank you @tihomir for your quick response. That helps.
Can you tell me the unit for these metrics: service_requests, service_error_with_type and persistence_requests?
Is there any threshold that can be set for persistence_latency and service_latency?

tihomir · September 19, 2023, 1:26pm

service_requests, service_error_with_type, persistence_requests are all counters
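
Since counters only ever increase, they are usually read as a rate over a time window instead of as a raw value. The sketch below shows one way to derive a per-second rate from two samples of a counter. The function name, the sample values and the reset handling are assumptions for illustration; a metrics backend would normally do this for you.

```python
def counter_rate(prev_value: float, prev_ts: float,
                 curr_value: float, curr_ts: float) -> float:
    """Per-second increase of a counter between two samples.

    Timestamps are in seconds. All names here are hypothetical.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("current sample must be newer than the previous one")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, e.g. after a process restart.
        # Assume it restarted from zero and count only the new value.
        delta = curr_value
    return delta / elapsed


if __name__ == "__main__":
    # e.g. a request counter sampled 60 seconds apart
    print(counter_rate(100, 0, 400, 60))
```

Dividing the rate of an error counter by the rate of the matching request counter over the same window gives an error ratio, which is often easier to alert on than either counter alone.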

persistence_latency will depend on your db capacity and workload; typically it should be in the low hundreds of ms. Note that underprovisioned workers can also add extra pressure on the db.

Similarly, service_latency might also depend on how optimized your workers are as well as on your use case (for example, burst use cases)
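
When picking a threshold for a latency metric, it is common to compare a percentile against it instead of an average. The sketch below estimates a percentile from cumulative histogram buckets using linear interpolation within the bucket that contains the target rank. The bucket bounds, counts and the function name are made-up illustrations, not real server output.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound; the last bound may be math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into an unbounded bucket.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds.
    sample = [(0.05, 50), (0.1, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]
    p95 = estimate_quantile(0.95, sample)
    print(p95, "above threshold" if p95 > 0.3 else "within threshold")
```

The estimate is only as fine-grained as the bucket boundaries, so any threshold is best chosen with those boundaries in mind.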

Prita_Roy · September 19, 2023, 1:39pm

Thanks for clearing that up again, @tihomir.
