I came across several answers relating to Temporal server metrics. I am new to Temporal and would like to understand the following:
1. When are the service_latency, service_error_with_type, persistence_errors and persistence_latency metrics useful?
2. What does task_schedule_to_start_latency do?
3. What is the difference between a sync match and an async match? What is async match latency?
4. What is the difference between workflow lock contention and shard lock contention? What does cache_latency_bucket do?
5. Are there any other important metrics that should be monitored at the cluster level? Please mention any document that explains what every server metric does.
service_latency is typically the starting point for investigating which operation(s) have latency issues.
service_error_with_type (since server version 1.17) indicates which service errors you might be running into.
persistence_errors is for checking persistence-related issues. Note that it does not include resource-exhausted errors; for those, check persistence_errors_resource_exhausted.
persistence_latency can indicate whether your db has enough capacity to handle requests.
task_schedule_to_start_latency gives you the latency from schedule (when a workflow/activity task is placed on the matching task queue) to start (when the task is dispatched to your worker poller). It can be very useful for detecting insufficient worker capacity and can help you optimize worker performance. It’s the service-side equivalent of the SDK metrics workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency.
Workflow lock contention: any update to a workflow execution is done under a per-workflow lock. If you have a very large number of updates to a single execution, for example if you start a very large number of activities in parallel, each of these updates (scheduling of activities) tries to obtain the workflow lock, and only one can hold it at a time.
cache_latency measures the latency of the history cache get operation, including obtaining the lock. If it is high, requests are waiting on the per-workflow lock, which is a good indication that you should restructure your workflow implementation so that it does not start too many activities/child workflows in parallel. The same can happen with a large number of activities that all heartbeat.
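To make the per-workflow lock point concrete, here is a rough toy model in Python. It is not how the server is implemented; it only assumes that N updates arrive at the same moment and that each holds the lock for a fixed time. The function names and the 5 ms hold time are made up for illustration.

```python
from typing import Dict, List


def lock_wait_times(num_updates: int, hold_ms: float) -> List[float]:
    """Wait (in ms) before each update acquires the lock.

    Toy assumption: all updates arrive at once and are served one at a
    time, each holding the lock for exactly hold_ms.
    """
    return [i * hold_ms for i in range(num_updates)]


def summarize(num_updates: int, hold_ms: float) -> Dict[str, float]:
    """Max and mean lock wait for a burst of simultaneous updates."""
    waits = lock_wait_times(num_updates, hold_ms)
    if not waits:
        return {"max_wait_ms": 0.0, "mean_wait_ms": 0.0}
    return {
        "max_wait_ms": max(waits),
        "mean_wait_ms": sum(waits) / len(waits),
    }


if __name__ == "__main__":
    # Hypothetical numbers: 10 vs 1000 parallel activities, 5 ms per update.
    print(summarize(10, 5.0))
    print(summarize(1000, 5.0))
```

In this model the wait grows linearly with the number of parallel updates, which is why a high cache_latency tends to show up when a single execution fans out very widely.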
Sync match means the matching service can dispatch a task (workflow/activity task) from the task queue to one of your worker pollers without having to persist it (meaning your workers have enough capacity, available executor slots and pollers to handle new tasks very quickly). Async match means the matching service had to persist the task to the db and then dispatch it to one of your worker pollers when one became available. Sync match rate is the ratio of sync-matched polls to all successful polls (sync plus async) and gives a pretty good indication of your worker capacity, that is, whether your workers can process all tasks for your use case as fast as possible.
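As a small sketch of the sync match rate calculation, the Python below divides the number of sync-matched polls by the number of all successful polls over the same time window. It assumes you have already pulled those two counter increases from your metrics backend (for example from the server's poll_success_sync and poll_success counters); the function name and sample numbers are hypothetical.

```python
from typing import Optional


def sync_match_rate(sync_polls: float, all_polls: float) -> Optional[float]:
    """Fraction of successful polls that were served by a sync match.

    sync_polls: increase of the sync-match poll counter over a window.
    all_polls:  increase of the total successful poll counter (sync plus
                async) over the same window.
    Returns None when there were no successful polls in the window.
    """
    if all_polls <= 0:
        return None
    return sync_polls / all_polls


if __name__ == "__main__":
    # Hypothetical window: 1000 successful polls, 990 of them sync matched.
    rate = sync_match_rate(990, 1000)
    print(f"sync match rate: {rate:.2%}")
```

A rate close to 1 suggests pollers are usually waiting when tasks arrive; a noticeably lower rate suggests tasks are being persisted first because no poller was available.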
Thank you @tihomir for your quick response. That helps.
Can you tell me the unit for these metrics: service_requests, service_error_with_type and persistence_requests?
Is there any threshold that can be set for persistence_latency and service_latency?
service_requests, service_error_with_type and persistence_requests are all counters.
persistence_latency will depend on your db capacity and workload; it should typically be in the low hundreds of ms. Note that underprovisioned workers can also add extra pressure on the db.
Similarly, service_latency might also depend on how optimized your workers are, as well as on your use case (for example burst use cases).
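Since the latency metrics are exposed as histograms (hence names like cache_latency_bucket), a threshold is usually set on a percentile rather than on a raw value. Below is a minimal Python sketch of estimating a percentile from cumulative bucket counts using linear interpolation within a bucket, similar in spirit to what Prometheus does. The bucket boundaries, the counts and the 300 ms threshold are made-up illustration values, not recommendations.

```python
import math
from typing import Optional, Sequence, Tuple


def histogram_quantile(
    q: float, buckets: Sequence[Tuple[float, float]]
) -> Optional[float]:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: (upper_bound_ms, cumulative_count) pairs sorted by bound;
             the last bound may be math.inf.
    Returns None when the histogram is empty.
    """
    if not buckets or buckets[-1][1] == 0:
        return None
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into an open-ended bucket.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical persistence_latency buckets: (upper bound in ms, cumulative count).
    sample = [(50, 60), (100, 90), (500, 99), (math.inf, 100)]
    p95 = histogram_quantile(0.95, sample)
    threshold_ms = 300  # assumed alert threshold, tune for your db and workload
    print(f"p95 = {p95:.1f} ms, above threshold: {p95 > threshold_ms}")
```

The estimate is only as precise as the bucket boundaries allow, so treat the result as approximate when buckets are wide.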