How to observe the Temporal cluster?

I am trying to improve the observability of our Temporal cluster, and I am not sure if there is a best practice for measuring and monitoring whether the server is healthy and whether the applications using it are running well.
I have tried to define a couple of metrics to describe the situation, and I would appreciate any documents that address the following questions:

  1. how to observe the following metrics, build them into one dashboard, and set alerts on them

  2. whether these metrics make sense, or whether I need other metrics for observation

  • Whether tasks are piling up in queues
    • How many task queues there are, and the workers/signals/queries related to them
    • Unhandled tasks piling up in queues
    • Activity event processing time
  • Server-side gRPC method metrics
    • Total QPS and Successful QPS
    • Latency
  • Failures and risks in Workflow Executions and Activities
    • Activity failure: if one activity (probably all activities of the same type) fails despite a tolerant retry policy, how to find it quickly
    • Activity latency: if one activity’s processing time grows to an abnormal level, how to observe it
    • The same failure and latency questions for Workflow Executions
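For the task-queue and activity questions above, the SDKs already emit metrics you can query directly. A hedged sketch in PromQL (the metric names below are the common Go/Java SDK names with the default temporal_ prefix, but names vary by SDK and version, so check your own /metrics endpoint first):

```promql
# Backlog proxy: how long tasks sit in a queue before a worker picks them up
histogram_quantile(0.95, sum(rate(temporal_activity_schedule_to_start_latency_bucket[5m])) by (task_queue, le))

# Activity failures by type (catches one activity type failing despite retries)
sum(rate(temporal_activity_execution_failed[5m])) by (activity_type)

# Activity processing time by type (catches abnormal latency growth)
histogram_quantile(0.95, sum(rate(temporal_activity_execution_latency_bucket[5m])) by (activity_type, le))
```

A rising schedule-to-start latency on a task queue is usually the clearest signal that tasks are piling up faster than workers can handle them.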

A couple of related docs that might be helpful to you:

For SDK Metrics:
Worker tuning guide: How to tune Workers | Temporal Documentation
Lists of metrics produced:
Go SDK
Java SDK
and SDK metrics docs: SDK metrics | Temporal Documentation

If you want to get started quickly, take a look at this docker-compose repo: it has built-in Prometheus with scrape points, as well as SDK and server dashboards that you can use out of the box to point you in the right direction. Hope this helps.
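If you wire it up yourself instead, the scrape config is small. A sketch of a prometheus.yml (the job names, host names, and ports here are assumptions, so match them to your deployment: the server port comes from the Prometheus listen address in the server config, and the worker port is wherever your worker exposes its metrics handler):

```yaml
scrape_configs:
  # Temporal server metrics (frontend/history/matching/worker services)
  - job_name: temporal-server
    static_configs:
      - targets: ['temporal:8000']    # assumed server metrics listen address
  # SDK metrics from your own workers
  - job_name: temporal-workers
    static_configs:
      - targets: ['my-worker:9090']   # assumed worker metrics endpoint
```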


To add a couple of server-metrics Grafana queries that might be useful to you:

  1. Service errors by type (change service_type to the service you want):

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)

  2. Service latencies by operation (again, you can change service_type if needed):

histogram_quantile(0.95, sum(rate(service_latency_bucket{service_type="frontend"}[5m])) by (operation, le))

  3. Persistence latencies by operation:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

  4. Sync match rate:

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

  5. Workflow lock contention:

histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

  6. Shard lock contention:

histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))

  7. Shard movement:

sum(rate(sharditem_created_count{}[1m]))

  8. Cluster restarts:

sum(restarts)

  9. Visibility latencies:

histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))

  10. Resources exhausted due to limits:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
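For the alerting half of the question, these queries drop straight into Prometheus alert rules. A sketch (the thresholds and durations are illustrative assumptions, so tune them to your traffic):

```yaml
groups:
  - name: temporal
    rules:
      # Any resource-exhausted errors mean you are hitting rate/concurrency limits
      - alert: TemporalResourceExhausted
        expr: sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Temporal is hitting limits ({{ $labels.resource_exhausted_cause }})"
      # A low sync match rate means tasks are persisted before dispatch (workers lagging)
      - alert: TemporalLowSyncMatchRate
        expr: sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m])) < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Task queue sync match rate below 95%"
```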

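As a side note, since most of the queries above lean on histogram_quantile: it estimates a quantile by linear interpolation over Prometheus's cumulative histogram buckets. A minimal Python sketch of the idea (an illustration only, not Prometheus's exact implementation):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by
    bound, ending with (float('inf'), total_count), as Prometheus
    exposes them in *_bucket series with the 'le' label.
    """
    total = buckets[-1][1]
    rank = q * total                      # observation rank we want
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound         # quantile falls in the +Inf bucket
            # linearly interpolate within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# e.g. 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # ≈ 0.75
```

This is also why quantile accuracy depends on your bucket boundaries: within a bucket, the estimate is a straight line between the bucket's edges.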