How to observe the Temporal cluster?

I am trying to improve the observability of our Temporal cluster, and I am not sure if there is a best practice for measuring and monitoring whether the server is healthy and whether the applications using it are running well.
I have tried to define a couple of metrics to describe the situation, and I would appreciate any documents that address the following questions:

  1. how to observe the following metrics, build them into one dashboard, and set alerts on them

  2. whether these metrics make sense, or whether I need other metrics for observation

  • Whether tasks are piling up in queues
    • How many task queues there are, and the workers/signals/queries related to them
    • Unhandled tasks piling up in queues
    • Activity event processing time
  • Server-side gRPC method metrics
    • Total QPS and Successful QPS
    • Latency
  • Failures and risks in Workflow Executions and Activities
    • Activity failure: if one activity (probably all activities of the same type) fails despite a tolerant retry policy, how to find it quickly
    • Activity latency: if one activity’s processing time grows to an abnormal level, how to observe it
    • The same failure and latency questions for Workflow Executions
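For the task-queue and activity questions above, the SDKs already emit metrics you can query directly. A hedged sketch in PromQL (the metric names below are the common Go/Java SDK names with the default temporal_ prefix, but names vary by SDK and version, so check your own /metrics endpoint first):

```promql
# Backlog proxy: how long tasks sit in a queue before a worker picks them up
histogram_quantile(0.95, sum(rate(temporal_activity_schedule_to_start_latency_bucket[5m])) by (task_queue, le))

# Activity failures by type (catches one activity type failing despite retries)
sum(rate(temporal_activity_execution_failed[5m])) by (activity_type)

# Activity processing time by type (catches abnormal latency growth)
histogram_quantile(0.95, sum(rate(temporal_activity_execution_latency_bucket[5m])) by (activity_type, le))
```

A rising schedule-to-start latency on a task queue is usually the clearest signal that tasks are piling up faster than workers can handle them.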

A couple of related docs that might be helpful to you:

For SDK Metrics:
Worker tuning guide: How to tune Workers | Temporal Documentation
Lists of metrics produced:
Go SDK
Java SDK
and SDK metrics docs: SDK metrics | Temporal Documentation

If you want to get started quickly, take a look at this docker-compose repo: it has built-in Prometheus with scrape points, as well as SDK and server dashboards that you can use out of the box to point you in the right direction. Hope this helps.
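If you wire it up yourself instead, the scrape config is small. A sketch of a prometheus.yml (the job names, host names, and ports here are assumptions, so match them to your deployment: the server port comes from the Prometheus listen address in the server config, and the worker port is wherever your worker exposes its metrics handler):

```yaml
scrape_configs:
  # Temporal server metrics (frontend/history/matching/worker services)
  - job_name: temporal-server
    static_configs:
      - targets: ['temporal:8000']    # assumed server metrics listen address
  # SDK metrics from your own workers
  - job_name: temporal-workers
    static_configs:
      - targets: ['my-worker:9090']   # assumed worker metrics endpoint
```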


To add a couple of server-metrics Grafana queries that might be useful to you:

  1. Service errors by type (change service_type to the service you want):

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)

  2. Service latencies by operation (again, you can change service_type if needed):

histogram_quantile(0.95, sum(rate(service_latency_bucket{service_type="frontend"}[5m])) by (operation, le))

  3. Persistence latencies by operation:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

  4. Sync match rate:

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

  5. Workflow lock contention:

histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

  6. Shard lock contention:

histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))

  7. Shard movement:

sum(rate(sharditem_created_count{}[1m]))

  8. Cluster restarts:

sum(restarts)

  9. Visibility latencies:

histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))

  10. Resources exhausted due to limits:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
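For the alerting half of the question, these queries drop straight into Prometheus alert rules. A sketch (the thresholds and durations are illustrative assumptions, so tune them to your traffic):

```yaml
groups:
  - name: temporal
    rules:
      # Any resource-exhausted errors mean you are hitting rate/concurrency limits
      - alert: TemporalResourceExhausted
        expr: sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Temporal is hitting limits ({{ $labels.resource_exhausted_cause }})"
      # A low sync match rate means tasks are persisted before dispatch (workers lagging)
      - alert: TemporalLowSyncMatchRate
        expr: sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m])) < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Task queue sync match rate below 95%"
```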

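As a side note, since most of the queries above lean on histogram_quantile: it estimates a quantile by linear interpolation over Prometheus's cumulative histogram buckets. A minimal Python sketch of the idea (an illustration only, not Prometheus's exact implementation):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by
    bound, ending with (float('inf'), total_count), as Prometheus
    exposes them in *_bucket series with the 'le' label.
    """
    total = buckets[-1][1]
    rank = q * total                      # observation rank we want
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound         # quantile falls in the +Inf bucket
            # linearly interpolate within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# e.g. 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 60), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # ≈ 0.75
```

This is also why quantile accuracy depends on your bucket boundaries: within a bucket, the estimate is a straight line between the bucket's edges.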