How to observe the Temporal cluster?

Here are a few Grafana queries over the server metrics that might be useful to you:

  1. Service errors by type (change the service_name label to the service you want):

sum(rate(service_error_with_type{service_name="frontend"}[5m])) by (error_type)

  1. Service latencies by operation (again, change service_name if needed):

histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="frontend"}[5m])) by (operation, le))

  1. Persistence latencies by operation:

histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))

  1. Sync match rate:

sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))

  1. Workflow lock contention:

histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))

  1. Shard lock contention:

histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))

  1. Shard movement:

sum(rate(sharditem_created_count{}[1m]))

  1. Cluster restarts:

sum(restarts)

  1. Visibility latencies:

histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))

  1. Resources exhausted due to limits:

sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
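
If you want to run some of these outside Grafana (for a quick check or a script), here is a minimal Python sketch that sends an expression to the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL `http://localhost:9090`, the `QUERIES` keys, and the helper names `build_query`, `query_url`, `parse_vector` and `run_query` are my own assumptions for illustration and not part of Temporal; only three of the queries above are included, and you would adjust the address to wherever your Prometheus is reachable.

```python
import json
import urllib.parse
import urllib.request
from string import Template

# Assumption: Prometheus is reachable here; change this for your setup.
PROMETHEUS_URL = "http://localhost:9090"

# A few of the queries from this post; $service is a placeholder
# filled in by build_query (hypothetical helper, not a Temporal API).
QUERIES = {
    "service_errors": 'sum(rate(service_error_with_type{service_name="$service"}[5m])) by (error_type)',
    "service_latency_p95": 'histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="$service"}[5m])) by (operation, le))',
    "sync_match_rate": "sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))",
}


def build_query(name, service="frontend"):
    """Return the PromQL expression for a named query and service."""
    return Template(QUERIES[name]).substitute(service=service)


def query_url(expr, base=PROMETHEUS_URL):
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": expr})
    return base.rstrip("/") + "/api/v1/query?" + params


def parse_vector(payload):
    """Turn an instant-vector response into {sorted label tuple: float value}."""
    results = {}
    for sample in payload["data"]["result"]:
        labels = tuple(sorted(sample["metric"].items()))
        # Each sample value is [timestamp, "value as string"].
        results[labels] = float(sample["value"][1])
    return results


def run_query(expr, base=PROMETHEUS_URL):
    """Send the query to Prometheus and return the parsed result."""
    with urllib.request.urlopen(query_url(expr, base), timeout=10) as resp:
        return parse_vector(json.load(resp))
```

For example, `run_query(build_query("service_errors", "history"))` would return the per-error-type rates for the history service, assuming Prometheus is up at that address and scraping the cluster.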
