Here are a few Grafana queries for server metrics that might be useful to you:
- Service errors by type (change service_name to the service you want):
sum(rate(service_error_with_type{service_name="frontend"}[5m])) by (error_type)
- Service latencies by operation (again, you can change service_name if needed):
histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="frontend"}[5m])) by (operation, le))
- Persistence latencies by operation:
histogram_quantile(0.95, sum(rate(persistence_latency_bucket{}[1m])) by (operation, le))
- Sync match rate:
sum(rate(poll_success_sync{}[1m])) / sum(rate(poll_success{}[1m]))
- Workflow lock contention:
histogram_quantile(0.99, sum(rate(cache_latency_bucket{operation="HistoryCacheGetOrCreate"}[1m])) by (le))
- Shard lock contention:
histogram_quantile(0.99, sum(rate(lock_latency_bucket{operation="ShardInfo"}[1m])) by (le))
- Shard movement:
sum(rate(sharditem_created_count{}[1m]))
- Cluster restarts:
sum(restarts)
- Visibility latencies:
histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))
- Resources exhausted due to limits:
sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
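Most of the latency queries above share one shape: histogram_quantile over a summed rate of a `_bucket` series, grouped by `le` plus optional labels. If you generate dashboards from a script, that pattern could be templated roughly like this. This is only a sketch: the helper names `selector` and `latency_quantile` are made up for illustration, and it only handles exact-match (`=`) label matchers, so the visibility query with `=~` would still need to be written by hand.

```python
def selector(labels):
    """Render exact-match label matchers, e.g. {service_name="frontend"}."""
    inner = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return "{" + inner + "}"


def latency_quantile(metric, quantile, window="1m", labels=None, by=("operation",)):
    """Build a histogram_quantile query over <metric>_bucket.

    `metric` is the histogram name without the _bucket suffix.
    `by` lists extra grouping labels; `le` is always appended because
    histogram_quantile needs it.
    """
    sel = selector(labels or {})
    group = ", ".join(list(by) + ["le"])
    return (
        f"histogram_quantile({quantile}, "
        f"sum(rate({metric}_bucket{sel}[{window}])) by ({group}))"
    )


# Reproduces the service latency query from the list above.
frontend_p95 = latency_quantile(
    "service_latency", 0.95, window="5m", labels={"service_name": "frontend"}
)

# Reproduces the shard lock contention query (no grouping besides le).
shard_lock_p99 = latency_quantile(
    "lock_latency", 0.99, labels={"operation": "ShardInfo"}, by=()
)

if __name__ == "__main__":
    print(frontend_p95)
    print(shard_lock_p99)
```

Swapping `service_name` or the quantile then becomes a one-argument change instead of editing each panel by hand.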