I started deploying monitoring and autoscaling for our Temporal services, but I noticed that the avg(temporal_activity_schedule_to_start_latency.sum) value is extremely large while avg(temporal_activity_schedule_to_start_latency.count) is quite small. I am fairly sure we triggered fewer than 100 activities, and even if the unit is ms the latency still looks strangely large. Also, when I actually trigger those workflows I don't notice much latency, so I am wondering whether I am reading these metrics correctly.
For the SDK schedule-to-start latency metrics, you could use these queries (Grafana):
sum by (namespace, task_queue) (rate(temporal_activity_schedule_to_start_latency_bucket[5m]))
sum by (namespace, task_queue) (rate(temporal_workflow_task_schedule_to_start_latency_bucket[5m]))
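On the .sum vs .count question: in a histogram metric, the sum series is the running total of every observed latency value and the count series is the number of observations, so the raw sum is expected to be much larger than the count. Only the ratio of the two is an average latency. A rough sketch of how that could look, assuming Prometheus-style series names (_sum, _count, _bucket) and the same labels as the queries above; the 5m window and the 0.95 quantile are arbitrary choices, and the unit of the result depends on how the SDK metrics are exported:

```promql
# Average activity schedule-to-start latency per task queue over the last 5 minutes
# (assumes the histogram exposes _sum and _count series)
sum by (namespace, task_queue) (rate(temporal_activity_schedule_to_start_latency_sum[5m]))
/
sum by (namespace, task_queue) (rate(temporal_activity_schedule_to_start_latency_count[5m]))

# Approximate 95th percentile; the le label has to be kept for histogram_quantile
histogram_quantile(0.95, sum by (namespace, task_queue, le) (rate(temporal_activity_schedule_to_start_latency_bucket[5m])))
```

With fewer than 100 activities the percentile will be coarse, since it is interpolated from bucket boundaries.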
Can you also show poll task queue latencies:
sum(rate(service_latency_bucket{operation=~"PollActivityTaskQueue|PollWorkflowTaskQueue"}[5m])) by (operation, le)
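If a single latency figure is easier to share than the raw bucket rates, the same expression can be wrapped in histogram_quantile. This is only a sketch: the 0.95 quantile is an arbitrary choice, and it assumes service_latency_bucket is scraped with the operation label shown above.

```promql
# Approximate 95th percentile of poll latency per operation
histogram_quantile(0.95, sum by (operation, le) (rate(service_latency_bucket{operation=~"PollActivityTaskQueue|PollWorkflowTaskQueue"}[5m])))
```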