CPU Usage Metrics

Are there any guidelines on which metrics to watch for the Temporal services? We are seeing very high CPU usage on our Temporal ECS task and are trying to understand the spike.

We are adding more CPU; however, we can't find any guidelines on how to determine these parameters (CPU, memory, etc.).

Are there guidelines on what alerts/thresholds to set for health monitoring?

Currently we have an alert set at 80% CPU utilization on the history service.

We would appreciate any guidelines or pointers on the topic.

Thanks!

If you have server metrics enabled, maybe start with request and error rates:

sum by (operation) (rate(service_requests{service_name="history"}[$rate]))

(do the same for service errors, using service_errors)
and let's go from there.

For memory:

avg(cache_usage{cache_type="mutablestate"})
avg(cache_pinned_usage{cache_type="mutablestate"})

Also look at cache_requests:

sum(rate(cache_requests{cache_type="events",operation="EventsCachePutEvent"}[1m]))
sum(rate(cache_requests{cache_type="events",operation="EventsCacheGetFromStore"}[1m]))

See whether any of these correlate with your resource utilization.
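
If eyeballing the dashboards is not conclusive, one rough way to check the correlation is to export both series over the same time window and compute a Pearson coefficient. This is only a minimal sketch: the pearson helper, the variable names, and the sample numbers are all made up for illustration, and it assumes you have already pulled the request rate and CPU values at matching timestamps.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples taken at the same timestamps (illustrative numbers only).
request_rate = [120.0, 150.0, 400.0, 380.0, 130.0]  # e.g. history requests/sec
cpu_percent = [35.0, 40.0, 85.0, 80.0, 38.0]        # e.g. ECS task CPU %

r = pearson(request_rate, cpu_percent)
print(f"correlation: {r:.2f}")
```

A value close to 1 would suggest the CPU spike follows request load, while a value near 0 would point toward something else, such as cache behavior. Correlation alone does not prove the cause, so treat it as a hint for where to look next.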