Question about error metrics for Temporal's internal services


We’re building dashboards and are looking for service errors that would be useful to track.
We aren’t interested in client errors, or in service-level errors caused by incorrect or invalid data from the client. We are looking for errors that originate purely in Temporal’s internal services and can affect the overall availability and stability of the system.
I came across some errors that might be useful (service_errors_entity_not_found, service_errors_resource_exhausted) in the metrics package, go.temporal.io/server/common/metrics,
but could not find a description of them. Could someone please suggest which ones to track, or point me to the appropriate docs?

Thank you!

What’s your server version?
If it’s 1.17.0 or above, you can use service_error_with_type to identify frontend service errors:

sum(rate(service_error_with_type{service_type="frontend"}[5m])) by (error_type)

For specific errors, I think you should definitely monitor service_errors_resource_exhausted:

sum(rate(service_errors_resource_exhausted{}[1m])) by (operation, resource_exhausted_cause)

but can also monitor:


and auth
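
If it helps with the dashboard work, here is a minimal sketch of turning the two queries above into requests against the Prometheus HTTP API instant-query endpoint (/api/v1/query). The base URL, the dictionary keys, and the helper name instant_query_url are assumptions made for illustration; only the PromQL expressions come from this thread.

```python
from urllib.parse import urlencode

# The two PromQL expressions from this thread, keyed by hypothetical panel names.
QUERIES = {
    "frontend_errors_by_type": (
        'sum(rate(service_error_with_type{service_type="frontend"}[5m])) '
        "by (error_type)"
    ),
    "resource_exhausted_by_cause": (
        "sum(rate(service_errors_resource_exhausted{}[1m])) "
        "by (operation, resource_exhausted_cause)"
    ),
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    # urlencode percent-encodes braces, quotes, and brackets in the expression.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Assumption: Prometheus is reachable at localhost:9090; adjust for your setup.
    for name, promql in QUERIES.items():
        print(name, instant_query_url("http://localhost:9090", promql))
```

The printed URLs can be fetched with any HTTP client to check that the metrics exist on your server version before wiring them into a dashboard.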