Hello!
I’m working on a team that’s self-hosting a Temporal cluster for other teams to use at our company. We’re interested in monitoring the cluster for the purpose of measuring its general availability and uptime and tracking this over time. We use DataDog fairly heavily here, so we’ve been leveraging DataDog’s SLO feature for this.
Our SLO metric is currently configured as:
Looking at how it’s done for Temporal Cloud, it seems like certain client-side errors, which are not related to server-side issues, count against the SLO, such as:
- NotFound
- InvalidArgument
- TaskAlreadyStarted
Our SLO metric is configured against the service_error_with_type server-side metric. However, we’ve had to exclude a few error types as they seem common given our usage. For example, we commonly see the following errors:
- serviceerror_notfound
- serviceerror_invalidargument
- serviceerror_workflowexecutionalreadystarted
- serviceerror_cancelled
- serviceerror_taskalreadystarted
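
To make the exclusion concrete, here is a rough sketch of the calculation we have in mind. It is only an illustration: the function name, the argument names, and the idea of feeding it a total request count alongside per-type error counts are assumptions for this example, not how DataDog or Temporal actually expose the data.

```python
from typing import Mapping

# Error types we treat as caller-side and leave out of the SLO
# (the list from this post).
EXCLUDED_ERROR_TYPES = frozenset({
    "serviceerror_notfound",
    "serviceerror_invalidargument",
    "serviceerror_workflowexecutionalreadystarted",
    "serviceerror_cancelled",
    "serviceerror_taskalreadystarted",
})


def availability(total_requests: int, errors_by_type: Mapping[str, int]) -> float:
    """Fraction of requests that did not fail with a counted error type."""
    if total_requests <= 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    # Only errors outside the exclusion list count against the SLO.
    bad = sum(
        count
        for error_type, count in errors_by_type.items()
        if error_type.lower() not in EXCLUDED_ERROR_TYPES
    )
    return (total_requests - bad) / total_requests
```

For example, with 10,000 requests, 120 serviceerror_notfound errors, and 5 errors of some other, non-excluded type, only those 5 would count against the SLO, so the availability for that window would be 9,995 / 10,000.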
How are other folks going about this?
Does it make sense to exclude the above errors?
Does it make sense to exclude any additional errors?
Thanks all!