Monitoring a self-hosted cluster for uptime (SLO)

Hello!

I’m working on a team that’s self-hosting a Temporal cluster for other teams to use at our company. We want to monitor the cluster so we can measure its general availability and uptime and track this over time. We use DataDog fairly heavily here, so we’ve been leveraging DataDog’s SLO feature for this.

Our SLO metric is currently configured against the service_error_with_type server-side metric.

Looking at how it’s done for Temporal Cloud, it seems like certain client-side errors that aren’t related to server-side issues count against the SLO, such as:

  • NotFound
  • InvalidArgument
  • TaskAlreadyStarted

However, we’ve had to exclude a few error types as they seem common given our usage. For example, we commonly see the following errors:

  • serviceerror_notfound
  • serviceerror_invalidargument
  • serviceerror_workflowexecutionalreadystarted
  • serviceerror_cancelled
  • serviceerror_taskalreadystarted
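
For illustration, a metric-based SLO with these exclusions could be expressed with queries roughly like the sketch below. This is not our exact configuration: it assumes the server’s service_requests counter is also being scraped into DataDog, that the error metric carries the error types in an error_type tag under these names, and the second query is wrapped across lines here for readability (DataDog expects it on a single line).

    # Total events (denominator): every request the cluster handled
    sum:service_requests{*}.as_count()

    # Bad events: service errors, with the client-caused error types filtered out
    sum:service_error_with_type{
        !error_type:serviceerror_notfound,
        !error_type:serviceerror_invalidargument,
        !error_type:serviceerror_workflowexecutionalreadystarted,
        !error_type:serviceerror_cancelled,
        !error_type:serviceerror_taskalreadystarted
    }.as_count()

    # Good events = total events - bad events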

I’m curious how other folks are going about this.
Does it make sense to exclude the above errors?
Does it make sense to exclude any additional errors?

Thanks all!

“However, we’ve had to exclude a few error types as they seem common given our usage.”

I think it’s ok to exclude the ones you mentioned, as well as the service errors for which no metrics are emitted on the server side. These include (there is some duplication between the lists):

  • frontend service
  • history service
  • matching service
  • telemetry interceptor
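
If it’s useful, one way to sanity-check the exclusion list is to break the error metric down by its error type before deciding what to filter. A sketch, assuming the metric and tag names from the post above:

    # Error counts per error type, to see which ones actually occur in your cluster
    sum:service_error_with_type{*} by {error_type}.as_count()

Error types that never emit a server-side metric won’t appear in that breakdown at all, so they can’t count against an SLO built on service_error_with_type.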

Hope this helps.