Monitoring a self-hosted cluster for uptime (SLO)

Hello!

I’m working on a team that’s self-hosting a Temporal cluster for other teams to use at our company. We want to monitor the cluster so we can measure its general availability and uptime and track this over time. We use DataDog fairly heavily here, so we’ve been leveraging DataDog’s SLO feature for this.

Our SLO metric is currently configured against the service_error_with_type server-side metric.

Looking at how it’s done for Temporal Cloud, it seems like certain client-side errors that aren’t related to server-side issues count against the SLO, such as:

  • NotFound
  • InvalidArgument
  • TaskAlreadyStarted

However, we’ve had to exclude a few error types as they seem common given our usage. For example, we commonly see the following errors:

  • serviceerror_notfound
  • serviceerror_invalidargument
  • serviceerror_workflowexecutionalreadystarted
  • serviceerror_cancelled
  • serviceerror_taskalreadystarted
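
For illustration, a metric-based SLO with these exclusions could be expressed with queries roughly like the sketch below. This is not our exact configuration: it assumes the server’s service_requests counter is also being scraped into DataDog, that the error metric carries the error types in an error_type tag under these names, and the second query is wrapped across lines here for readability (DataDog expects it on a single line).

    # Total events (denominator): every request the cluster handled
    sum:service_requests{*}.as_count()

    # Bad events: service errors, with the client-caused error types filtered out
    sum:service_error_with_type{
        !error_type:serviceerror_notfound,
        !error_type:serviceerror_invalidargument,
        !error_type:serviceerror_workflowexecutionalreadystarted,
        !error_type:serviceerror_cancelled,
        !error_type:serviceerror_taskalreadystarted
    }.as_count()

    # Good events = total events - bad events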

I’m curious how other folks are going about this.
Does it make sense to exclude the above errors?
Does it make sense to exclude any additional errors?

Thanks all!

“However, we’ve had to exclude a few error types as they seem common given our usage.”

I think it’s ok to exclude the ones you mentioned, as well as the service errors for which no metrics are emitted on the server side. These include (there is some duplication between the lists):

  • frontend service
  • history service
  • matching service
  • telemetry interceptor
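
If it’s useful, one way to sanity-check the exclusion list is to break the error metric down by its error type before deciding what to filter. A sketch, assuming the metric and tag names from the post above:

    # Error counts per error type, to see which ones actually occur in your cluster
    sum:service_error_with_type{*} by {error_type}.as_count()

Error types that never emit a server-side metric won’t appear in that breakdown at all, so they can’t count against an SLO built on service_error_with_type.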

Hope this helps.