To set up an SLO for error rate, which metric would be best: service_errors, or service_error_with_type excluding client error types?
service_error_with_type, excluding client error types.
Which operations are you including in your SLO check?
Just for context: we have a self-hosted Temporal cluster, and we want to set up alerts for when there is an actual server-side problem.
As per the Temporal OSS code, all expected and client-side errors are already excluded from the service_errors metric:
func isExpectedErrorByType(err error) bool {
	// This is not a full list of service errors.
	// Only errors with status code that fails the isExpectedErrorByStatusCode() check
	// but are actually expected need to be explicitly handled here.
	//
	// Some of the errors listed below does not failed the isExpectedErrorByStatusCode() check
	// but are listed nonetheless.
	switch err := err.(type) {
	case *serviceerror.ResourceExhausted:
		return err.Scope == enumspb.RESOURCE_EXHAUSTED_SCOPE_NAMESPACE
	case *serviceerror.Canceled,
		*serviceerror.AlreadyExists,
		*serviceerror.CancellationAlreadyRequested,
		*serviceerror.FailedPrecondition,
		*serviceerror.NamespaceInvalidState,
		*serviceerror.NamespaceNotActive,
		*serviceerror.NamespaceNotFound,
		*serviceerror.NamespaceAlreadyExists,
		*serviceerror.InvalidArgument,
		*serviceerror.WorkflowExecutionAlreadyStarted,
		*serviceerror.WorkflowNotReady,
		*serviceerror.NotFound,
		*serviceerror.QueryFailed,
		*serviceerror.ClientVersionNotSupported,
		*serviceerror.ServerVersionNotSupported,
		*serviceerror.PermissionDenied,
		*serviceerror.NewerBuildExists,
		*serviceerrors.StickyWorkerUnavailable,
		*serviceerrors.TaskAlreadyStarted,
		*serviceerrors.RetryReplication,
		*serviceerrors.SyncState:
		return true
	default:
		return false
	}
}
Then it would be better to use the service_errors metric instead of service_error_with_type, right? Using service_error_with_type would add the overhead of maintaining a list of error types to exclude.
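To make the trade-off concrete, here is a minimal Go sketch of the two ways the error-rate SLI could be computed from counter increases over one SLO window. Everything in it is an assumption for illustration: the helper names errorRatio and serverSideErrors, the sample numbers, and the error-type label values are hypothetical and do not come from Temporal or from our dashboards.

```go
package main

import "fmt"

// errorRatio returns failed/total, or 0 when there was no traffic.
func errorRatio(failed, total float64) float64 {
	if total == 0 {
		return 0
	}
	return failed / total
}

// serverSideErrors sums per-type error counts, skipping every type
// that appears in the excluded set.
func serverSideErrors(byType map[string]float64, excluded map[string]bool) float64 {
	var sum float64
	for errType, count := range byType {
		if !excluded[errType] {
			sum += count
		}
	}
	return sum
}

func main() {
	// Hypothetical counter increases over one SLO window.
	requests := 10000.0

	// Option 1: service_errors, which the server has already filtered
	// through isExpectedErrorByType.
	serviceErrors := 12.0

	// Option 2: service_error_with_type, grouped by error type.
	// The label values here are illustrative, not verified.
	byType := map[string]float64{
		"NotFound":         240,
		"InvalidArgument":  60,
		"Unavailable":      9,
		"DeadlineExceeded": 3,
	}

	// This exclusion list is the part we would have to keep in sync
	// with the server's own list by hand.
	excluded := map[string]bool{
		"NotFound":        true,
		"InvalidArgument": true,
	}

	fmt.Printf("service_errors ratio:          %.4f\n",
		errorRatio(serviceErrors, requests))
	fmt.Printf("service_error_with_type ratio: %.4f\n",
		errorRatio(serverSideErrors(byType, excluded), requests))
}
```

With option 2, any error type missing from the excluded set is counted against the SLO, so the list has to be reviewed whenever the server adds or reclassifies an error type.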