Categorizing errors emitted from service_error_with_type

I’m working on error-rate alerting to detect Temporal server-side degradation (vs client misuse or expected workflow errors).
From frontend metrics, we see these error types:
InvalidArgument, FailedPrecondition, NotFound, NamespaceAlreadyExists, NamespaceNotActive, WorkflowExecutionAlreadyStarted, WorkflowNotReady, QueryFailed, Canceled, DeadlineExceeded, ResourceExhausted, Unavailable, Internal
We’re trying to answer:

  1. Which of these are the best indicators of true server/platform degradation?

  2. How should DeadlineExceeded, Canceled, and ResourceExhausted be treated for platform-level alerts (signal vs noise)?

  3. Any guidance on categorizing errors into client-side vs server-side?

Server metrics query

sum(rate(service_error_with_type{service_type="frontend"}[1m])) by (error_type)

Typically you would alert on Unavailable and Internal, as these can indicate service instability. When they fire, you need to start looking at other metrics, such as persistence errors, service errors for other service_type values (history/matching/worker), and in some cases server logs, to troubleshoot deeper.
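As a minimal Prometheus sketch of that baseline alert (the 5m window and the 0.1/s threshold are assumptions to tune for your traffic):

```promql
# Fires on sustained frontend Unavailable/Internal errors.
# Threshold and window are placeholders, not recommendations.
sum(rate(service_error_with_type{service_type="frontend", error_type=~"Unavailable|Internal"}[5m]))
  by (error_type) > 0.1
```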

DeadlineExceeded is more often than not related to db latencies/errors/intermittent issues, so it is worth alerting on as well.

Canceled is most of the time client-related, as in the sdk worker/client (that starts/signals/updates/queries executions) cancels the request before the server can give a response back.

ResourceExhausted depends on the resource exhausted cause. If it's things like RpsLimit or ConcurrentLimit, it can be expected when workloads exceed the rps limits you define for a namespace or globally, or the concurrent poller limits you define per namespace or globally across all frontends.
The one cause you should be alerting on is SystemOverloaded, in which case you are overloading the db (persistence) and might want to look at adjusting qps limits in dynamic config.
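If your server version exposes the cause as a metric label, something like the following breaks it down per cause (the metric and label names service_errors_resource_exhausted / resource_exhausted_cause are assumptions here — verify them against your /metrics output):

```promql
# Break ResourceExhausted down by cause so RpsLimit/ConcurrentLimit
# (often expected) can be separated from SystemOverloaded (alert-worthy).
sum(rate(service_errors_resource_exhausted{service_type="frontend"}[1m]))
  by (resource_exhausted_cause)
```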

There is an additional layer to this, which is the operation: a lot of the time these service responses are tied to specific operations for which they might not be something you should alert on either. If you want, list all the operations you see and we can try to go through them. In general, though, what I wrote here as far as "definitely alert" applies across all operations.

Thanks @tihomir
Thanks for the detailed explanation — this is very helpful.

For additional context, the frontend operations we’re currently observing are:

StartWorkflowExecution
SignalWorkflowExecution
SignalWithStartWorkflowExecution
QueryWorkflow
DescribeWorkflowExecution
GetWorkflowExecutionHistory
GetWorkflowExecutionResult
UpdateWorkflowExecution
ListWorkflowExecutions
PollWorkflowTaskQueue
PollActivityTaskQueue
RegisterNamespace

Please let us know if there are specific operations from the above list where error semantics differ and should be handled differently for alerting.

StartWorkflowExecution
SignalWorkflowExecution
SignalWithStartWorkflowExecution

These operations do not depend on your sdk workers, so I would measure both latencies and errors (service_latency, service_error_with_type). High latencies will typically be associated with db latencies, so check persistence_latency too if these are elevated.
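For example, a p95 latency view for these operations, assuming the standard service_latency_bucket histogram is exposed:

```promql
# p95 frontend latency for start/signal operations; if elevated,
# compare against persistence_latency_bucket for the same window.
histogram_quantile(0.95,
  sum(rate(service_latency_bucket{service_type="frontend",
    operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution"}[1m]))
  by (operation, le))
```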
Note that UpdateWorkflowExecution means something different as a persistence operation than as a frontend operation. On the frontend it is associated with workflow updates; in persistence latencies, UpdateWorkflowExecution is any update to a workflow execution (signal, activity schedule or complete, child workflow schedule or complete, …). On the persistence latency and error side it is pretty important to monitor, but not always as a frontend operation; I'll add more info on this below.
So for errors, yes, I would alert on Unavailable, Internal, DeadlineExceeded, and possibly ResourceExhausted, but again be mindful of the resource exhausted cause as mentioned: a cause of RpsLimit might not be an issue but actually expected, while causes like BusyWorkflow/SystemOverloaded are ones you might want to alert on.
Errors such as FailedPrecondition, WorkflowExecutionAlreadyStarted, and even Canceled (if it's a client cancel) can be expected and are probably not something you want to alert on, but be aware that they are happening.

DescribeWorkflowExecution

is similar in that it's not an operation that depends on sdk workers; in this case the frontend service directly queries the primary db. NotFound or NamespaceNotActive can be a completely normal response here, but Unavailable, Internal, and DeadlineExceeded again can point to potential db issues/latencies.

QueryWorkflow
GetWorkflowExecutionResult
UpdateWorkflowExecution

These operations do depend on your sdk workers (UpdateWorkflowExecution depending on the update completion stage set, so be aware of that).
You still alert on Unavailable/Internal, but things like Canceled, DeadlineExceeded, ResourceExhausted (again, check the cause), WorkflowNotReady, and NotFound can be expected responses in different cases depending on your sdk workers. WorkflowNotReady can also be expected here if a workflow execution is "stuck", meaning it is in a state where the workflow task is failing due to some intermittent error in workflow code or non-determinism issues.
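One way to scope a platform alert for these worker-dependent operations to just the server-side error types called out above (a sketch using the same metric as earlier; the window is an assumption):

```promql
# Only Unavailable/Internal for worker-dependent operations;
# Canceled/WorkflowNotReady/NotFound are deliberately excluded as
# they can be expected here.
sum(rate(service_error_with_type{service_type="frontend",
  operation=~"QueryWorkflow|GetWorkflowExecutionResult|UpdateWorkflowExecution",
  error_type=~"Unavailable|Internal"}[5m]))
  by (operation, error_type)
```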

GetWorkflowExecutionHistory
GetWorkflowExecutionResult

There are two calls here. GetWorkflowExecutionHistory is explicitly done by your sdk workers when they need to get execution event history (to perform event history replay, for example).
Clients sync-waiting for execution completion actually do the PollWorkflowExecutionHistory operation,
which iirc is what GetWorkflowExecutionResult maps to on the frontend side (sorry, can't 100% remember).
Be aware of errors here; similarly alert on Unavailable/Internal/DeadlineExceeded, but note that the worker/client can cancel these calls, which can lead to Canceled.

ListWorkflowExecutions

This is a visibility api call, so errors can indicate possible issues with the visibility store (high latencies/intermittent issues). DeadlineExceeded can indicate possibly high latencies on the visibility store, so it's good to check visibility task latencies:
histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))

PollWorkflowTaskQueue
PollActivityTaskQueue

These are your worker poll operations. Canceled can be related to workers shutting down/being restarted. ResourceExhausted is in most cases related to RpsLimit and expected if the rps limits for a namespace are breached. It can also be ConcurrentLimit if the concurrent poller count breaches the dynamic config value set via
frontend.namespaceCount
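To see which of these poll errors are actually occurring before deciding what to alert on, a breakdown sketch using the same metric as above:

```promql
# Poll-operation errors by type; Canceled here often just reflects
# worker restarts, so gate any platform alert on Unavailable/Internal.
sum(rate(service_error_with_type{service_type="frontend",
  operation=~"PollWorkflowTaskQueue|PollActivityTaskQueue"}[1m]))
  by (operation, error_type)
```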

RegisterNamespace

Treat this similarly to the other operations that do not depend on sdk workers in terms of what to monitor.

Hope this helps.

And just to add, so you don't come after me when your Unavailable/Internal alerts trigger at 3am :wink:
If you are doing expected service restarts (for example, server version upgrades) or service pod scaling (up or down), it is expected for clients to see some rate of these responses during that period.

If it's unexpected, then yes, alert.
For history hosts you can watch these metrics as well and alert on spikes when unexpected. Spikes can be related to high history host memory (and cpu), for example, or high db latencies too:

sum(rate(sharditem_created_count{service_name="history"}[1m]))
sum(rate(sharditem_removed_count{service_name="history"}[1m]))
sum(rate(sharditem_closed_count{service_name="history"}[1m]))
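A spike-detection sketch on the first of these, comparing the current rate against the same series an hour earlier (the 3x factor and 1h offset are assumptions to tune):

```promql
# Fires when shard acquisitions spike vs. an hour ago -- often
# correlated with history host memory/cpu pressure or db latencies.
sum(rate(sharditem_created_count{service_name="history"}[5m]))
  > 3 * sum(rate(sharditem_created_count{service_name="history"}[5m] offset 1h))
```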

Thanks @tihomir, this is very clear and answers our questions well.

We’ll treat Unavailable, DeadlineExceeded, and Internal as primary platform-degradation signals.

We’ll exclude Canceled from platform-level alerting and handle ResourceExhausted conditionally, alerting only on SystemOverloaded while treating rate/concurrency-limit violations as expected behavior.

This gives us a solid framework to separate true server-side degradation from client behavior. Appreciate the detailed guidance.