StartWorkflowExecution
SignalWorkflowExecution
SignalWithStartWorkflowExecution
These operations do not depend on your SDK workers, so I would measure both latencies and errors for them (service_latency, service_error_with_type). High latencies are typically associated with DB latencies, so check persistence_latency too if these are elevated.
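As a rough sketch of the latency side, the two queries below chart p95 frontend latency for these three operations and p95 persistence latency next to it. They assume the histograms are exported to Prometheus as service_latency_bucket and persistence_latency_bucket with operation and service_name labels; those names and the 1m window are assumptions, so adjust them to your setup.

```promql
# p95 frontend latency per operation (label names are assumptions)
histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="frontend", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution"}[1m])) by (operation, le))

# separate query: p95 persistence latency per persistence operation,
# to compare against when the frontend latency above is elevated
histogram_quantile(0.95, sum(rate(persistence_latency_bucket[1m])) by (operation, le))
```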
Note that UpdateWorkflowExecution means something different as a persistence operation than as a frontend operation. On the frontend it is associated with workflow updates. In persistence latencies, UpdateWorkflowExecution is any update to a workflow execution (signal, activity scheduled or completed, child workflow scheduled or completed, …). It is pretty important to monitor on the persistence latency and error side, but not always as a frontend operation. More on this below.
So for errors, I would alert on Unavailable, Internal, DeadlineExceeded, and possibly ResourceExhausted. As mentioned, be mindful of the resource exhausted cause: a cause of RpsLimit, for example, might not be an issue but actually expected, while causes like BusyWorkflow or SystemOverloaded are ones you might want to alert on.
Errors such as FailedPrecondition, WorkflowExecutionAlreadyStarted, and even Canceled (if it's a client cancel) can be expected and are probably not something you want to alert on, but be aware that they are happening.
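The error guidance above could be sketched as two queries. The error_type label on service_error_with_type, the service_errors_resource_exhausted counter, and its resource_exhausted_cause label are assumptions about how your metrics are exported. The suffix regex is used because the exact error_type values may carry a prefix.

```promql
# errors worth alerting on for the start/signal operations
sum(rate(service_error_with_type{service_name="frontend", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution", error_type=~".*(Unavailable|Internal|DeadlineExceeded)"}[5m])) by (operation, error_type)

# separate query: ResourceExhausted by cause, leaving out the often expected RpsLimit
sum(rate(service_errors_resource_exhausted{service_name="frontend", resource_exhausted_cause!="RpsLimit"}[5m])) by (operation, resource_exhausted_cause)
```

Expected errors such as FailedPrecondition or WorkflowExecutionAlreadyStarted are left out of the first query on purpose; you can still chart them separately without alerting on them.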
DescribeWorkflowExecution
This is similar in that it is not an operation that depends on SDK workers. In this case the frontend service directly queries the primary DB. NotFound or NamespaceNotActive can be a completely normal response here, but Unavailable, Internal, and DeadlineExceeded can again point to potential DB issues or latencies.
QueryWorkflow
GetWorkflowExecutionResult
UpdateWorkflowExecution
These operations do depend on your SDK workers (for UpdateWorkflowExecution this depends on the update completion stage that is set, so be aware of that).
You would still alert on Unavailable/Internal, but things like Canceled, DeadlineExceeded, ResourceExhausted (again, check the cause), WorkflowNotReady, and NotFound can be expected responses in different cases, depending on your SDK workers. WorkflowNotReady can also be expected here if the workflow execution is “stuck”, meaning it is in a state where the workflow task keeps failing due to some intermittent error in workflow code or a non-determinism issue.
GetWorkflowExecutionHistory
GetWorkflowExecutionResult
There are two calls here. GetWorkflowExecutionHistory is explicitly done by your SDK workers when they need to get the execution event history (to perform event history replay, for example).
Clients that sync-wait for execution completion actually do a PollWorkflowExecutionHistory operation, which IIRC is what GetWorkflowExecutionResult is on the frontend side (sorry, can't 100% remember).
Be aware of errors here. Similarly, alert on Unavailable/Internal/DeadlineExceeded, but note that the worker or client can cancel these calls, which can lead to Canceled.
ListWorkflowExecutions
This is a visibility API call, so errors here can indicate possible issues with the visibility store (high latencies or intermittent issues). DeadlineExceeded can indicate high latencies on the visibility store, so it is good to check visibility latencies:
histogram_quantile(0.95, sum(rate(task_latency_bucket{operation=~"VisibilityTask.*", service_name="history"}[1m])) by (operation, le))
PollWorkflowTaskQueue
PollActivityTaskQueue
These are your worker poll operations. Canceled can be related to workers shutting down or being restarted. ResourceExhausted is in most cases related to RpsLimit and is expected if the RPS limits for the namespace are breached. It can also be ConcurrentLimit if you breach your dynamic config value set via
frontend.namespaceCount
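To see which cause is behind ResourceExhausted on the poll operations, a breakdown like the one below can help. It is only a sketch: the service_errors_resource_exhausted counter and the namespace and resource_exhausted_cause labels are assumptions about your metrics setup.

```promql
# ResourceExhausted on worker polls, split by namespace and cause
# (RpsLimit vs ConcurrentLimit); label names are assumptions
sum(rate(service_errors_resource_exhausted{operation=~"PollWorkflowTaskQueue|PollActivityTaskQueue"}[5m])) by (namespace, resource_exhausted_cause)
```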
RegisterNamespace
In terms of what to monitor, this is similar to the other operations that do not depend on SDK workers.
Hope this helps.