I was taking a look at the client metrics in this class but was unable to find more documentation to understand the main differences between long_request_failure and request_failure.
I want to use a metric to detect when our roles lose connectivity with the Temporal server (even if we do not have running workflows at that moment).
Could you provide more information on the metrics above and any recommendation for creating a monitor that alerts us when there is no connectivity between client and server?
Thanks in advance.
what are the main differences between long_request_failure and request_failure
For client operations (SDK client APIs) that are async, for example
async activity/child workflow invocations,
their failures would fall under the request_failure bucket.
On the other hand things like workers long-polling their task queue to get workflow tasks, or any sync client api calls such as:
typedStub.myWorkflowMethod(...); // waits for the workflow to complete
untypedStub.getResult(...); // potentially waits for completion
typedStub.myQueryMethod(); // waits for query to complete
or any sync child workflow/activity invocations
their failures would fall under the long_request_failure bucket.
Note that these are not business-level failures, but failures of gRPC requests to the Temporal frontend service (that is, when io.grpc.Status is not “OK”; see here).
any recommendation for creating a monitor that alerts us when there is no connectivity between client and server?
If you are asking about connection failures from client to server, you could alert on the
temporal_long_request_failure buckets, as well as the associated “_latency” buckets (see the SDK metrics docs here; note that SDK metrics are prefixed with “temporal_”).
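In practice you would normally express the alert as a rule in your monitoring system, but the underlying check can be sketched in plain Java. This is only an illustration, not part of the Temporal SDK: the class FailureRateAlert, its method names, and the threshold are hypothetical, and it assumes you can read two successive samples of the temporal_long_request_failure counter. Since workers keep long-polling their task queues, this counter should grow during a connectivity loss even when no workflows are running.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not a Temporal SDK class): decides whether a
 * monotonically increasing failure counter, such as
 * temporal_long_request_failure, grew fast enough to raise an alert.
 */
final class FailureRateAlert {
    private FailureRateAlert() {}

    /**
     * Failures per second between two samples of the counter.
     * If the counter went backwards we assume the process restarted
     * and treat the current value as the increase since the reset.
     */
    static double failuresPerSecond(long previous, long current, Duration window) {
        if (window.toMillis() <= 0) {
            throw new IllegalArgumentException("window must be at least 1 ms");
        }
        long delta = current >= previous ? current - previous : current;
        return delta / (window.toMillis() / 1000.0);
    }

    /** True when the observed failure rate is strictly above the threshold. */
    static boolean shouldAlert(long previous, long current,
                               Duration window, double thresholdPerSecond) {
        return failuresPerSecond(previous, current, window) > thresholdPerSecond;
    }
}
```

For example, calling FailureRateAlert.shouldAlert(previousSample, currentSample, Duration.ofMinutes(1), 0.5) once per scrape interval would flag a sustained rate above one failure every two seconds. The threshold is an assumption you would tune against your own baseline.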