Client metrics to detect connectivity issues

Hi team,

I was looking at the client metrics in this class but could not find documentation explaining the main differences between long_request_failure and request_failure.

I want to use a metric to detect when our roles lose connectivity with the Temporal server (even when we have no running workflows at that moment).

Could you provide more information on the metrics above, and any recommendations for creating a monitor that alerts us when there is no connectivity between client and server?

Thanks in advance.

what are the main differences between long_request_failure and request_failure

For client operations (SDK client APIs) that are async, for example:

WorkflowClient.start(...);
...
workflowStub.mySignalMethod(...);
...
async activity/child workflow invocations

failures fall under the request_failure bucket.

On the other hand, for workers long-polling their task queues to get workflow tasks, or any sync client API calls such as:

typedStub.myWorkflowMethod(...); // waits for the workflow to complete
untypedStub.getResult(...); // potentially waits for completion
typedStub.myQueryMethod(); // waits for the query to complete
or any sync child workflow/activity invocations

failures fall under the long_request_failure bucket.

Note that these are not business-level failures, but failures of gRPC requests to the Temporal frontend service (that is, when io.grpc.Status is not “OK”; see here).

any recommendations for creating a monitor that alerts us when there is no connectivity between client and server?

If you are asking about connection failures from client to server, you could alert on the temporal_request_failure and temporal_long_request_failure buckets, as well as the associated “_latency” buckets (see the SDK metrics docs here; note that SDK metrics are prefixed with “temporal_”).
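As a rough illustration of that alert condition, here is a minimal sketch in plain Java. It assumes you can read the two failure counters as cumulative values at regular scrape intervals. The class name, the Sample holder, and the threshold are all hypothetical and are not part of the Temporal SDK; in practice you would most likely express the same rule in your monitoring system's own query language rather than in application code. The latency metrics are not covered here.

```java
import java.util.Arrays;
import java.util.List;

public class ConnectivityAlertSketch {

    /** One scrape of the two cumulative failure counters (hypothetical holder type). */
    public static final class Sample {
        final long epochSeconds;
        final double requestFailures;     // value of temporal_request_failure
        final double longRequestFailures; // value of temporal_long_request_failure

        public Sample(long epochSeconds, double requestFailures, double longRequestFailures) {
            this.epochSeconds = epochSeconds;
            this.requestFailures = requestFailures;
            this.longRequestFailures = longRequestFailures;
        }
    }

    /** Failures per second between two scrapes, summed over both counters. */
    public static double failureRate(Sample older, Sample newer) {
        long elapsed = newer.epochSeconds - older.epochSeconds;
        if (elapsed <= 0) {
            throw new IllegalArgumentException("samples must be in increasing time order");
        }
        double delta = (newer.requestFailures - older.requestFailures)
                + (newer.longRequestFailures - older.longRequestFailures);
        // A negative delta (e.g. counters reset after a process restart) is treated as zero.
        return Math.max(0.0, delta) / elapsed;
    }

    /**
     * Alerts only if every consecutive pair of scrapes in the window exceeds the
     * threshold, so a single transient failure burst does not fire the monitor.
     */
    public static boolean shouldAlert(List<Sample> window, double thresholdPerSecond) {
        if (window.size() < 2) {
            return false; // not enough data to compute a rate
        }
        for (int i = 1; i < window.size(); i++) {
            if (failureRate(window.get(i - 1), window.get(i)) <= thresholdPerSecond) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up counter values, one scrape per minute.
        List<Sample> window = Arrays.asList(
                new Sample(0, 0, 0),
                new Sample(60, 30, 30),
                new Sample(120, 60, 90));
        // Placeholder threshold of 0.5 failures per second; tune it for your setup.
        System.out.println("alert = " + shouldAlert(window, 0.5));
    }
}
```

Requiring the rate to stay above the threshold for the whole window is a design choice to reduce noise from short blips; a shorter window reacts faster but may page more often.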