Monitoring and logging heartbeat timeouts

A couple questions related to monitoring and logging heartbeat timeouts:

  1. Is there a default metric for heartbeat timeouts by activity? I didn’t see one but could have missed it.
  2. If I wanted to log a message when an activity heartbeat timeout occurs, how could I do that? It seems like I could catch ActivityTimeoutException and check for TimeoutType=HEARTBEAT, but I can’t catch that from the workflow if I’m also using retries until the retries expire.

Is there a default metric for heartbeat timeouts by activity? I didn’t see one but could have missed it.

When an activity that has already timed out sends a heartbeat, the request fails and the Java SDK emits the metric “temporal_request_failure” with these tags:

  • “Operation” == “RecordActivityTaskHeartbeat”
  • “StatusCode” == “NOT_FOUND”
  • “Namespace” == activity namespace
  • “ActivityType” == activity type
  • “WorkflowType” == type of the workflow that invoked the activity
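
As a rough illustration of how those tags could be used to break failures down by activity, here is a minimal Java sketch. It assumes you already collect the tag sets of “temporal_request_failure” samples somewhere in your own metrics pipeline; the class HeartbeatFailureFilter, its methods, and the sample data are hypothetical and not part of the Temporal SDK. Only the metric name and the tag names and values come from the list above. Note that a match means the heartbeat was rejected with NOT_FOUND, which is not necessarily a heartbeat timeout.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, NOT part of the Temporal SDK. It only inspects tag maps
// that your own metrics pipeline has already collected.
final class HeartbeatFailureFilter {

    // True when a sample is a heartbeat request rejected with NOT_FOUND,
    // using the metric name and tag names/values quoted in the list above.
    static boolean isRejectedHeartbeat(String metricName, Map<String, String> tags) {
        return "temporal_request_failure".equals(metricName)
                && "RecordActivityTaskHeartbeat".equals(tags.get("Operation"))
                && "NOT_FOUND".equals(tags.get("StatusCode"));
    }

    // Counts rejected heartbeats per ActivityType. Each element of samples is
    // assumed to be the tag set of one temporal_request_failure sample.
    static Map<String, Integer> countByActivityType(List<Map<String, String>> samples) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> tags : samples) {
            if (isRejectedHeartbeat("temporal_request_failure", tags)) {
                counts.merge(tags.getOrDefault("ActivityType", "unknown"), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up samples for illustration; "ProcessFile" is a hypothetical activity type.
        List<Map<String, String>> samples = List.of(
                Map.of("Operation", "RecordActivityTaskHeartbeat",
                       "StatusCode", "NOT_FOUND", "ActivityType", "ProcessFile"),
                Map.of("Operation", "RecordActivityTaskHeartbeat",
                       "StatusCode", "NOT_FOUND", "ActivityType", "ProcessFile"),
                Map.of("Operation", "RespondActivityTaskCompleted",
                       "StatusCode", "NOT_FOUND", "ActivityType", "ProcessFile"));
        System.out.println(countByActivityType(samples)); // prints {ProcessFile=2}
    }
}
```

In practice the same filter would more likely be expressed as a query in whatever metrics backend you use; the sketch only shows which tag combination to select on.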

Currently, the metric doesn’t distinguish why the activity is no longer valid. The same failure is reported if the workflow has closed or if the activity timed out due to some other timeout (such as start-to-close).

The server emits a “heartbeat_timeout” metric, but at this point it is tagged by namespace only.

I believe the activity heartbeating or completion failures that return “NOT_FOUND” are logged by the SDK.
