Help with failure detection: Task processing failed with error. Activity Argument size failure. Context deadline exceeded. WorkflowTaskTimedOut

Hello. We had a workflow fail to spawn an activity because the arguments were too large. The workflow attempts to spawn the activity, and 10s after the attempt begins the worker prints the following error (with tags):

Task processing failed with error
{
	"WorkerType": "WorkflowWorker",
	"level": "info",
	"Error": "context deadline exceeded",
	"TaskQueue": "FOO_BAR",
	"time": "2023-09-27T13:20:26Z",
	"WorkerID": "1@5a9f11d9c0c5@",
	"Namespace": "default"
}

The worker produces no other logs related to the error, even with verbose logging enabled. The worker metrics emit a temporal_request_failure counter with a value of 1 for the RespondWorkflowTaskCompleted operation. The metric is tagged with the workflow type, but not with a run ID or workflow ID.

When we view the workflow in the Temporal UI, it is annotated with the following error (I only show a subset of the JSON here):

    "details": {
      "scheduledEventId": "9",
      "startedEventId": "10",
      "timeoutType": "StartToClose",
      "eventId": "11",
      "eventType": "WorkflowTaskTimedOut",
      "kvps": [
        {
          "key": "eventTime",
          "value": "Sep 26th 12:37:56 pm"
        },
        {
          "key": "eventId",
          "value": "11"
        },
        {
          "key": "scheduledEventId",
          "value": "9"
        },
        {
          "key": "startedEventId",
          "value": "10"
        },
        {
          "key": "timeoutType",
          "value": "StartToClose"
        }
      ],
      "eventTime": "Sep 26th 12:37:56 pm"
    }

Maybe also worth noting that we have tracing set up for the workflow and no errors are reported in the traces.
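
Since the root cause here was oversized activity arguments, a size guard before scheduling the activity might look roughly like the sketch below. This is only an illustration under stated assumptions: `checkArgSize` and `maxArgBytes` are hypothetical names, not part of the Temporal SDK, the 2 MiB budget is an assumed value that should be matched to the cluster's configured limits, and JSON encoding is used only as an approximation of what the data converter would produce.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// maxArgBytes is an assumed budget, not an SDK constant. Adjust it to the
// payload size limit configured on your cluster.
const maxArgBytes = 2 * 1024 * 1024

// checkArgSize JSON-encodes each argument and returns the total encoded size.
// It returns an error if an argument cannot be encoded or if the total
// exceeds limit. The JSON size only approximates the real payload size.
func checkArgSize(limit int, args ...interface{}) (int, error) {
	total := 0
	for i, a := range args {
		b, err := json.Marshal(a)
		if err != nil {
			return total, fmt.Errorf("argument %d is not JSON-encodable: %w", i, err)
		}
		total += len(b)
	}
	if total > limit {
		return total, fmt.Errorf("activity arguments are %d bytes, over the %d byte budget", total, limit)
	}
	return total, nil
}

func main() {
	// A small argument stays under the budget.
	small := map[string]string{"orderID": "42"}
	if n, err := checkArgSize(maxArgBytes, small); err == nil {
		fmt.Printf("small argument ok (%d bytes)\n", n)
	}

	// A 3 MiB string exceeds the assumed 2 MiB budget.
	large := strings.Repeat("x", 3*1024*1024)
	if _, err := checkArgSize(maxArgBytes, large); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

A check like this would let the workflow fail with an explicit error that carries its own context, instead of surfacing only as a timed-out workflow task. It does not answer the detection questions below.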

Questions:

  1. What’s the recommended way to catch errors like this and identify the affected workflow ID and run ID? None of the reporting seems to include the information needed to track down a specific failed run.
  2. The history service is aware of the error and reports it in the UI. I have enabled verbose logging on all Temporal services, but as far as I can tell none of them log this error. Is there a suggested way for us to listen for workflow errors and report on their content?

FWIW, we have a low workflow volume and are very sensitive to workflow failures, so we are interested in setting up alerting that lets us quickly identify individual workflow failures.

I was able to add a grpctrace.UnaryClientInterceptor to clientOptions.ConnectionOptions.DialOptions and capture the gRPC request error as it happens. Are there other interceptors that might provide more insight?

More generally, I am mostly interested in understanding how I might listen for WorkflowTaskTimedOut events being added to a workflow's event history.
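
To make that last point concrete, here is a rough sketch of the check I have in mind, applied to history events that have already been fetched and serialized as JSON. It assumes a flat JSON array of events carrying the same fields as the UI excerpt above. Real history exports may nest these fields differently, so the `historyEvent` struct and the `findTimedOutWorkflowTasks` helper are hypothetical and would need adapting.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// historyEvent mirrors only the fields visible in the UI excerpt.
// The flat layout is an assumption about the serialized form.
type historyEvent struct {
	EventID          string `json:"eventId"`
	EventType        string `json:"eventType"`
	EventTime        string `json:"eventTime"`
	TimeoutType      string `json:"timeoutType"`
	ScheduledEventID string `json:"scheduledEventId"`
	StartedEventID   string `json:"startedEventId"`
}

// findTimedOutWorkflowTasks decodes a JSON array of events and returns the
// ones whose eventType is WorkflowTaskTimedOut.
func findTimedOutWorkflowTasks(raw []byte) ([]historyEvent, error) {
	var events []historyEvent
	if err := json.Unmarshal(raw, &events); err != nil {
		return nil, fmt.Errorf("decode history: %w", err)
	}
	var hits []historyEvent
	for _, e := range events {
		if e.EventType == "WorkflowTaskTimedOut" {
			hits = append(hits, e)
		}
	}
	return hits, nil
}

func main() {
	// Sample input built from the event IDs shown in the UI excerpt.
	raw := []byte(`[
		{"eventId": "10", "eventType": "WorkflowTaskStarted"},
		{"eventId": "11", "eventType": "WorkflowTaskTimedOut",
		 "timeoutType": "StartToClose",
		 "scheduledEventId": "9", "startedEventId": "10"}
	]`)

	hits, err := findTimedOutWorkflowTasks(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range hits {
		fmt.Printf("event %s timed out (%s), started at event %s\n",
			e.EventID, e.TimeoutType, e.StartedEventID)
	}
}
```

Polling histories like this would work at our volume, but it still requires knowing which workflow and run to inspect, so a push-style notification would be preferable if one exists.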