Handling Internal Temporal Exceptions Without Disrupting Workflow Execution

I’m working with Temporal workflows using the Java SDK and Temporal version 1.17.0. I’ve run into a StatusRuntimeException with the message NOT_FOUND: Workflow task not found. It is thrown when the worker reports workflow progress to the server. Here is the relevant part of the log and stack trace:

[worker] [Workflow Executor taskQueue="task-queue", namespace="namespace"] [] i.t.internal.worker.WorkflowWorker: Failure while reporting workflow progress to the server. If seen continuously the workflow might be stuck. WorkflowId=workflow-id, RunId=run-id, startedEventId=34
io.grpc.StatusRuntimeException: NOT_FOUND: Workflow task not found.
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.respondWorkflowTaskCompleted(WorkflowServiceGrpc.java:3764)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendTaskCompleted$0(WorkflowWorker.java:369)
	at io.temporal.internal.retryer.GrpcSyncRetryer.retry(GrpcSyncRetryer.java:67)
	at io.temporal.internal.retryer.GrpcRetryer.retryWithResult(GrpcRetryer.java:60)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.sendTaskCompleted(WorkflowWorker.java:364)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:259)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:188)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:842)

I would like to record this exception as a metric for monitoring purposes, but I want to make sure that doing so doesn’t interrupt workflow execution.

What is the best practice for catching and handling this exception in a way that allows me to log it as a metric without affecting the workflow’s execution?
Any insights or examples from your experience would be greatly appreciated!

Thank you!

What is the best practice for catching and handling this exception in a way that allows me to log it as a metric without affecting the workflow’s execution?

The SDK already reports this as a metric: temporal_request_failure, which you can filter by the operation and status_code tags.
In your case, operation would be RespondWorkflowTaskCompleted and status_code would be NOT_FOUND.
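
To make the filtering concrete, here is a minimal sketch that sums the matching samples from a metrics scrape. It assumes the worker's metrics are exposed in Prometheus text format and that the counter appears as temporal_request_failure or temporal_request_failure_total (the suffix depends on the reporter). The class name RequestFailureFilter, its methods, and the sample lines are hypothetical and only illustrate the tag filter; in practice you would normally express the same filter as a query in your monitoring system.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Temporal SDK.
class RequestFailureFilter {
    // One Prometheus text-format sample: name{labels} value [timestamp]
    private static final Pattern SAMPLE =
        Pattern.compile("^([a-zA-Z_:][a-zA-Z0-9_:]*)\\{(.*)\\}\\s+(\\S+)(?:\\s+-?\\d+)?$");
    // One label pair inside the braces: key="value"
    private static final Pattern LABEL =
        Pattern.compile("([a-zA-Z_][a-zA-Z0-9_]*)=\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Parses the label section of a sample into a key/value map.
    static Map<String, String> parseLabels(String labels) {
        Map<String, String> result = new HashMap<>();
        Matcher m = LABEL.matcher(labels);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    // Sums temporal_request_failure samples whose operation and status_code
    // labels match, across all other labels (namespace and so on).
    static double sumFailures(List<String> lines, String operation, String statusCode) {
        double total = 0;
        for (String line : lines) {
            Matcher m = SAMPLE.matcher(line.trim());
            if (!m.matches()) {
                continue; // comments, blank lines, samples without labels
            }
            String name = m.group(1);
            // Assumption: the counter may or may not carry a _total suffix.
            if (!name.equals("temporal_request_failure")
                    && !name.equals("temporal_request_failure_total")) {
                continue;
            }
            Map<String, String> labels = parseLabels(m.group(2));
            if (operation.equals(labels.get("operation"))
                    && statusCode.equals(labels.get("status_code"))) {
                total += Double.parseDouble(m.group(3));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical scrape output, for illustration only.
        List<String> scrape = List.of(
            "# TYPE temporal_request_failure_total counter",
            "temporal_request_failure_total{namespace=\"namespace\",operation=\"RespondWorkflowTaskCompleted\",status_code=\"NOT_FOUND\"} 2.0",
            "temporal_request_failure_total{namespace=\"namespace\",operation=\"PollWorkflowTaskQueue\",status_code=\"NOT_FOUND\"} 7.0");
        double total = sumFailures(scrape, "RespondWorkflowTaskCompleted", "NOT_FOUND");
        System.out.println("RespondWorkflowTaskCompleted NOT_FOUND failures: " + total);
    }
}
```

Because the SDK emits the metric itself, nothing here touches workflow code, so the workflow's execution is not affected.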

Perfect, thank you!