Failed workflow task due to unhandled command

Hi,

We sometimes (randomly) see the error below, after which the workflow starts executing from the beginning.

internal.worker.PollerOptions: Failed workflow task due to unhandled command. This error is likely recoverable. 
 java.lang.RuntimeException: Failure processing workflow task.
WorkflowId=3DF2EF2A728E41D59627207E46969686@AVABAgA, 
RunId=2d9ffd4a-5fa1-4a08-8e3f-9a0e2ec63cd0, Attempt=1

Is there something wrong with our code or configuration that we should check?
We would really appreciate it if someone could shed more light on why we see such errors.

Hi @maxim,
If there is nothing wrong with our code and config, and it's not a bug in Temporal, then are we essentially saying that this is expected behaviour?
Please note, we had no other workflow running concurrently when we saw this error.

Also, the error message says “recoverable”, but we see that the workflow could not recover and was simply marked as “WorkflowExecutionTimeout” after the configured timeout.

If the workflow randomly fails for no reason then it is very difficult to rely on its execution.

Would you post the history of the workflow? The error message you posted is benign and should not lead to workflow failure or timeout.

Hi @maxim ,

We have a parent-child workflow. In the child workflow, the task at line 51 failed and succeeded on retry, as seen in the marker record at line 52. But the workflow was marked as failed and the log line below was seen.

Failed workflow task due to unhandled command. This error is likely recoverable. 
 java.lang.RuntimeException: Failure processing workflow task.

Child workflow history

Parent workflow history

The child workflow was marked as timed out. It looks like it was waiting for some external event while this happened. What is the reason to set that child workflow timeout to 5 minutes? Is this business transaction not relevant after a 5-minute delay?

It looks like the parent workflow didn’t handle the child workflow failure, so it failed. The “unhandled command” is benign. It means that the parent workflow decided to complete at the same time the child reported failure. So the parent had to retry the workflow task to take the child workflow failure into account.

The child workflow was marked as timed out.

This is the puzzle and the question: even though the child workflow completed in 27 s, nothing happened and it eventually timed out. As mentioned before, we see the said exception, “Unhandled command”.

What is the reason to set that child workflow timeout to 5 minutes? Is this business transaction not relevant after a 5-minute delay?

Yup, this is the business requirement. Anyway, our workflow mostly completes in under 50 seconds.

The unhandled command happened in the parent workflow, so it didn’t affect the child in any way. My guess is that the child workflow code has a bug that causes it to get stuck. You can try reproducing it in a debugger by downloading the workflow history and replaying it using WorkflowReplayer.

But is there any specific reason why we see the said error, “Unhandled command”?
Is any config change recommended?

If an event (like a signal, activity completion, or child workflow completion) is received while a workflow task that decides to close the workflow (or calls continue-as-new) is executing, then the workflow task result is ignored (by returning “Unhandled Command”) and the task is retried to give the workflow a chance to process the new event. It works like transactional memory that is rolled back to the state before the signal was received.

So the behavior you see is by design and doesn’t require any changes on your side.
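
To make the retry behavior described above more concrete, here is a rough sketch in plain Java. It is a toy model, not Temporal SDK or server code: `ToyServer`, `decide`, `run`, and the event strings are all hypothetical names invented for illustration. It only assumes the rule stated above: a "close workflow" result is rejected if a new event arrived while the workflow task was executing, and the task is then retried with the new event visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only. None of these names come from the Temporal SDK. */
class UnhandledCommandDemo {

    /** Hypothetical stand-in for the service side of one workflow execution. */
    static class ToyServer {
        final List<String> history = new ArrayList<>();
        boolean closed = false;
        String result = null;

        /** Starts a workflow task and returns how many events the worker may see. */
        int startWorkflowTask() {
            return history.size();
        }

        /** An event (signal, child completion, ...) can arrive at any time. */
        void addEvent(String event) {
            history.add(event);
        }

        /**
         * Accepts a close command only if no event arrived after the task started.
         * Otherwise the result is discarded and the caller has to retry the task.
         */
        boolean completeWorkflowTask(int eventsSeen, String closeResult) {
            if (history.size() > eventsSeen) {
                history.add("WorkflowTaskFailed(UnhandledCommand)");
                return false;
            }
            closed = true;
            result = closeResult;
            history.add("WorkflowExecutionCompleted");
            return true;
        }
    }

    /** Toy parent workflow logic: the outcome depends on the events it can see. */
    static String decide(List<String> visibleEvents) {
        return visibleEvents.contains("ChildWorkflowExecutionFailed") ? "FAILED" : "COMPLETED";
    }

    /** Runs workflow tasks until one is accepted, optionally racing an event in. */
    static String run(boolean eventArrivesDuringTask) {
        ToyServer server = new ToyServer();
        server.addEvent("WorkflowExecutionStarted");
        int attempts = 0;
        while (!server.closed) {
            attempts++;
            int seen = server.startWorkflowTask();
            List<String> visible = new ArrayList<>(server.history.subList(0, seen));
            String decision = decide(visible);
            // The child failure lands while the first task is still executing.
            if (eventArrivesDuringTask && attempts == 1) {
                server.addEvent("ChildWorkflowExecutionFailed");
            }
            server.completeWorkflowTask(seen, decision);
        }
        return server.result + " after " + attempts + " task attempt(s)";
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // COMPLETED after 1 task attempt(s)
        System.out.println(run(true));  // FAILED after 2 task attempt(s)
    }
}
```

In the second run the first task's decision to complete is thrown away, and the retried task sees the child failure. That mirrors the parent workflow in this thread: the “unhandled command” itself changed nothing, it only forced the parent to take the child's failure into account.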