How to identify what was failing when "Failure processing workflow task" happens

We see error log like this

java.lang.RuntimeException: Failure processing workflow task. WorkflowId=3f22bd16-1e38-3134-9381-93bb15ccc4ee, RunId=fadced22-7be5-4b59-b8c1-c7badb761032
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:342)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:280)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:79)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: UnhandledCommand
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.respondWorkflowTaskCompleted(WorkflowServiceGrpc.java:2667)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:376)
	at io.temporal.internal.common.GrpcRetryer.lambda$retry$0(GrpcRetryer.java:109)
	at io.temporal.internal.common.GrpcRetryer.retryWithResult(GrpcRetryer.java:127)
	at io.temporal.internal.common.GrpcRetryer.retry(GrpcRetryer.java:106)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:369)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:318)
	at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:280)
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)
	... 3 common frames omitted

We found the workflow and it seems to have recovered. (Side question: is it possible to make these not ERROR log? Our system pages when see ERROR log)

This happens frequently. We feel there might be some bug. How do we figure out what was the failed task trying to do? How do we map it to the line of code?

Looking at it from the dashboard all we can do is guess that the workflow task was going to close the workflow but a signal came in before it was closed. But we are really not sure.

Thank you very much!

Our @workflowmethod and @signalmethod

  @Override
  public void start(String orderId) {
    this.orderId = orderId;

    // TODO: Change from config in code to read from DB
    Map<String, Object> config = Config.getConfig();
    Duration maxDuration = (Duration) config.get("newStatusMaxDuration");
    LOGGER.info(
        String.format(
            "Workflow started for order %s to check if it passed new status (new, onHold and toBeVaidated) in time. maxDuration: %s",
            orderId, maxDuration.toString().substring(2).toLowerCase()));

    Workflow.await(maxDuration, () -> this.passed);
    if (this.passed) {
      LOGGER.info(
          String.format(
              "Order %s passed new status (new, onHold and toBeVaidated) before fulfilledMaxDuration: %s",
              orderId, maxDuration.toString()));
      return;
    }
    String message =
        String.format(
            "Order %s did not passed new status (new, onHold and toBeVaidated) after fulfilledMaxDuration: %s",
            orderId, maxDuration.toString().substring(2).toLowerCase());
    LOGGER.error(message);
    //    act.error(NewStatusWorkflowImpl.class.getName(), message);
  }

  @Override
  public void updateStatus(String status) {
    LOGGER.info(
        String.format(
            "Workflow for order %s received signal to update status to: %s", this.orderId, status));
    if (!NEW_ORDER_STATUS.contains(status)) {
      this.passed = true;
    }
  }

This is not a bug. This error is used to ensure that signals are not lost due to race condition with workflow completion command. Would you file an issue to change the log level?

This topic has some info on the details.

Thank you @maxim
I will file an issue on Github
I still have the question of looking at the Workflow task failure from the UI or the log, how do I know what was the task? Which line of code was it trying to execute?

@maxim Any thoughts on the general question that I asked? Looking at the UI/log for a workflow task failure, how to identify the line of code the task was trying to run?

The unhandled command is not associated with any particular line in the workflow as it is caused by workflow completion. So talking about a particular line that caused workflow completion doesn’t make sense.

You are right. What I asked does not make sense.
Will every ‘unhandled command’ means the task was trying to complete the workflow? If that is the case, would it be better to change the ‘unhandled command’ to ‘complete workflow failed’?
If there are other cases for ‘unhandled command’, what could they be?

Naming is hard :). At this point we are not going to make any backward incompatible changes to APIs if possible.

If there are other cases for ‘unhandled command’, what could they be?

Any event that was added to the history while the workflow task that decided to close workflow was runnig.