WorkflowTaskFailed: Failure handling event 5 of 'EVENT TYPE ACTIVITY TASK SCHEDULED'

Hi all,
I’m trying to investigate some weird workflow failures that our temporal worker encountered – we have a long-running activity that routinely signals the workflow, and for much of the day, it waits, sending messages every 30 minutes that it is in await mode. This is not done from the activity code but in a python program called by the activity. In testing this has worked perfectly fine, but for some reason, the last time it was ran, it ran into workflow timeout then failures of this kind each time it was signaled:

i  {
"eventTime": "2021-03-15T19:22:09.000Z",
"eventType": "WorkflowTaskFailed",
"eventId": "33",
"details": {
  "scheduledEventId": "30",
  "startedEventId": "32",
  "cause": "WORKFLOW_TASK_FAILED_CAUSE_UNSPECIFIED",
  "failure": {
    "message": "Failure handling event 5 of 'EVENT_TYPE_ACTIVITY_TASK_SCHEDULED' type. IsReplayng=true, PreviousStartedEventId=3, workflowTaskStartedEventId=32, Currently Processing StartedEventId=3",
    "source": "JavaSDK",
    "stackTrace": "io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:193)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:202)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:112)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:309)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:275)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)\njava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\njava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\njava.lang.Thread.run(Thread.java:748)\n",
    "cause": {
      "message": "Event 5 of EVENT_TYPE_ACTIVITY_TASK_SCHEDULED does not match command COMMAND_TYPE_COMPLETE_WORKFLOW_EXECUTION",
      "source": "JavaSDK",
      "stackTrace": "io.temporal.internal.statemachines.WorkflowStateMachines.handleCommandEvent(WorkflowStateMachines.java:263)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:199)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:178)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:202)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:112)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:309)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:275)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)\njava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\njava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\njava.lang.Thread.run(Thread.java:748)\n",
      "cause": null,
      "applicationFailureInfo": {
        "type": "java.lang.IllegalStateException",
        "nonRetryable": false,
        "details": null
      },
      "failureInfo": "applicationFailureInfo"
    },
    "applicationFailureInfo": {
      "type": "io.temporal.internal.replay.InternalWorkflowTaskException",
      "nonRetryable": false,
      "details": null
    },
    "failureInfo": "applicationFailureInfo"
  },
  "identity": "1@singer-worker-dep-577444955c-zfx6d",
  "baseRunId": "",
  "newRunId": "",
  "forkEventVersion": "0",
  "binaryChecksum": "",
  "eventId": "33",
  "eventType": "WorkflowTaskFailed",
  "kvps": [
    {
      "key": "eventTime",
      "value": "Mar 15th 3:22:09 pm"
    },
    {
      "key": "eventId",
      "value": "33"
    },
    {
      "key": "scheduledEventId",
      "value": "30"
    },
    {
      "key": "startedEventId",
      "value": "32"
    },
    {
      "key": "cause",
      "value": "WORKFLOW_TASK_FAILED_CAUSE_UNSPECIFIED"
    },
    {
      "key": "failure",
      "value": "io.temporal.internal.replay.InternalWorkflowTaskException: Failure handling event 5 of 'EVENT_TYPE_ACTIVITY_TASK_SCHEDULED' type. IsReplayng=true, PreviousStartedEventId=3, workflowTaskStartedEventId=32, Currently Processing StartedEventId=3 \nio.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:193)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:202)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:112)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:309)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:275)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)\njava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\njava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\njava.lang.Thread.run(Thread.java:748)\n\nCaused By: java.lang.IllegalStateException: Event 5 of EVENT_TYPE_ACTIVITY_TASK_SCHEDULED does not match command COMMAND_TYPE_COMPLETE_WORKFLOW_EXECUTION \nio.temporal.internal.statemachines.WorkflowStateMachines.handleCommandEvent(WorkflowStateMachines.java:263)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEventImpl(WorkflowStateMachines.java:199)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:178)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleEvent(ReplayWorkflowRunTaskHandler.java:140)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:180)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:150)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithEmbeddedQuery(ReplayWorkflowTaskHandler.java:202)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:112)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:309)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:275)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73)\njava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\njava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\njava.lang.Thread.run(Thread.java:748)\n"
    },
    {
      "key": "identity",
      "value": "1@singer-worker-dep-577444955c-zfx6d"
    },
    {
      "key": "baseRunId",
      "value": ""
    },
    {
      "key": "newRunId",
      "value": ""
    },
    {
      "key": "forkEventVersion",
      "value": "0"
    },
    {
      "key": "binaryChecksum",
      "value": ""
    }
  ],
  "eventTime": "Mar 15th 3:22:09 pm"
},

When I try to run the workflow again, the signal is successfully received and there are no failures or timeouts. I’m not entirely clear on what is the cause of failure here so any guidance is appreciated. Thanks!

Are you 100% sure that your workflow code is deterministic? Would you double-check if it complies with these constraints?

Could you post the workflow execution history? You can DM it to me if you don’t want to do it publicly.

I am sure that our workflow code is deterministic, I just checked again. I will DM you the history @maxim, thank you!

After reviewing the workflow code we found that it indeed wasn’t deterministic. The culprit is using system time instead of the workflow time:

// BUG! Non deterministic time handling!
Workflow.sleep(Duration.between(Instant.now(), singerConfig.schedule.startTime));

The correct version is:

Workflow.sleep(Duration.between(Workflow.currentTimeMillis(), singerConfig.schedule.startTime));