Cause of "No command scheduled for event" non-determinism error?

In one of our workflows we are getting this non-determinism error that I don’t understand:

{
  "message": "Nondeterminism(\"No command scheduled for event HistoryEvent(id: 1204, Some(StartChildWorkflowExecutionInitiated))\")",
  "source": "",
  "stackTrace": "",
  "encodedAttributes": null,
  "cause": null,
  "applicationFailureInfo": {
    "type": "",
    "nonRetryable": false,
    "details": null
  }
}

What does it mean that there is no command scheduled for an event?

The workflow ran for less than 30 mins and there was definitely no incompatible workflow code change during that time period.

Here are the relevant events from the event history, including the event with the ID 1204 and other events referencing that ID:

    {
      "eventId": "1204",
      "eventTime": "2023-07-02T11:37:45.351883423Z",
      "eventType": "StartChildWorkflowExecutionInitiated",
      "version": "0",
      "taskId": "20987162",
      "workerMayIgnore": false,
      "startChildWorkflowExecutionInitiatedEventAttributes": {
        "namespace": "default",
        "namespaceId": "f843e2f7-30fc-495e-ae28-7a0c1ff72ea2",
        "workflowId": "inspection-core-run_5ec3d230-f0f9-46ea-bfca-839faeb343d7_0457713d-1807-43b3-b72c-f8ad3fa502d7_bf71a66d-6d94-4bb5-be0f-45ad073b8d3f/e12c1207-d7f8-4144-b188-628422e88317_d53bf124-6dc6-4c27-8484-dab71cb57cd2",
        "workflowType": {
          "name": "inspect"
        },
        "taskQueue": {
          "name": "inspection-workflow/us-ea-srq9",
          "kind": "Normal"
        },
        "input": {
          "payloads": [ ]
        },
        "workflowExecutionTimeout": null,
        "workflowRunTimeout": "0s",
        "workflowTaskTimeout": "120s",
        "parentClosePolicy": "Terminate",
        "control": "",
        "workflowTaskCompletedEventId": "1190",
        "workflowIdReusePolicy": "AllowDuplicate",
        "retryPolicy": null,
        "cronSchedule": "",
        "header": {
          "fields": {}
        },
        "memo": {
          "fields": {}
        },
        "searchAttributes": {
          "indexedFields": {}
        }
      }
    },
    {
      "eventId": "1246",
      "eventTime": "2023-07-02T11:37:45.661675656Z",
      "eventType": "ChildWorkflowExecutionStarted",
      "version": "0",
      "taskId": "20987233",
      "workerMayIgnore": false,
      "childWorkflowExecutionStartedEventAttributes": {
        "namespace": "default",
        "namespaceId": "f843e2f7-30fc-495e-ae28-7a0c1ff72ea2",
        "initiatedEventId": "1204",
        "workflowExecution": {
          "workflowId": "inspection-core-run_5ec3d230-f0f9-46ea-bfca-839faeb343d7_0457713d-1807-43b3-b72c-f8ad3fa502d7_bf71a66d-6d94-4bb5-be0f-45ad073b8d3f/e12c1207-d7f8-4144-b188-628422e88317_d53bf124-6dc6-4c27-8484-dab71cb57cd2",
          "runId": "1714286f-630f-4686-91c9-cc656eaceb5c"
        },
        "workflowType": {
          "name": "inspect"
        },
        "header": {
          "fields": {}
        }
      }
    },
    {
      "eventId": "1404",
      "eventTime": "2023-07-02T11:41:00.943727941Z",
      "eventType": "ChildWorkflowExecutionCompleted",
      "version": "0",
      "taskId": "20987637",
      "workerMayIgnore": false,
      "childWorkflowExecutionCompletedEventAttributes": {
        "result": {
          "payloads": [ ]
        },
        "namespace": "default",
        "namespaceId": "f843e2f7-30fc-495e-ae28-7a0c1ff72ea2",
        "workflowExecution": {
          "workflowId": "inspection-core-run_5ec3d230-f0f9-46ea-bfca-839faeb343d7_0457713d-1807-43b3-b72c-f8ad3fa502d7_bf71a66d-6d94-4bb5-be0f-45ad073b8d3f/e12c1207-d7f8-4144-b188-628422e88317_d53bf124-6dc6-4c27-8484-dab71cb57cd2",
          "runId": "1714286f-630f-4686-91c9-cc656eaceb5c"
        },
        "workflowType": {
          "name": "inspect"
        },
        "initiatedEventId": "1204",
        "startedEventId": "1246"
      }
    },
    {
      "eventId": "1556",
      "eventTime": "2023-07-02T11:45:48.174443934Z",
      "eventType": "WorkflowTaskFailed",
      "version": "0",
      "taskId": "20988052",
      "workerMayIgnore": false,
      "workflowTaskFailedEventAttributes": {
        "scheduledEventId": "1554",
        "startedEventId": "1555",
        "cause": "NonDeterministicError",
        "failure": {
          "message": "Nondeterminism(\"No command scheduled for event HistoryEvent(id: 1204, Some(StartChildWorkflowExecutionInitiated))\")",
          "source": "",
          "stackTrace": "",
          "encodedAttributes": null,
          "cause": null,
          "applicationFailureInfo": {
            "type": "",
            "nonRetryable": false,
            "details": null
          }
        },
        "identity": "8@inspection-workflow-698b5df6d4-kq4jg",
        "baseRunId": "",
        "newRunId": "",
        "forkEventVersion": "0",
        "binaryChecksum": "@temporalio/worker@1.4.4+215cf040ff09e11dc911b5abd1971ab145947d006965a0bbcaf9dfe3610c64a5"
      }
    },

In total, there are 15 WorkflowTaskFailed events all referencing the 1204 event ID.

I didn’t find any documentation on this specific non-determinism error. I found that it can be raised by the handle_command_event function in the server. But I don’t understand how a None value would get pushed to the queue.

This means that the first time the workflow ran it requested to start a child workflow but when it was replayed it did not generate any new commands (e.g. timers, activities, child workflows, send signal).

Typically non determinism happens during a deployment of incompatible workflow code.
Possible reasons for this are:

  1. We’ve fixed some issues in the 1.7.4 release fixing some edge cases that might lead to non deterministic behavior. From the history, looks like you’re running 1.4.4 which we know has issues.
  2. Are you mutating workflow state from query handlers? This may cause non deterministic behavior as query execution is not recorded in workflow history.
  3. Can you confirm that there are no old workers around polling the task queue? You may be able to see if multiple worker versions have processed this workflow by checking the binaryChecksum on the WorkflowTaskCompleted events.

During the time period during which the workflow ran there were no code deployments to any of the workers. There are also no query handlers for this workflow. All the checksums are identical so there were no old worker versions, either.

But your point about the worker version (1.4.4) is interesting. I had not realized that this particular worker is running such an old version. All of our other workers are on version 1.7.2 currently. I’ll get this worker updated to the latest version as well.

But this part is still not quite clear to me:

This means that the first time the workflow ran it requested to start a child workflow but when it was replayed it did not generate any new commands (e.g. timers, activities, child workflows, send signal).

At the beginning of this workflow there is an activity that fetches some data from S3 and executes a bunch of child workflows based on that data. But since activity results are cached once the activity has completed successfully, I would assume that the same child workflows get created every time the parent workflow gets replayed. According to the event history, all the child workflows had already been completed successfully as well, when this non-deterministic error occurred. So I’m still a bit puzzled as to what actually happened, unless it’s literally a bug with the older worker version.

But this part is still not quite clear to me:

According to the event history, all the child workflows had already been completed successfully as well, when this non-deterministic error occurred. So I’m still a bit puzzled as to what actually happened, unless it’s literally a bug with the older worker version.

If your Workflow code hasn’t changed in that time period, then it is very likely that you hit an SDK bug. We fixed a few bugs that in the 1.5.x to 1.7.4 period that had the potential of resulting in non-deterministic behaviors.

I can’t tell for sure exactly which of these bugs have caused the specific error you observed (or at least, not without looking at your workflow code and/or full workflow history), but in the end, it resulted in a different sequence of commands being executed while replaying the workflow history vs when the Workflow code was first executed.

Thanks @bergundy & @jwatkins for your quick responses! We will be updating the worker version on priority.