Workflow stuck at WorkflowTaskScheduled event (only in one workflow instance, not others)

Hello,

I have come across a case that has me confused. I have a workflow execution with only two events in its history, “WorkflowExecutionStarted” and “WorkflowTaskScheduled”, and it is never executed by the worker. Other workflows of the same type, on the same task queue, are executed correctly.

I found this thread with a similar question, but it doesn’t seem to match my case: the problem only happens with that particular instance.

The history of the workflow (anonymized) is as follows:

[
  {
    "eventId": "1",
    "eventTime": "2023-07-26T10:02:14.048704996Z",
    "eventType": "WorkflowExecutionStarted",
    "version": "0",
    "taskId": "95420416",
    "workerMayIgnore": false,
    "workflowExecutionStartedEventAttributes": {
      "workflowType": {
        "name": "MyWorkflow"
      },
      "parentWorkflowNamespace": "",
      "parentWorkflowNamespaceId": "",
      "parentWorkflowExecution": null,
      "parentInitiatedEventId": "0",
      "taskQueue": {
        "name": "TASK_QUEUE",
        "kind": "Normal"
      },
      "input": {...},
      "workflowExecutionTimeout": "0s",
      "workflowRunTimeout": "0s",
      "workflowTaskTimeout": "10s",
      "continuedExecutionRunId": "",
      "initiator": "Unspecified",
      "continuedFailure": null,
      "lastCompletionResult": null,
      "originalExecutionRunId": "07429236-3dc0-45ba-9078-e9201b5636f5",
      "identity": "1@canary-server-f485b65-kd4x4",
      "firstExecutionRunId": "07429236-3dc0-45ba-9078-e9201b5636f5",
      "retryPolicy": null,
      "attempt": 1,
      "workflowExecutionExpirationTime": null,
      "cronSchedule": "",
      "firstWorkflowTaskBackoff": "0s",
      "memo": null,
      "searchAttributes": null,
      "prevAutoResetPoints": null,
      "header": {...},
      "parentInitiatedEventVersion": "0"
    }
  },
  {
    "eventId": "2",
    "eventTime": "2023-07-26T10:02:14.048773133Z",
    "eventType": "WorkflowTaskScheduled",
    "version": "0",
    "taskId": "95420417",
    "workerMayIgnore": false,
    "workflowTaskScheduledEventAttributes": {
      "taskQueue": {
        "name": "TASK_QUEUE",
        "kind": "Normal"
      },
      "startToCloseTimeout": "10s",
      "attempt": 1
    }
  }
]

And an example history of an instance that does work:

[
  {
    "eventId": "1",
    "eventTime": "2023-07-26T10:36:25.032491081Z",
    "eventType": "WorkflowExecutionStarted",
    "version": "0",
    "taskId": "96468992",
    "workerMayIgnore": false,
    "workflowExecutionStartedEventAttributes": {
      "workflowType": {
        "name": "MyWorkflow"
      },
      "parentWorkflowNamespace": "",
      "parentWorkflowNamespaceId": "",
      "parentWorkflowExecution": null,
      "parentInitiatedEventId": "0",
      "taskQueue": {
        "name": "TASK_QUEUE",
        "kind": "Normal"
      },
      "input": {...},
      "workflowExecutionTimeout": "0s",
      "workflowRunTimeout": "0s",
      "workflowTaskTimeout": "10s",
      "continuedExecutionRunId": "",
      "initiator": "Unspecified",
      "continuedFailure": null,
      "lastCompletionResult": null,
      "originalExecutionRunId": "3848415f-875a-4568-863e-79762b715dbc",
      "identity": "1@canary-f485b65-hdgnc",
      "firstExecutionRunId": "3848415f-875a-4568-863e-79762b715dbc",
      "retryPolicy": null,
      "attempt": 1,
      "workflowExecutionExpirationTime": null,
      "cronSchedule": "",
      "firstWorkflowTaskBackoff": "0s",
      "memo": null,
      "searchAttributes": null,
      "prevAutoResetPoints": null,
      "header": {...},
      "parentInitiatedEventVersion": "0"
    }
  },
  {
    "eventId": "2",
    "eventTime": "2023-07-26T10:36:25.032597464Z",
    "eventType": "WorkflowTaskScheduled",
    "version": "0",
    "taskId": "96468993",
    "workerMayIgnore": false,
    "workflowTaskScheduledEventAttributes": {
      "taskQueue": {
        "name": "TASK_QUEUE",
        "kind": "Normal"
      },
      "startToCloseTimeout": "10s",
      "attempt": 1
    }
  },
  {
    "eventId": "3",
    "eventTime": "2023-07-26T10:36:25.062110385Z",
    "eventType": "WorkflowTaskStarted",
    "version": "0",
    "taskId": "96468998",
    "workerMayIgnore": false,
    "workflowTaskStartedEventAttributes": {
      "scheduledEventId": "2",
      "identity": "server-worker@1@canary-f485b65-hdgnc@anon",
      "requestId": "99715e18-9a9a-4af2-84b0-93c7fc29a898",
      "suggestContinueAsNew": false,
      "historySizeBytes": "599"
    }
  },
  {
    "eventId": "4",
    "eventTime": "2023-07-26T10:36:25.105618185Z",
    "eventType": "WorkflowTaskCompleted",
    "version": "0",
    "taskId": "96469002",
    "workerMayIgnore": false,
    "workflowTaskCompletedEventAttributes": {
      "scheduledEventId": "2",
      "startedEventId": "3",
      "identity": "server-worker@1@canary-f485b65-hdgnc@anon",
      "binaryChecksum": "",
      "workerVersioningId": null,
      "sdkMetadata": null,
      "meteringMetadata": {
        "nonfirstLocalActivityExecutionAttempts": 0
      }
    }
  },
  ...
]

I’ve done some research, but I can’t figure out why this happens and I’ve run out of ideas to test. Can anyone think of what might be going on?

The versions I use are:
Temporal: v1.20.0
Java SDK: 1.19.1

Thank you so much

Which database do you use to run the Temporal service?

I am running Temporal on top of Postgres 13.3.

@maxim do you have any idea why this could happen? It seems that the workflow task is never picked up by a worker, or maybe it fails at an earlier step and is never re-enqueued.
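
To rule out that no worker is actually listening on the queue, the pollers on the workflow task queue can be inspected from the Java SDK. This is just a minimal sketch, assuming a frontend address and namespace that are placeholders, not our real values:

import io.temporal.api.enums.v1.TaskQueueType;
import io.temporal.api.taskqueue.v1.TaskQueue;
import io.temporal.api.workflowservice.v1.DescribeTaskQueueRequest;
import io.temporal.api.workflowservice.v1.DescribeTaskQueueResponse;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class TaskQueuePollersCheck {
  public static void main(String[] args) {
    // Connect to the Temporal frontend (address is a placeholder).
    WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget("temporal-frontend:7233")
            .build());

    // Ask the server which workers have recently polled this workflow task queue.
    DescribeTaskQueueResponse resp = service.blockingStub().describeTaskQueue(
        DescribeTaskQueueRequest.newBuilder()
            .setNamespace("my-namespace")
            .setTaskQueue(TaskQueue.newBuilder().setName("TASK_QUEUE").build())
            .setTaskQueueType(TaskQueueType.TASK_QUEUE_TYPE_WORKFLOW)
            .build());

    // Each PollerInfo carries the worker identity and its last poll time.
    resp.getPollersList().forEach(p ->
        System.out.println(p.getIdentity() + " last polled at " + p.getLastAccessTime()));

    service.shutdown();
  }
}

If the pollers list were empty, the workflow task would simply be waiting for a worker; since other workflows on the same queue do run, the pollers should show up, which points more towards the task itself being lost.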

This looks like a lost task, which can be caused by DB corruption. BTW, how old is the Temporal version you are using? We may have had bugs related to this that have since been fixed.

Hi,

We are using Temporal server v1.20.0, as @7eben mentioned before.

I was reviewing the GitHub issues page and I can’t find any issue related to this.

What is your DB setup? Is it using replication? Is it Aurora or self-hosted? We’ve seen some data loss in certain setups.

We are using Postgres 13.3 (as @7eben mentioned above), deployed in our self-hosted Kubernetes cluster and installed with the Postgres operator provided by Zalando. We have replication enabled as well.

As for replication, we have never seen any data corruption or loss.

We don’t use any cloud resources; everything runs on our own hardware.

In principle I don’t think it’s relevant, but the issue occurred in a scenario where a namespace was first created in Temporal using the Java SDK, and the dangling workflow was started only a few milliseconds after receiving the OK response for the namespace creation. It could be some race condition between Temporal and our DB setup. Luckily for us, this happened in a pre-production environment; the setup is the one @jrpedrianes described in the previous message, and the production environment is similar but with more powerful machines.

I wish I could replicate the issue in local development, debug the Temporal code, and shed more light on this, but the conditions are not the same (dev containers instead of a replicated cluster). If the ideas point in this direction, maybe we should get our hands a bit dirtier and do more research.
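
To make the scenario concrete, the sequence looks roughly like this. This is a simplified sketch, not our actual code: the namespace name, retention period, workflow id and workflow interface are placeholders.

import com.google.protobuf.util.Durations;
import io.temporal.api.workflowservice.v1.RegisterNamespaceRequest;
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

@WorkflowInterface
interface MyWorkflow {
  @WorkflowMethod
  void run();
}

public class NamespaceThenStart {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();

    // 1. Register the namespace and wait for the OK response.
    service.blockingStub().registerNamespace(
        RegisterNamespaceRequest.newBuilder()
            .setNamespace("my-namespace")
            .setWorkflowExecutionRetentionPeriod(Durations.fromDays(7))
            .build());

    // 2. A few milliseconds later, start a workflow in the freshly created namespace.
    WorkflowClient client = WorkflowClient.newInstance(
        service,
        WorkflowClientOptions.newBuilder().setNamespace("my-namespace").build());

    MyWorkflow workflow = client.newWorkflowStub(
        MyWorkflow.class,
        WorkflowOptions.newBuilder()
            .setTaskQueue("TASK_QUEUE")
            .setWorkflowId("my-workflow-id")
            .build());

    WorkflowClient.start(workflow::run);
  }
}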

Since we are concerned about this happening in the live environment, for now our best option may be to spend some time monitoring metrics and running checks against the workflows being started in production.
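
As a first rough idea for that check (just a sketch, with the connection setup and identifiers as placeholders), we could fetch the history of each recently started execution and alert if no WorkflowTaskStarted event shows up after a reasonable delay:

import io.temporal.api.common.v1.WorkflowExecution;
import io.temporal.api.enums.v1.EventType;
import io.temporal.api.workflowservice.v1.GetWorkflowExecutionHistoryRequest;
import io.temporal.api.workflowservice.v1.GetWorkflowExecutionHistoryResponse;
import io.temporal.serviceclient.WorkflowServiceStubs;

public class StuckWorkflowCheck {

  // True if the execution has progressed past WorkflowTaskScheduled,
  // i.e. at least one workflow task was actually started by a worker.
  static boolean firstWorkflowTaskStarted(
      WorkflowServiceStubs service, String namespace, String workflowId) {
    GetWorkflowExecutionHistoryResponse resp = service.blockingStub()
        .getWorkflowExecutionHistory(
            GetWorkflowExecutionHistoryRequest.newBuilder()
                .setNamespace(namespace)
                .setExecution(WorkflowExecution.newBuilder()
                    .setWorkflowId(workflowId)
                    .build())
                .build());

    // The first page of history is enough: the earliest events come first.
    return resp.getHistory().getEventsList().stream()
        .anyMatch(e -> e.getEventType() == EventType.EVENT_TYPE_WORKFLOW_TASK_STARTED);
  }

  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    // In practice this would loop over executions started more than N seconds ago.
    if (!firstWorkflowTaskStarted(service, "my-namespace", "my-workflow-id")) {
      System.out.println("Workflow still stuck at WorkflowTaskScheduled!");
    }
    service.shutdown();
  }
}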

If anything comes up, I’ll write back. Thanks for the effort.