Workflow stuck at WorkflowTaskScheduled event (only in one workflow instance, not others)

Hello,

I have come across a case that has me confused. I have a workflow execution with only two events in its history, “WorkflowExecutionStarted” and “WorkflowTaskScheduled”, and it is never executed by the worker. Other workflows of the same type, on the same task queue, are executed correctly.

I found this thread with a similar question, but it doesn’t seem to match my case: the problem only happens with that particular instance.

The history of the workflow (anonymized) is as follows:

[
  {
    "eventId": "1",
    "eventTime": "2023-07-26T10:02:14.048704996Z",
    "eventType": "WorkflowExecutionStarted",
    "version": "0",
    "taskId": "95420416",
    "workerMayIgnore": false,
    "workflowExecutionStartedEventAttributes": {
      "workflowType": {
        "name": "MyWorkflow"
      },
      "parentWorkflowNamespace": "",
      "parentWorkflowNamespaceId": "",
      "parentWorkflowExecution": null,
      "parentInitiatedEventId": "0",
      "taskQueue": {
        "name": "TASK_QUEUE",
        "kind": "Normal"
      },
      "input": {...},
      "workflowExecutionTimeout": "0s",
      "workflowRunTimeout": "0s",
      "workflowTaskTimeout": "10s",
      "continuedExecutionRunId": "",
      "initiator": "Unspecified",
      "continuedFailure": null,
      "lastCompletionResult": null,
      "originalExecutionRunId": "07429236-3dc0-45ba-9078-e9201b5636f5",
      "identity": "1@canary-server-f485b65-kd4x4",
      "firstExecutionRunId": "07429236-3dc0-45ba-9078-e9201b5636f5",
      "retryPolicy": null,
      "attempt": 1,
      "workflowExecutionExpirationTime": null,
      "cronSchedule": "",
      "firstWorkflowTaskBackoff": "0s",
      "memo": null,
      "searchAttributes": null,
      "prevAutoResetPoints": null,
      "header": {...},
      "parentInitiatedEventVersion": "0"
    }
  },
  {
    "eventId": "2",
    "eventTime": "2023-07-26T10:02:14.048773133Z",
    "eventType": "WorkflowTaskScheduled",
    "version": "0",
    "taskId": "95420417",
    "workerMayIgnore": false,
    "workflowTaskScheduledEventAttributes": {
      "taskQueue": {
        "name": "TASK_QUEUE",
        "kind": "Normal"
      },
      "startToCloseTimeout": "10s",
      "attempt": 1
    }
  }
]

And an example history of an instance that does work:

[
  {
    "eventId": "1",
    "eventTime": "2023-07-26T10:36:25.032491081Z",
    "eventType": "WorkflowExecutionStarted",
    "version": "0",
    "taskId": "96468992",
    "workerMayIgnore": false,
    "workflowExecutionStartedEventAttributes": {
      "workflowType": {
        "name": "MyWorkflow"
      },
      "parentWorkflowNamespace": "",
      "parentWorkflowNamespaceId": "",
      "parentWorkflowExecution": null,
      "parentInitiatedEventId": "0",
      "taskQueue": {
        "name": "TASK_QUEUE",
        "kind": "Normal"
      },
      "input": {...},
      "workflowExecutionTimeout": "0s",
      "workflowRunTimeout": "0s",
      "workflowTaskTimeout": "10s",
      "continuedExecutionRunId": "",
      "initiator": "Unspecified",
      "continuedFailure": null,
      "lastCompletionResult": null,
      "originalExecutionRunId": "3848415f-875a-4568-863e-79762b715dbc",
      "identity": "1@canary-f485b65-hdgnc",
      "firstExecutionRunId": "3848415f-875a-4568-863e-79762b715dbc",
      "retryPolicy": null,
      "attempt": 1,
      "workflowExecutionExpirationTime": null,
      "cronSchedule": "",
      "firstWorkflowTaskBackoff": "0s",
      "memo": null,
      "searchAttributes": null,
      "prevAutoResetPoints": null,
      "header": {...},
      "parentInitiatedEventVersion": "0"
    }
  },
  {
    "eventId": "2",
    "eventTime": "2023-07-26T10:36:25.032597464Z",
    "eventType": "WorkflowTaskScheduled",
    "version": "0",
    "taskId": "96468993",
    "workerMayIgnore": false,
    "workflowTaskScheduledEventAttributes": {
      "taskQueue": {
        "name": "TASK_QUEUE",
        "kind": "Normal"
      },
      "startToCloseTimeout": "10s",
      "attempt": 1
    }
  },
  {
    "eventId": "3",
    "eventTime": "2023-07-26T10:36:25.062110385Z",
    "eventType": "WorkflowTaskStarted",
    "version": "0",
    "taskId": "96468998",
    "workerMayIgnore": false,
    "workflowTaskStartedEventAttributes": {
      "scheduledEventId": "2",
      "identity": "server-worker@1@canary-f485b65-hdgnc@anon",
      "requestId": "99715e18-9a9a-4af2-84b0-93c7fc29a898",
      "suggestContinueAsNew": false,
      "historySizeBytes": "599"
    }
  },
  {
    "eventId": "4",
    "eventTime": "2023-07-26T10:36:25.105618185Z",
    "eventType": "WorkflowTaskCompleted",
    "version": "0",
    "taskId": "96469002",
    "workerMayIgnore": false,
    "workflowTaskCompletedEventAttributes": {
      "scheduledEventId": "2",
      "startedEventId": "3",
      "identity": "server-worker@1@canary-f485b65-hdgnc@anon",
      "binaryChecksum": "",
      "workerVersioningId": null,
      "sdkMetadata": null,
      "meteringMetadata": {
        "nonfirstLocalActivityExecutionAttempts": 0
      }
    }
  },
  ...
]

I’ve done some research, but I can’t figure out why this happens and I’ve run out of ideas to test. Can anyone think of what might be going on?

The versions I use are:
Temporal: v1.20.0
Java SDK: 1.19.1

Thank you so much

Which database do you use to run the Temporal service?

I am running Temporal on top of Postgres 13.3.

@maxim do you have any idea why this could happen? It seems that the workflow task is never picked up by a worker, or maybe it fails at an earlier step and is never re-enqueued.
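
To rule out that no worker is actually listening on the queue, the pollers on the workflow task queue can be inspected from the Java SDK. This is just a minimal sketch, assuming a frontend address and namespace that are placeholders, not our real values:

import io.temporal.api.enums.v1.TaskQueueType;
import io.temporal.api.taskqueue.v1.TaskQueue;
import io.temporal.api.workflowservice.v1.DescribeTaskQueueRequest;
import io.temporal.api.workflowservice.v1.DescribeTaskQueueResponse;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.serviceclient.WorkflowServiceStubsOptions;

public class TaskQueuePollersCheck {
  public static void main(String[] args) {
    // Connect to the Temporal frontend (address is a placeholder).
    WorkflowServiceStubs service = WorkflowServiceStubs.newServiceStubs(
        WorkflowServiceStubsOptions.newBuilder()
            .setTarget("temporal-frontend:7233")
            .build());

    // Ask the server which workers have recently polled this workflow task queue.
    DescribeTaskQueueResponse resp = service.blockingStub().describeTaskQueue(
        DescribeTaskQueueRequest.newBuilder()
            .setNamespace("my-namespace")
            .setTaskQueue(TaskQueue.newBuilder().setName("TASK_QUEUE").build())
            .setTaskQueueType(TaskQueueType.TASK_QUEUE_TYPE_WORKFLOW)
            .build());

    // Each PollerInfo carries the worker identity and its last poll time.
    resp.getPollersList().forEach(p ->
        System.out.println(p.getIdentity() + " last polled at " + p.getLastAccessTime()));

    service.shutdown();
  }
}

If the pollers list were empty, the workflow task would simply be waiting for a worker; since other workflows on the same queue do run, the pollers should show up, which points more towards the task itself being lost.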

This looks like a lost task, which can be caused by DB corruption. BTW, how old is the Temporal version you are using? We may have had bugs related to this that have since been fixed.

Hi,

We are using Temporal server v1.20.0, as @7eben mentioned before.

I was reviewing the GitHub issues page and I can’t find any issue related to this.

What is your DB setup? Is it using replication? Is it Aurora or self-hosted? We’ve seen some data loss in certain setups.

We are using Postgres 13.3 (as @7eben mentioned above), deployed in our self-hosted Kubernetes cluster and installed with the Postgres operator provided by Zalando. We have replication enabled as well.

As for replication, we have never seen any data corruption or loss.

We don’t use any cloud resources; everything runs on our own hardware.

In principle I don’t think it’s relevant, but the issue occurred in a scenario where a namespace was first created in Temporal using the Java SDK, and the dangling workflow was started only a few milliseconds after receiving the OK response for the namespace creation. It could be some race condition between Temporal and our DB setup. Luckily for us, this happened in a pre-production environment; the setup is the one @jrpedrianes described in the previous message, and the production environment is similar but with more powerful machines.

I wish I could replicate the issue in local development, debug the Temporal code, and shed more light on this, but the conditions are not the same (dev containers instead of a replicated cluster). If the ideas point in this direction, maybe we should get our hands a bit dirtier and do more research.
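
To make the scenario concrete, the sequence looks roughly like this. This is a simplified sketch, not our actual code: the namespace name, retention period, workflow id and workflow interface are placeholders.

import com.google.protobuf.util.Durations;
import io.temporal.api.workflowservice.v1.RegisterNamespaceRequest;
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowClientOptions;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

@WorkflowInterface
interface MyWorkflow {
  @WorkflowMethod
  void run();
}

public class NamespaceThenStart {
  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();

    // 1. Register the namespace and wait for the OK response.
    service.blockingStub().registerNamespace(
        RegisterNamespaceRequest.newBuilder()
            .setNamespace("my-namespace")
            .setWorkflowExecutionRetentionPeriod(Durations.fromDays(7))
            .build());

    // 2. A few milliseconds later, start a workflow in the freshly created namespace.
    WorkflowClient client = WorkflowClient.newInstance(
        service,
        WorkflowClientOptions.newBuilder().setNamespace("my-namespace").build());

    MyWorkflow workflow = client.newWorkflowStub(
        MyWorkflow.class,
        WorkflowOptions.newBuilder()
            .setTaskQueue("TASK_QUEUE")
            .setWorkflowId("my-workflow-id")
            .build());

    WorkflowClient.start(workflow::run);
  }
}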

Since we are concerned about this happening in the live environment, for now our best option may be to spend some time monitoring metrics and running checks against the workflows being started in production.
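
As a first rough idea for that check (just a sketch, with the connection setup and identifiers as placeholders), we could fetch the history of each recently started execution and alert if no WorkflowTaskStarted event shows up after a reasonable delay:

import io.temporal.api.common.v1.WorkflowExecution;
import io.temporal.api.enums.v1.EventType;
import io.temporal.api.workflowservice.v1.GetWorkflowExecutionHistoryRequest;
import io.temporal.api.workflowservice.v1.GetWorkflowExecutionHistoryResponse;
import io.temporal.serviceclient.WorkflowServiceStubs;

public class StuckWorkflowCheck {

  // True if the execution has progressed past WorkflowTaskScheduled,
  // i.e. at least one workflow task was actually started by a worker.
  static boolean firstWorkflowTaskStarted(
      WorkflowServiceStubs service, String namespace, String workflowId) {
    GetWorkflowExecutionHistoryResponse resp = service.blockingStub()
        .getWorkflowExecutionHistory(
            GetWorkflowExecutionHistoryRequest.newBuilder()
                .setNamespace(namespace)
                .setExecution(WorkflowExecution.newBuilder()
                    .setWorkflowId(workflowId)
                    .build())
                .build());

    // The first page of history is enough: the earliest events come first.
    return resp.getHistory().getEventsList().stream()
        .anyMatch(e -> e.getEventType() == EventType.EVENT_TYPE_WORKFLOW_TASK_STARTED);
  }

  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    // In practice this would loop over executions started more than N seconds ago.
    if (!firstWorkflowTaskStarted(service, "my-namespace", "my-workflow-id")) {
      System.out.println("Workflow still stuck at WorkflowTaskScheduled!");
    }
    service.shutdown();
  }
}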

If anything comes up, I’ll write back. Thanks for the effort.