I’ve been evaluating Temporal and doing some performance tests. I’m using Postgres for persistence, and at one point it started experiencing a lot of I/O pressure, resulting in high persistence latency as reported by Temporal’s persistence_latency metric (the average jumped to >2s).
In the logs from the Temporal history pods, I could see a lot of timeouts and context deadline exceeded errors. Everything became extremely slow; the 95th percentile of task_schedule_to_start_latency hit 1000s, at which point it was clipped. However, this is probably something I should expect given that the DB was very slow.
But there are some workflows that got stuck during that time. They were created 3 days ago, and their only event is still “WorkflowExecutionStarted” (and “Attempt” is 1).
I’m wondering how to handle these. Is there some way to give them a push so that they start executing?
Hi,
Regarding the stuck workflows, please answer the following questions so we can help you better:
1- Are you using Worker Versioning with Build IDs? (It helps if you paste the worker options that you use - see the sketch right after these questions for the kind of snippet I mean.)
2- You mentioned that the only event is “WorkflowExecutionStarted” - can you confirm whether you are also seeing a “WorkflowTaskStarted” event, or is “WorkflowExecutionStarted” really the only event?
Please also provide your server version and DB setup. Note that we only support PostgreSQL with sync replication between replicas.
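For reference, in case you happen to be using the Go SDK, Worker Versioning with Build IDs is enabled through the worker options; the snippet below is only a sketch of the kind of configuration I’d like to see (the host/port, task queue name, and Build ID are placeholders, and the field names may differ in other SDKs):

package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the Temporal frontend (address is a placeholder).
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	// These two options control Build-ID-based Worker Versioning in the Go SDK;
	// if they are unset, versioning is not in play for this worker.
	w := worker.New(c, "queue", worker.Options{
		BuildID:                 "my-build-id", // placeholder Build ID
		UseBuildIDForVersioning: true,
	})

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to start worker", err)
	}
}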
- I’m not using worker versioning (yet).
- “WorkflowExecutionStarted” is the only event I can see. I’m attaching a screenshot from the UI - note that the workflow history hasn’t changed for a couple days.
The server version I’m using is 1.24.2. PostgreSQL isn’t using sync replication.
Did you fail over your PostgreSQL DB, manually or automatically?
I didn’t fail it over at all. It’s not a production setup; I’ve just been testing Temporal by creating increasingly large batches of workflows. The DB just became slow due to high I/O pressure for a certain period of time.
BTW, since this is just a test setup, it doesn’t contain any sensitive information, so I can provide a dump of the database.
The highest metric bucket for latency metrics is 1000s by default. I think that’s why task_schedule_to_start_latency got capped there.
Re. the stuck workflow: a workflow contains only a WorkflowExecutionStarted event if it’s a cron or being retried. Since the attempt is 1 here, it should be a cron workflow.
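(By “cron” I mean a workflow started with a cron schedule. If you’re using the Go SDK, that looks roughly like the minimal sketch below; the client options, IDs, and schedule here are placeholders.)

package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	// For a cron workflow the run is created immediately, but its first
	// workflow task is deferred to the next cron fire time, so until then
	// the history contains only WorkflowExecutionStarted.
	_, err = c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
		ID:           "my-cron-workflow",
		TaskQueue:    "queue",
		CronSchedule: "*/5 * * * *",
	}, "Workflow1")
	if err != nil {
		log.Fatalln("unable to start workflow", err)
	}
}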
Can you share the details of the startEvent? Event details are available in the JSON tab in the web UI. Thanks.
It’s below. I’ve slightly redacted the names/IDs as they were referencing some business-related stuff.
{
  "eventId": "1",
  "eventTime": "2024-06-21T17:25:33.130175516Z",
  "eventType": "WorkflowExecutionStarted",
  "taskId": "66060288",
  "workflowExecutionStartedEventAttributes": {
    "workflowType": {
      "name": "Workflow1"
    },
    "parentWorkflowNamespace": "default",
    "parentWorkflowNamespaceId": "60b4bede-651a-4f68-838b-6dacb5a9c56a",
    "parentWorkflowExecution": {
      "workflowId": "fbe9779c-0228-4f64-920b-cfc504124bad",
      "runId": "cb271c3b-05b4-434d-9113-ea55794dde40"
    },
    "parentInitiatedEventId": "8",
    "taskQueue": {
      "name": "queue",
      "kind": "TASK_QUEUE_KIND_NORMAL"
    },
    "input": {
      "payloads": [
        {
          "workflow_id": "fbe9779c-0228-4f64-920b-cfc504124bad",
          "organization_id": "org-139",
          "service_name": "action",
          "pool": "action",
          "priority": 3
        }
      ]
    },
    "workflowRunTimeout": "0s",
    "workflowTaskTimeout": "10s",
    "originalExecutionRunId": "49c952d5-adf7-4656-b6b1-210daff660e8",
    "firstExecutionRunId": "49c952d5-adf7-4656-b6b1-210daff660e8",
    "attempt": 1,
    "firstWorkflowTaskBackoff": "0s",
    "memo": {},
    "searchAttributes": {},
    "header": {},
    "workflowId": "fbe9779c-0228-4f64-920b-cfc504124bad",
    "rootWorkflowExecution": {
      "workflowId": "864dbf38-24ff-4b50-9853-4f57846a078b",
      "runId": "8a424173-f19c-4a9c-a411-27d2433d1279"
    }
  }
}
Ah I see, it’s a child workflow, not a cron.
The first event suggests that its parent workflow has workflowID fbe9779c-0228-4f64-920b-cfc504124bad and runID cb271c3b-05b4-434d-9113-ea55794dde40.
But from your screenshot, this stuck child workflow also has the same workflowID fbe9779c-0228-4f64-920b-cfc504124bad (its runID 49c952d5-adf7-4656-b6b1-210daff660e8 is different).
Can you confirm that’s correct and that the parent is using the same workflowID when starting the child workflow?
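(If you’re using the Go SDK, I mean the WorkflowID that the parent sets on the child workflow options before starting the child, roughly like the sketch below; the names here are placeholders.)

// Minimal Go SDK sketch (placeholder names): the parent decides the child's
// workflowID via ChildWorkflowOptions before calling ExecuteChildWorkflow.
package parent

import (
	"go.temporal.io/sdk/workflow"
)

func ParentWorkflow(ctx workflow.Context) error {
	info := workflow.GetInfo(ctx)

	cwo := workflow.ChildWorkflowOptions{
		// If this reuses the parent's own workflowID, the parent and child
		// would end up with the same ID, which is what I want to rule out.
		WorkflowID: info.WorkflowExecution.ID,
		TaskQueue:  "queue",
	}
	ctx = workflow.WithChildOptions(ctx, cwo)

	// Wait for the child to complete; "Workflow1" is the child workflow type.
	return workflow.ExecuteChildWorkflow(ctx, "Workflow1").Get(ctx, nil)
}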
Would you mind also providing the history for the parent workflow (starting from eventID 8, which should be the childWorkflowInitiatedEvent)? JSON would be ideal. If that’s too long, the default history view should also work, but I will need the details for any event related to this child (event 8, the initiated event, and, if it exists, the child workflow started event).
I have a guess re. what happened here, but need to check ^ info to confirm my theory.
Thanks,
Yichao
The parent and child workflows actually have different IDs; this is due to my redacting the data before publishing it - sorry for the confusion. The parent workflow ID is fbe9779c-0228-4f64-920b-cfc504124bad and the child workflow ID is (prefix)fbe9779c-0228-4f64-920b-cfc504124bad. I decided to remove (prefix) before posting it publicly as it contained an internal company reference.
I will send the parent history through a PM.
Thanks for PMing me the history.
The parent and child having different workflowIDs means my previous theory was wrong. I can see that there’s no child workflow started event in the parent’s history, which is not supposed to happen, and I have actually never seen this issue before.
You should be able to unblock the workflows by running the following command; please let me know whether it works or not.
tdbg -n <namespace name> workflow refresh-tasks --wid <parent workflowID> --rid <parent runID>
Before running that command though, please also dump the following information.
tdbg -n <namespace name> workflow describe --wid <parent workflowID> --rid <parent runID>
tdbg -n <namespace name> workflow describe --wid <child workflowID> --rid <child runID>
The output of the above describe command should contain a shard ID. Then run
tdbg shard describe --shard-id <parent shardID>
If you still have access to the server logs, please search for Fail to process task around the time of the child workflow initiated events. Look for logs for the parent workflow and the startChildWorkflowTask type.
If you see the following messages for the parent workflow, please also let me know.
Task Processor: task event ID >= MS NextEventID, skip.
Transfer Task Processor: workflow mutable state not found, skip.
Thanks,
Yichao
Sorry for the late reply and thank you for sending me the detailed mutable state and shard info dump.
Those dumps actually look good to me. In this case I think the only reasonable explanation is that the transfer task for starting the child workflow got DLQed after creating the child but before recording childWorkflowStarted in the parent’s history. You mentioned persistence latency was high at that time, so that may have caused the task to keep failing due to timeouts and eventually go to the DLQ. I also verified that the History task DLQ is enabled by default in server version v1.24.2.
We can confirm if that’s actually the case with tdbg. There’s a runbook for that. But here’s the TLDR:
- Run tdbg dlq --dlq-version v2 list and see if there’s any queue name starting with 1_. If so, it means there are some DLQ messages for the transfer queue.
- Get those transfer queue DLQ messages with tdbg dlq --dlq-version v2 read --dlq-type 1 and see if there’s any message for the parent workflow with task_type: TASK_TYPE_TRANSFER_START_CHILD_EXECUTION.
If you did find the message, that explains why the child got stuck, and you can use the merge command to merge the task back; or, if you have already unblocked the child with refresh-tasks, the purge command can delete the DLQ message.
If there’s no such DLQ message, please let me know.
Thanks,
Yichao