I’ve been evaluating Temporal and doing some performance tests. I’m using Postgres for persistence, and at one point it started experiencing a lot of I/O pressure, resulting in high persistence latency as reported by Temporal’s persistence_latency metric (the average jumped to >2s).
In the logs from the Temporal history pods, I could see a lot of timeouts and context deadline exceeded errors. Everything became extremely slow; the 95th percentile of task_schedule_to_start_latency hit 1000s, at which point it was clipped. However, this is probably something I should expect given that the DB was very slow.
But there are some workflows that got stuck during that time. They were created 3 days ago, and their only event is still “WorkflowExecutionStarted” (and “Attempt” is 1).
I’m wondering how to handle these. Is there some way to give them a push so that they start executing?
Hi,
Regarding the stuck workflows, please answer the following questions so we can help you better:
1- Are you using Worker Versioning with Build IDs? (It helps if you paste the worker options that you use - see the sketch right after these questions for the kind of snippet I mean.)
2- You mentioned that the only event is “WorkflowExecutionStarted” - can you confirm whether you are also seeing a “WorkflowTaskStarted” event, or is “WorkflowExecutionStarted” really the only event?
Please also provide your server version and DB setup. Note that we only support PostgreSQL with sync replication between replicas.
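For reference, in case you happen to be using the Go SDK, Worker Versioning with Build IDs is enabled through the worker options; the snippet below is only a sketch of the kind of configuration I’d like to see (the host/port, task queue name, and Build ID are placeholders, and the field names may differ in other SDKs):

package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Connect to the Temporal frontend (address is a placeholder).
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	// These two options control Build-ID-based Worker Versioning in the Go SDK;
	// if they are unset, versioning is not in play for this worker.
	w := worker.New(c, "queue", worker.Options{
		BuildID:                 "my-build-id", // placeholder Build ID
		UseBuildIDForVersioning: true,
	})

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("unable to start worker", err)
	}
}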
- I’m not using worker versioning (yet).
- “WorkflowExecutionStarted” is the only event I can see. I’m attaching a screenshot from the UI - note that the workflow history hasn’t changed for a couple days.
The server version I’m using is 1.24.2. PostgreSQL isn’t using sync replication.
Did you fail over your PostgreSQL DB, manually or automatically?
I didn’t fail it over at all. It’s not a production setup; I’ve just been testing Temporal by creating increasingly large batches of workflows. The DB just became slow due to high I/O pressure for a certain period of time.
BTW, since this is just a test setup, it doesn’t contain any sensitive information, so I can provide a dump of the database.
The highest metric bucket for latency metrics is 1000s by default. I think that’s why task_schedule_to_start_latency got capped there.
Re. the stuck workflow: a workflow contains only a WorkflowExecutionStarted event if it’s a cron or being retried. Since the attempt is 1 here, it should be a cron workflow.
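(By “cron” I mean a workflow started with a cron schedule. If you’re using the Go SDK, that looks roughly like the minimal sketch below; the client options, IDs, and schedule here are placeholders.)

package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	// For a cron workflow the run is created immediately, but its first
	// workflow task is deferred to the next cron fire time, so until then
	// the history contains only WorkflowExecutionStarted.
	_, err = c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
		ID:           "my-cron-workflow",
		TaskQueue:    "queue",
		CronSchedule: "*/5 * * * *",
	}, "Workflow1")
	if err != nil {
		log.Fatalln("unable to start workflow", err)
	}
}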
Can you share the details of the startEvent? Event details are available in the JSON tab in the web UI. Thanks.
It’s below. I’ve slightly redacted the names/IDs as they were referencing some business-related stuff.
{
  "eventId": "1",
  "eventTime": "2024-06-21T17:25:33.130175516Z",
  "eventType": "WorkflowExecutionStarted",
  "taskId": "66060288",
  "workflowExecutionStartedEventAttributes": {
    "workflowType": {
      "name": "Workflow1"
    },
    "parentWorkflowNamespace": "default",
    "parentWorkflowNamespaceId": "60b4bede-651a-4f68-838b-6dacb5a9c56a",
    "parentWorkflowExecution": {
      "workflowId": "fbe9779c-0228-4f64-920b-cfc504124bad",
      "runId": "cb271c3b-05b4-434d-9113-ea55794dde40"
    },
    "parentInitiatedEventId": "8",
    "taskQueue": {
      "name": "queue",
      "kind": "TASK_QUEUE_KIND_NORMAL"
    },
    "input": {
      "payloads": [
        {
          "workflow_id": "fbe9779c-0228-4f64-920b-cfc504124bad",
          "organization_id": "org-139",
          "service_name": "action",
          "pool": "action",
          "priority": 3
        }
      ]
    },
    "workflowRunTimeout": "0s",
    "workflowTaskTimeout": "10s",
    "originalExecutionRunId": "49c952d5-adf7-4656-b6b1-210daff660e8",
    "firstExecutionRunId": "49c952d5-adf7-4656-b6b1-210daff660e8",
    "attempt": 1,
    "firstWorkflowTaskBackoff": "0s",
    "memo": {},
    "searchAttributes": {},
    "header": {},
    "workflowId": "fbe9779c-0228-4f64-920b-cfc504124bad",
    "rootWorkflowExecution": {
      "workflowId": "864dbf38-24ff-4b50-9853-4f57846a078b",
      "runId": "8a424173-f19c-4a9c-a411-27d2433d1279"
    }
  }
}
Ah I see, it’s a child workflow, not a cron.
The first event suggests that its parent workflow has workflowID fbe9779c-0228-4f64-920b-cfc504124bad and runID cb271c3b-05b4-434d-9113-ea55794dde40.
But from your screenshot, this stuck child workflow also has the same workflowID fbe9779c-0228-4f64-920b-cfc504124bad (its runID 49c952d5-adf7-4656-b6b1-210daff660e8 is different).
Can you confirm that’s correct and that the parent is using the same workflowID when starting the child workflow?
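(If you’re using the Go SDK, I mean the WorkflowID that the parent sets on the child workflow options before starting the child, roughly like the sketch below; the names here are placeholders.)

// Minimal Go SDK sketch (placeholder names): the parent decides the child's
// workflowID via ChildWorkflowOptions before calling ExecuteChildWorkflow.
package parent

import (
	"go.temporal.io/sdk/workflow"
)

func ParentWorkflow(ctx workflow.Context) error {
	info := workflow.GetInfo(ctx)

	cwo := workflow.ChildWorkflowOptions{
		// If this reuses the parent's own workflowID, the parent and child
		// would end up with the same ID, which is what I want to rule out.
		WorkflowID: info.WorkflowExecution.ID,
		TaskQueue:  "queue",
	}
	ctx = workflow.WithChildOptions(ctx, cwo)

	// Wait for the child to complete; "Workflow1" is the child workflow type.
	return workflow.ExecuteChildWorkflow(ctx, "Workflow1").Get(ctx, nil)
}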
Would you mind also providing the history for the parent workflow (starting from eventID 8, which should be the childWorkflowInitiatedEvent)? JSON would be ideal. If that’s too long, the default history view should also work, but I will need the details for any event related to this child (event 8, the initiated event, and, if it exists, the child workflow started event).
I have a guess re. what happened here, but need to check ^ info to confirm my theory.
Thanks,
Yichao
The parent and child workflows actually have different IDs; this is due to my redacting the data before publishing it - sorry for the confusion. The parent workflow ID is fbe9779c-0228-4f64-920b-cfc504124bad and the child workflow ID is (prefix)fbe9779c-0228-4f64-920b-cfc504124bad. I decided to remove (prefix) before posting it publicly as it contained an internal company reference.
I will send the parent history through a PM.
Thanks for PMing me the history.
The parent and child having different workflowIDs means my previous theory was wrong. I can see that there’s no child workflow started event in the parent’s history, which is not supposed to happen, and I have actually never seen this issue before.
You should be able to unblock the workflows by running the following command; please let me know whether it works or not.
tdbg -n <namespace name> workflow refresh-tasks --wid <parent workflowID> --rid <parent runID>
Before running that command though, please also dump the following information.
tdbg -n <namespace name> workflow describe --wid <parent workflowID> --rid <parent runID>
tdbg -n <namespace name> workflow describe --wid <child workflowID> --rid <child runID>
The output of the above describe command should contain a shard ID. Then run
tdbg shard describe --shard-id <parent shardID>
If you still have access to the server logs, please search for Fail to process task around the time of the child workflow initiated events. Look for logs for the parent workflow and the startChildWorkflowTask type.
If you see the following messages for the parent workflow, please also let me know.
Task Processor: task event ID >= MS NextEventID, skip.
Transfer Task Processor: workflow mutable state not found, skip.
Thanks,
Yichao
Sorry for the late reply and thank you for sending me the detailed mutable state and shard info dump.
Those dumps actually look good to me. In this case I think the only reasonable explanation is that the transfer task for starting the child workflow got DLQed after creating the child but before recording childWorkflowStarted in the parent’s history. You mentioned persistence latency was high at that time, so that may have caused the task to keep failing due to timeouts and eventually go to the DLQ. I also verified that the History task DLQ is enabled by default in server version v1.24.2.
We can confirm if that’s actually the case with tdbg. There’s a runbook for that. But here’s the TLDR:
- Run tdbg dlq --dlq-version v2 list and see if there’s any queue name starting with 1_. If so, it means there are some DLQ messages for the transfer queue.
- Get those transfer queue DLQ messages with tdbg dlq --dlq-version v2 read --dlq-type 1 and see if there’s any message for the parent workflow with task_type: TASK_TYPE_TRANSFER_START_CHILD_EXECUTION.
If you did find the message, that explains why the child got stuck, and you can use the merge command to merge the task back; or, if you have already unblocked the child with refresh-tasks, the purge command can delete the DLQ message.
If there’s no such DLQ message, please let me know.
Thanks,
Yichao