Over the past few weeks, we’ve noticed that a handful of our Workflows, all running on the same task queue, have had their first WorkflowTask scheduled but never started. We’ve seen it happen five times in one of our namespaces: three times on one particular day, each about two hours apart, then once three days later, and once more a day after that.
Other workflows on the same task queue have run successfully both around the time these workflows got stuck and afterwards, and there is a worker listening to that task queue. Can someone explain why this small group is not making progress?
Here is some relevant output from the temporal CLI, showing one of the stuck workflows and info about the task queue (with some minor edits to the names of things):
temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal workflow show -w ffdf6b5b-aee9-35da-81d2-4b4509384c1b
Progress:
ID Time Type
1 2024-05-20T06:08:20Z WorkflowExecutionStarted
2 2024-05-20T06:08:20Z WorkflowTaskScheduled
3 2024-05-20T07:04:30Z WorkflowExecutionCancelRequested
Result:
temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal workflow describe -w ffdf6b5b-aee9-35da-81d2-4b4509384c1b
{
"executionConfig": {
"taskQueue": {
"name": "yyy.zzz.task.queue",
"kind": "Normal"
},
"workflowExecutionTimeout": "0s",
"workflowRunTimeout": "0s",
"defaultWorkflowTaskTimeout": "10s"
},
"workflowExecutionInfo": {
"execution": {
"workflowId": "ffdf6b5b-aee9-35da-81d2-4b4509384c1b",
"runId": "3951299c-9ec4-47ad-bbb3-927d23754933"
},
"type": {
"name": "GetYYYWorkflow"
},
"startTime": "2024-05-20T06:08:20.176719278Z",
"status": "Running",
"historyLength": "3",
"parentNamespaceId": "4a379ce6-7259-4e1d-80c2-bbadbf6bb725",
"parentExecution": {
"workflowId": "YYY:j2wSQHCPIy/6ajRhDOoAKuJoq9Qdg59kVTrVAT8SMyg=",
"runId": "41587bf8-7eb0-4e57-a066-9d14c4bd0004"
},
"executionTime": "2024-05-20T06:08:20.176719278Z",
"memo": {
},
"autoResetPoints": {
},
"stateTransitionCount": "3",
"historySizeBytes": "12062"
},
"pendingWorkflowTask": {
"state": "Scheduled",
"scheduledTime": "2024-05-20T06:08:20.200742918Z",
"originalScheduledTime": "2024-05-20T06:08:20.200742661Z",
"attempt": 1
}
}
temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal task-queue describe -t "yyy.zzz.task.queue"
Identity LastAccessTime RatePerSecond
1@my-worker-zzzzzzzzzzzz-767486cfc-9rc5z 25 seconds ago 100000
temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal task-queue get-build-ids -t "yyy.zzz.task.queue"
BuildIds DefaultForSet IsDefaultSet
[my-service-1.973.0 my-service-1.985.0 true
my-service-1.977.0
my-service-1.985.0]
temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal task-queue get-build-id-reachability -t "yyy.zzz.task.queue" --build-id my-service-1.985.0
BuildId TaskQueue Reachability
my-service-1.985.0 yyy.zzz.task.queue [NewWorkflows
ExistingWorkflows]
As a point of comparison, here is the describe output for an execution of the same workflow type, on the same task queue, several days later:
temporal@temporal-frontend-65b4df7b7b-ccfdp:/etc/temporal$ temporal workflow describe -w 7e74a0dc-25cc-314e-aa2c-4f741718f074
{
"executionConfig": {
"taskQueue": {
"name": "yyy.zzz.task.queue",
"kind": "Normal"
},
"workflowExecutionTimeout": "0s",
"workflowRunTimeout": "0s",
"defaultWorkflowTaskTimeout": "10s"
},
"workflowExecutionInfo": {
"execution": {
"workflowId": "7e74a0dc-25cc-314e-aa2c-4f741718f074",
"runId": "d2df027d-e3f5-4941-9fb5-f621eaf15421"
},
"type": {
"name": "GetYYYWorkflow"
},
"startTime": "2024-05-28T11:12:54.087547236Z",
"closeTime": "2024-05-28T11:12:54.580375549Z",
"status": "Completed",
"historyLength": "15",
"parentNamespaceId": "4a379ce6-7259-4e1d-80c2-bbadbf6bb725",
"parentExecution": {
"workflowId": "YYY:f26lz90mlkPuaTXfZebfzsv/fqw++dNc8Lz5ryq00kQ=",
"runId": "37165d3d-4e25-4f70-8e8e-1af8f984258e"
},
"executionTime": "2024-05-28T11:12:54.087547236Z",
"memo": {
},
"searchAttributes": {
"indexedFields": {
"BuildIds": "[\"versioned:my-service-1.977.0\"]"
}
},
"autoResetPoints": {
"points": [
{
"binaryChecksum": "my-service-1.977.0",
"runId": "d2df027d-e3f5-4941-9fb5-f621eaf15421",
"firstWorkflowTaskCompletedId": "4",
"createTime": "2024-05-28T11:12:54.224141722Z",
"resettable": true
}
]
},
"stateTransitionCount": "8",
"historySizeBytes": "27301",
"mostRecentWorkerVersionStamp": {
"buildId": "my-service-1.977.0",
"useVersioning": true
}
}
}
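To look for other affected executions, we could scan saved describe output for a workflow task that has sat in the Scheduled state for too long. Below is a minimal Python sketch, assuming the JSON shape shown in the two outputs above. The function names and the five-minute threshold are our own choices for illustration, not anything provided by Temporal:

```python
import json
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a UTC timestamp like 2024-05-20T06:08:20.200742918Z.

    The fractional part may have up to nine digits, so it is truncated
    to microseconds before building the datetime.
    """
    body = ts.rstrip("Z")
    if "." in body:
        whole, frac = body.split(".", 1)
        micros = int(frac[:6].ljust(6, "0"))
    else:
        whole, micros = body, 0
    base = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def stuck_workflow_task(describe: dict, now: datetime,
                        threshold: timedelta = timedelta(minutes=5)):
    """Return how long the workflow task has been waiting, or None.

    A task counts as stuck here if it is still in the "Scheduled" state
    and was scheduled more than `threshold` ago (hypothetical cutoff).
    """
    pending = describe.get("pendingWorkflowTask")
    if not pending or pending.get("state") != "Scheduled":
        return None
    waited = now - parse_ts(pending["scheduledTime"])
    return waited if waited > threshold else None


if __name__ == "__main__":
    # Trimmed-down copy of the stuck execution's describe output.
    sample = json.loads("""
    {
      "workflowExecutionInfo": {"status": "Running"},
      "pendingWorkflowTask": {
        "state": "Scheduled",
        "scheduledTime": "2024-05-20T06:08:20.200742918Z",
        "attempt": 1
      }
    }
    """)
    # Compare against the time the cancel request was recorded.
    now = parse_ts("2024-05-20T07:04:30Z")
    print(stuck_workflow_task(sample, now))
```

Run against the stuck execution above, this reports a wait of roughly 56 minutes between the task being scheduled and the cancel request. The completed execution has no pendingWorkflowTask, so it is skipped.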
We’re using Postgres for persistence, and running Temporal server 1.21.5. Thanks!