We encountered an issue where a workflow was started but its first workflow task was never picked up by a worker. This caused startUpdate calls to block indefinitely.
Workflow State
The workflow history shows only:
WorkflowExecutionStarted
WorkflowTaskScheduled
Missing: WorkflowTaskStarted - no worker ever picked up the task.
Note: Workers are registered and visible under the workflow’s “Workers” tab in the Temporal UI - they exist and are polling the correct task queue.
The workflow status via DescribeWorkflowExecution returns RUNNING, but it’s effectively a zombie - it can never process commands.
(Depiction of the problem; the termination was done manually by an operator)
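For reference, the "is WorkflowTaskStarted missing" check we are experimenting with can be sketched like this (a minimal sketch with a hypothetical `looksStuck` helper; in the real check the event-type names would come from the events returned by `GetWorkflowExecutionHistory`):

```java
import java.util.List;

public class StuckWorkflowCheck {
    // A workflow looks "stuck" when a workflow task was scheduled
    // but no worker ever started it.
    static boolean looksStuck(List<String> eventTypes) {
        return eventTypes.contains("WorkflowTaskScheduled")
                && !eventTypes.contains("WorkflowTaskStarted");
    }

    public static void main(String[] args) {
        // The exact history we observed on the zombie workflow:
        List<String> zombie = List.of(
                "WorkflowExecutionStarted", "WorkflowTaskScheduled");
        System.out.println(looksStuck(zombie)); // prints "true"
    }
}
```

One caveat: a freshly started, healthy workflow also briefly sits in this state, so the check is only meaningful once the task has been scheduled for much longer than the normal schedule-to-start latency.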
Impact
When we call WorkflowStub.startUpdate() with WaitForStage.COMPLETED on this workflow, the call blocks indefinitely. Our calling thread never returns.
Environment
Temporal Java SDK: 1.30.1
Temporal Server: 1.26.2
Using Spring Boot starter
Questions
1. How can this state occur? What scenarios lead to WorkflowTaskScheduled without WorkflowTaskStarted? Is this a known issue or expected under certain conditions (worker unavailability, resource exhaustion, etc.)?
2. How should we detect this? Is there a recommended way to identify “stuck” workflows before calling startUpdate? We considered checking for WorkflowTaskStarted in history, but is there a better approach?
3. How should we handle it? Once detected, should we:
- Terminate and recreate the workflow?
- Is there a way to "unstick" it?
- Should the Temporal server handle this automatically?
4. Is there a timeout option for startUpdate? We couldn't find a way to set a deadline on the gRPC call itself, and UpdateOptions doesn't seem to have a timeout setting.
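On question 4, the workaround we are experimenting with is to impose our own deadline around the blocking call (a sketch; `blockingStartUpdate` below is a stand-in for the real `WorkflowStub.startUpdate` call, which in our case never returns):

```java
import java.util.concurrent.*;

public class UpdateWithDeadline {
    // Stand-in for a startUpdate call that blocks forever.
    static String blockingStartUpdate() throws InterruptedException {
        Thread.sleep(10_000);
        return "done";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> result = pool.submit(UpdateWithDeadline::blockingStartUpdate);
        try {
            // Fail fast instead of blocking the caller indefinitely.
            String value = result.get(200, TimeUnit.MILLISECONDS);
            System.out.println("update completed: " + value);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the blocked thread
            System.out.println("update timed out");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note this only bounds how long the caller waits; the update may still be delivered server-side later. If your SDK version's update handle exposes an async result future, applying a timeout to that future may be a cleaner option.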
If this unblocks one of these executions, I think we need to start looking at your server metrics and your DB. What persistence store do you use?
> When we call WorkflowStub.startUpdate() with WaitForStage.COMPLETED on this workflow, the call blocks indefinitely. Our calling thread never returns.
What errors, if any, do you get if you try to signal this execution via code or the CLI? Just try a bogus signal; the workflow implementation does not need a signal handler registered for it.
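For example, with the temporal CLI (the address, workflow ID, and signal name below are placeholders):

```shell
# Send a bogus signal to the stuck execution. The workflow does not need
# a handler for it; any server-side error will surface on this call.
temporal workflow signal \
  --address "my-temporal-frontend:7233" \
  --workflow-id "my-stuck-workflow-id" \
  --name "bogus-signal" \
  --input '"ping"'
```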
I did some more digging and discovered that when the zombie workflow was first created, we were seeing a lot of errors in our server's History cluster, and those errors painted a common theme.
I correlated this with our RDS metrics: its CPU was sitting at 99.5% at the time. While this is definitely a contributing factor, I don't understand why it would result in a zombie workflow; it suggests some kind of partial failure rather than an all-or-nothing failure.
We’re looking into scaling up our DB, but we’d still like to understand why we end up in this inconsistent zombie state, and whether this is a bug in our code or the Temporal SDK/server.
Were you able to unblock it with the mentioned commands? If not, then given your latest finding, I think this might be data loss: you had shard movement that was also caused by the DB issues.
I would focus on your persistence errors around the time the execution was created, especially on the History and Matching services. If you can share those graphs, I could take a look and see if anything specific stands out.
Another thing to rule out is the use of Worker Versioning. If, for example, you started the execution for some Build ID and didn't have workers for that Build ID, that could lead to a similar situation.
Given that you are still able to interact with this execution, I'd try the workflow refresh (I think you did a terminate instead). If the refresh does not help, and you can fully rule out workers having very high schedule-to-start latencies or being unable to drain their backlog for a long time, and you can rule out Worker Versioning being part of the equation, then I think we are looking at a potential task-loss issue, where the persistence errors you found can help explain it.