We encountered an issue where a workflow was started but its first workflow task was never picked up by a worker. This caused startUpdate calls to block indefinitely.
Workflow State
The workflow history shows only:
- WorkflowExecutionStarted
- WorkflowTaskScheduled
Missing: WorkflowTaskStarted - no worker ever picked up the task.
Note: Workers are registered and visible under the workflow’s “Workers” tab in the Temporal UI - they exist and are polling the correct task queue.
The workflow status via DescribeWorkflowExecution returns RUNNING, but it’s effectively a zombie - it can never process commands.
(Depiction of the problem; the termination was done manually by an operator)
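The history check described above could look roughly like the sketch below. It is only an illustration: `StuckWorkflowCheck` and `firstTaskNeverStarted` are hypothetical names, not part of the Temporal SDK, and the sketch assumes the caller has already fetched the history and reduced it to a list of event type names spelled as they appear in this post.

```java
import java.util.List;

// Hypothetical helper, not part of the Temporal SDK.
final class StuckWorkflowCheck {

    // Returns true when a workflow task was scheduled but no task was ever started.
    // eventTypes: event type names in history order, assumed to be fetched by the caller.
    static boolean firstTaskNeverStarted(List<String> eventTypes) {
        boolean scheduled = eventTypes.contains("WorkflowTaskScheduled");
        boolean started = eventTypes.contains("WorkflowTaskStarted");
        return scheduled && !started;
    }

    public static void main(String[] args) {
        // The two events observed in the stuck workflow's history.
        List<String> observed = List.of("WorkflowExecutionStarted", "WorkflowTaskScheduled");
        System.out.println(firstTaskNeverStarted(observed)); // prints "true"
    }
}
```

A workflow whose first task was scheduled only a moment ago looks identical to a stuck one, so a check like this would presumably need to be combined with the age of the WorkflowTaskScheduled event.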
Impact
When we call WorkflowStub.startUpdate() with WaitForStage.COMPLETED on this workflow, the call blocks indefinitely. Our calling thread never returns.
Environment
- Temporal Java SDK: 1.30.1
- Temporal Server: 1.26.2
- Using Spring Boot starter
Questions
1. How can this state occur? What scenarios lead to WorkflowTaskScheduled without WorkflowTaskStarted? Is this a known issue or expected under certain conditions (worker unavailability, resource exhaustion, etc.)?
2. How should we detect this? Is there a recommended way to identify “stuck” workflows before calling startUpdate? We considered checking for WorkflowTaskStarted in history, but is there a better approach?
3. How should we handle it? Once detected, should we:
- Terminate and recreate the workflow?
- Is there a way to “unstuck” it?
- Should Temporal server handle this automatically?
4. Is there a timeout option for startUpdate? We couldn’t find a way to set a deadline on the gRPC call itself. UpdateOptions doesn’t seem to have a timeout setting.
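Regarding question 4, one client-side workaround is to run the blocking call on another thread and wait for it with a deadline. The sketch below uses only the JDK. `BoundedCall` and `callWithDeadline` are hypothetical names, and the `Supplier` is a stand-in for the blocking `startUpdate` call, so nothing here is a Temporal API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper, not part of the Temporal SDK.
final class BoundedCall {

    // Runs blockingCall on another thread and waits at most timeoutMillis for its result.
    static <T> T callWithDeadline(Supplier<T> blockingCall, long timeoutMillis)
            throws TimeoutException {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(blockingCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; the thread running the call is NOT interrupted.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // The call itself failed; surface its cause as an unchecked exception.
            throw new IllegalStateException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Sleeps without forcing callers to handle InterruptedException.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        try {
            // Stand-in for the blocking update call: sleeps longer than the deadline.
            callWithDeadline(() -> { sleepQuietly(1_000); return "update result"; }, 100);
        } catch (TimeoutException e) {
            System.out.println("update did not complete within the deadline");
        }
    }
}
```

This only frees the calling thread. The thread running the underlying call stays blocked until that call returns, so a dedicated executor would be safer than the common pool used here. It does not answer whether the SDK offers a built-in deadline.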
