Workflow stuck with WorkflowTaskScheduled but no WorkflowTaskStarted

We encountered an issue where a workflow was started but its first workflow task was never picked up by a worker. This caused startUpdate calls to block indefinitely.

Workflow State

The workflow history shows only:

  1. WorkflowExecutionStarted
  2. WorkflowTaskScheduled

Missing: WorkflowTaskStarted - no worker ever picked up the task.

Note: Workers are registered and visible under the workflow’s “Workers” tab in the Temporal UI - they exist and are polling the correct task queue.

The workflow status via DescribeWorkflowExecution returns RUNNING, but it’s effectively a zombie - it can never process commands.

(Depiction of the problem; the termination was done manually by an operator)

Impact

When we call WorkflowStub.startUpdate() with WaitForStage.COMPLETED on this workflow, the call blocks indefinitely. Our calling thread never returns.
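As a stopgap on our side, we could bound how long the calling thread waits by running the blocking call on a separate executor. The sketch below uses only the JDK; `BoundedCall` and `callWithDeadline` are hypothetical names, not part of the Temporal SDK, and it is an assumption that the underlying call reacts to thread interruption at all. It only frees the caller and does not set a deadline on the gRPC call itself.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Temporal SDK.
final class BoundedCall {
    private BoundedCall() {}

    /**
     * Runs blockingCall on the given executor and waits at most `timeout` for its result.
     * On timeout the task is cancelled (with interruption) and TimeoutException is rethrown.
     */
    static <T> T callWithDeadline(ExecutorService executor, Callable<T> blockingCall, Duration timeout)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = executor.submit(blockingCall);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Assumption: the wrapped call may or may not respond to interruption;
            // either way the caller's thread is released here.
            future.cancel(true);
            throw e;
        }
    }
}
```

In our case the `Callable` would wrap the existing `WorkflowStub.startUpdate()` call, and the caller would treat `TimeoutException` as "workflow possibly stuck" rather than as an update failure.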

Environment

  • Temporal Java SDK: 1.30.1
  • Temporal Server: 1.26.2
  • Using Spring Boot starter

Questions

1. How can this state occur? What scenarios lead to WorkflowTaskScheduled without WorkflowTaskStarted? Is this a known issue or expected under certain conditions (worker unavailability, resource exhaustion, etc.)?

2. How should we detect this? Is there a recommended way to identify “stuck” workflows before calling startUpdate? We considered checking for WorkflowTaskStarted in history, but is there a better approach?

3. How should we handle it? Once detected, what are our options?

  • Terminate and recreate the workflow?

  • Is there a way to “unstuck” it?

  • Should Temporal server handle this automatically?

4. Is there a timeout option for startUpdate? We couldn’t find a way to set a deadline on the gRPC call itself. UpdateOptions doesn’t seem to have a timeout setting.
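For reference on question 2, the history check we considered can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `StuckWorkflowTaskCheck` and its `Event` type are hypothetical stand-ins for whatever is read from the workflow history, the event type names are the ones shown in the UI, and the age threshold is arbitrary. Fetching the history from the server is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical helper; Event is a simplified stand-in for a real history event.
final class StuckWorkflowTaskCheck {
    private StuckWorkflowTaskCheck() {}

    /** Minimal view of one history event: its type name and timestamp. */
    static final class Event {
        final String eventType;
        final Instant eventTime;

        Event(String eventType, Instant eventTime) {
            this.eventType = eventType;
            this.eventTime = eventTime;
        }
    }

    /**
     * Returns true when the last event in the history is WorkflowTaskScheduled
     * (so no WorkflowTaskStarted followed it) and it is older than maxWait.
     */
    static boolean hasUnstartedWorkflowTask(List<Event> history, Instant now, Duration maxWait) {
        if (history.isEmpty()) {
            return false;
        }
        Event last = history.get(history.size() - 1);
        if (!"WorkflowTaskScheduled".equals(last.eventType)) {
            return false;
        }
        // Only flag tasks that have been waiting longer than the allowed threshold.
        return Duration.between(last.eventTime, now).compareTo(maxWait) > 0;
    }
}
```

The threshold matters because a freshly scheduled task that a worker is about to pick up looks identical in history to a stuck one.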

---


I would first check whether refreshing the workflow's tasks helps. With tctl, try:
tctl adm wf rt --namespace_id value --workflow_id value

and see if that helps, i.e. whether a WorkflowTaskStarted event shows up (it should, if there really are workers polling workflow tasks on this task queue).

You can get the namespace ID for a specific namespace via tctl:

tctl n desc namespace_name

or via the Temporal CLI:

temporal operator namespace describe --namespace namespace_name

If this unblocks one of these executions, I think we need to start looking at your server metrics and your database. Which persistence store do you use?

When we call WorkflowStub.startUpdate() with WaitForStage.COMPLETED on this workflow, the call blocks indefinitely. Our calling thread never returns.

What errors, if any, do you get if you try to signal this execution via code or the CLI? Just try a bogus signal; the workflow implementation does not need to have a signal handler registered.