Workflow stuck with WorkflowTaskScheduled but no WorkflowTaskStarted

We encountered an issue where a workflow was started but its first workflow task was never picked up by a worker. This caused startUpdate calls to block indefinitely.

Workflow State

The workflow history shows only:

  1. WorkflowExecutionStarted
  2. WorkflowTaskScheduled

Missing: WorkflowTaskStarted - no worker ever picked up the task.

Note: Workers are registered and visible under the workflow’s “Workers” tab in the Temporal UI - they exist and are polling the correct task queue.

The workflow status via DescribeWorkflowExecution returns RUNNING, but it’s effectively a zombie - it can never process commands.

(Depiction of the problem; the termination was done manually by an operator)

Impact

When we call WorkflowStub.startUpdate() with WaitForStage.COMPLETED on this workflow, the call blocks indefinitely. Our calling thread never returns.

Environment

  • Temporal Java SDK: 1.30.1
  • Temporal Server: 1.26.2
  • Using Spring Boot starter

Questions

1. How can this state occur? What scenarios lead to WorkflowTaskScheduled without WorkflowTaskStarted? Is this a known issue or expected under certain conditions (worker unavailability, resource exhaustion, etc.)?

2. How should we detect this? Is there a recommended way to identify “stuck” workflows before calling startUpdate? We considered checking for WorkflowTaskStarted in history, but is there a better approach?
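For now we've sketched a pre-flight check of our own (a hypothetical helper, not an SDK API): fetch the execution's history event types and treat a workflow whose task was scheduled but never started as stuck. A minimal pure-Java version of that check, operating on event type names as they appear in history:

```java
import java.util.List;

class StuckWorkflowCheck {
    // Sketch of a pre-flight check (our own helper, not an SDK API):
    // given the event type names from a workflow's history, report whether
    // a workflow task was scheduled but never started. In practice we would
    // fetch history from the server and also require the schedule event to be
    // older than some threshold, so a task scheduled milliseconds ago (that a
    // worker is about to pick up) isn't flagged as stuck.
    static boolean looksStuck(List<String> eventTypes) {
        return eventTypes.contains("WorkflowTaskScheduled")
                && !eventTypes.contains("WorkflowTaskStarted");
    }
}
```

This matches the history we saw: `[WorkflowExecutionStarted, WorkflowTaskScheduled]` would be flagged, while any history containing `WorkflowTaskStarted` would not.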

3. How should we handle it? Once detected, should we:

  • Terminate and recreate the workflow?

  • Is there a way to “unstuck” it?

  • Should Temporal server handle this automatically?

4. Is there a timeout option for startUpdate? We couldn’t find a way to set a deadline on the gRPC call itself. UpdateOptions doesn’t seem to have a timeout setting.
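As a stopgap we're considering a client-side workaround (a sketch, not an SDK feature): run the blocking call on our own executor and bound it with `Future.get(timeout)`, so our thread gets a `TimeoutException` instead of hanging forever:

```java
import java.util.concurrent.*;

class UpdateWithTimeout {
    // Workaround sketch: bound any blocking call (e.g. stub.startUpdate(...))
    // with a client-side deadline. This only unblocks our calling thread;
    // whatever the server is (or isn't) doing with the update is unaffected.
    static <T> T callWithTimeout(Callable<T> blockingCall, long timeoutSeconds)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(blockingCall);
            return future.get(timeoutSeconds, TimeUnit.SECONDS); // throws TimeoutException
        } finally {
            executor.shutdownNow(); // best-effort interrupt of the stuck call
        }
    }
}
```

This works, but we'd still prefer a first-class deadline option on the update call itself.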

---


I would first check whether a workflow task refresh helps. Via tctl, try:

tctl adm wf rt --namespace_id value --workflow_id value

and see if that helps (i.e., whether WorkflowTaskStarted now appears in history, which would also confirm that workers really are polling workflow tasks on this task queue).

You can get the namespace ID for a specific namespace via tctl:

tctl n desc namespace_name

or via the Temporal CLI:

temporal operator namespace describe --namespace namespace_name

If this unblocks one of these executions, I think we need to start looking at your server metrics and your DB. What persistence store do you use?


What errors, if any, do you get if you try to signal this execution via code or CLI? Just try a bogus signal; the workflow implementation does not have to have a signal handler registered.

Hi @tihomir

Apologies for late reply.

Signalling the workflow returned the following:

Signal workflow succeeded

But the workflow was still in zombie mode after this.

I also tried terminating the workflow using the temporal CLI, but got a nil pointer dereference error.

I then tried terminating using the REST API and it worked. You can see the workflow on the UI here:

Note that prior to terminating the workflow, it was not findable on the UI.

We’re using a self-hosted Temporal Server with Postgres on RDS.

I did some more digging and discovered that when the zombie workflow was first created, we were seeing a lot of errors in our Server History cluster. Those errors painted a common theme:

  • error code: Unavailable
  • error: shard status unknown

Here’s an example:

Persistent store operation Failure

go.temporal.io/server/common/log.(*zapLogger).Error
	/home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:154
go.temporal.io/server/service/history/workflow.createWorkflowExecution
	/home/runner/work/docker-builds/docker-builds/temporal/service/history/workflow/transaction_impl.go:379
go.temporal.io/server/service/history/workflow.(*ContextImpl).CreateWorkflowExecution
	/home/runner/work/docker-builds/docker-builds/temporal/service/history/workflow/context.go:346
go.temporal.io/server/service/history/api/startworkflow.(*Starter).createBrandNew
	/home/runner/work/docker-builds/docker-builds/temporal/service/history/api/startworkflow/api.go:296
go.temporal.io/server/service/history/api/startworkflow.(*Starter).Invoke
	/home/runner/work/docker-builds/docker-builds/temporal/service/history/api/startworkflow/api.go:214
go.temporal.io/server/service/history.(*historyEngineImpl).StartWorkflowExecution
	/home/runner/work/docker-builds/docker-builds/temporal/service/history/history_engine.go:416
go.temporal.io/server/service/history.(*Handler).StartWorkflowExecution
	/home/runner/work/docker-builds/docker-builds/temporal/service/history/handler.go:622
go.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler.func1
	/home/runner/work/docker-builds/docker-builds/temporal/api/historyservice/v1/service_grpc.pb.go:1571
go.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept.func1
	/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/retry.go:62
go.temporal.io/server/common/backoff.ThrottleRetryContext
	/home/runner/work/docker-builds/docker-builds/temporal/common/backoff/retry.go:89
go.temporal.io/server/common/rpc/interceptor.(*RetryableInterceptor).Intercept
	/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/retry.go:66
google.golang.org/grpc.getChainUnaryHandler.func1
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1211
go.temporal.io/server/common/rpc/interceptor.(*RateLimitInterceptor).Intercept
	/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/rate_limit.go:88
google.golang.org/grpc.getChainUnaryHandler.func1
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1211
go.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept
	/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/interceptor/telemetry.go:196
google.golang.org/grpc.getChainUnaryHandler.func1
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1211
go.temporal.io/server/service.GrpcServerOptionsProvider.getUnaryInterceptors.NewServerMetricsTrailerPropagatorInterceptor.func5
	/home/runner/work/docker-builds/docker-builds/temporal/common/metrics/grpc.go:112
google.golang.org/grpc.getChainUnaryHandler.func1
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1211
go.temporal.io/server/service.GrpcServerOptionsProvider.getUnaryInterceptors.NewServerMetricsContextInjectorInterceptor.func4
	/home/runner/work/docker-builds/docker-builds/temporal/common/metrics/grpc.go:65
google.golang.org/grpc.getChainUnaryHandler.func1
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1211
go.temporal.io/server/common/rpc.ServiceErrorInterceptor
	/home/runner/work/docker-builds/docker-builds/temporal/common/rpc/grpc.go:157
google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1202
go.temporal.io/server/api/historyservice/v1._HistoryService_StartWorkflowExecution_Handler
	/home/runner/work/docker-builds/docker-builds/temporal/api/historyservice/v1/service_grpc.pb.go:1573
google.golang.org/grpc.(*Server).processUnaryRPC
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1394
google.golang.org/grpc.(*Server).handleStream
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1805
google.golang.org/grpc.(*Server).serveStreams.func2.1
	/home/runner/go/pkg/mod/google.golang.org/grpc@v1.67.1/server.go:1029

I correlated this with our RDS metrics: its CPU was sitting at 99.5% at the time. While this is definitely a contributing factor, I don’t understand why it would result in a zombie workflow, as it suggests some kind of partial failure rather than an all-or-nothing outcome.

We’re looking into scaling up our DB, but we’d still like to understand why we end up in this inconsistent zombie state, and whether this is a bug in our code or the Temporal SDK/server.

Were you able to unblock it with the mentioned commands? If not, then given your latest findings, I think this might be data loss: you had shard movement that was also caused by the DB issues.

I was able to resolve it by terminating the workflow using the REST API.

How is it that data loss can occur? Is there any way to prevent it?

I would focus on your persistence errors around the time the execution was created, especially on the history and matching services.
If you can share those graphs, I could take a look and see whether anything specific stands out.

Another thing to maybe rule out is the use of worker versioning. If, for example, you started the execution for some build ID and didn’t have workers for it, that could lead to a similar situation.

Given that you are still able to interact with this execution, I’d try the workflow task refresh (I think you terminated instead). If the refresh does not help, and you can fully rule out workers having very high schedule-to-start latencies or being unable to drain their backlog for a long time, and you can rule out worker versioning being part of the equation, then I think we are looking at a potential task-loss issue, where the persistence errors you saw would be a contributing factor.